
Grafana and LGTM Stack Skill


Overview


The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:
  • Loki: Log aggregation and querying (LogQL)
  • Grafana: Visualization, dashboarding, alerting, and exploration
  • Tempo: Distributed tracing (TraceQL)
  • Mimir: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

When to Use This Skill


Primary Use Cases


  • Creating or modifying Grafana dashboards
  • Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
  • Writing queries (PromQL, LogQL, TraceQL)
  • Configuring data sources (Prometheus, Loki, Tempo, Mimir)
  • Setting up alerting rules and notification policies
  • Implementing dashboard variables and templates
  • Dashboard provisioning and GitOps workflows
  • Troubleshooting observability queries
  • Analyzing application performance, errors, or system behavior

Who Uses This Skill


  • senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
  • software-engineer: Application dashboards, service metrics visualization

LGTM Stack Components


Loki - Log Aggregation


Architecture - Loki


Horizontally scalable log aggregation inspired by Prometheus
  • Indexes only metadata (labels), not log content
  • Cost-effective storage with object stores (S3, GCS, etc.)
  • LogQL query language similar to PromQL

Key Concepts - Loki


  • Labels for indexing (low cardinality)
  • Log streams identified by unique label sets
  • Parsers: logfmt, JSON, regex, pattern
  • Line filters and label filters
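These concepts compose into a single query pipeline: the stream selector hits the label index, line filters narrow candidates cheaply, and a parser promotes fields for label filtering. A typical combination (assuming JSON-formatted application logs with a numeric status field):

```logql
{namespace="production", app="api"} |= "error" | json | status >= 500
```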

Grafana - Visualization


Features


  • Multi-datasource dashboarding
  • Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
  • Templating and variables for dynamic dashboards
  • Alerting (unified alerting with contact points and notification policies)
  • Dashboard provisioning and GitOps integration
  • Role-based access control (RBAC)
  • Explore mode for ad-hoc queries
  • Annotations for event markers
  • Dashboard folders and organization

Tempo - Distributed Tracing


Architecture - Tempo


Scalable distributed tracing backend
  • Cost-effective trace storage
  • TraceQL for trace querying
  • Integration with logs and metrics (trace-to-logs, trace-to-metrics)
  • OpenTelemetry compatible
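TraceQL is referenced above but not shown here; as a sketch, queries select spans with attribute and intrinsic filters (the attribute names below assume OpenTelemetry semantic conventions):

```traceql
{ resource.service.name = "api" && span.http.status_code >= 500 }
{ resource.service.name = "api" && duration > 2s }
```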

Mimir - Metrics Storage


Architecture - Mimir


Horizontally scalable long-term Prometheus storage
  • Multi-tenancy support
  • Query federation
  • High availability
  • Prometheus remote_write compatible
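An existing Prometheus can ship metrics to Mimir by pointing remote_write at Mimir's push endpoint. A minimal sketch, assuming a default single-binary Mimir listening on port 8080 (adjust the URL and tenant header for your deployment):

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: http://mimir:8080/api/v1/push
    headers:
      # tenant ID; required when Mimir multi-tenancy is enabled
      X-Scope-OrgID: team-platform
```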

Dashboard Design and Best Practices


Dashboard Organization Principles


  1. Hierarchy: Overview -> Service -> Component -> Deep Dive
  2. Golden Signals: Latency, Traffic, Errors, Saturation (RED/USE method)
  3. Variable-driven: Use templates for flexibility across environments
  4. Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
  5. Performance: Limit queries, use query caching, appropriate time intervals
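For an HTTP service exposing the http_requests_total counter and http_request_duration_seconds histogram (the metric names used throughout this skill), the RED signals translate to queries like:

```promql
# Rate: requests per second
sum(rate(http_requests_total{app="$app"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{app="$app", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{app="$app"}[5m]))

# Duration: P95 latency in seconds
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="$app"}[5m])) by (le))
```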

Panel Types and When to Use Them


Panel Type | Use Case | Best For
--- | --- | ---
Time Series / Graph | Trends over time | Request rates, latency, resource usage
Stat | Single metric value | Error rates, current values, percentage
Gauge | Progress toward limit | CPU usage, memory, disk space
Bar Gauge | Comparative values | Top N items, distribution
Table | Structured data | Service lists, error details, resource inventory
Pie Chart | Proportions | Traffic distribution, error breakdown
Heatmap | Distribution over time | Latency percentiles, request patterns
Logs | Log streams | Error investigation, debugging
Traces | Distributed tracing | Performance analysis, dependency mapping

Panel Configuration Best Practices


Titles and Descriptions


  • Clear, descriptive titles: Include units and metric context
  • Tooltips: Add description fields for panel documentation
  • Examples:
    • Good: "P95 Latency (seconds) by Endpoint"
    • Bad: "Latency"

Legends and Labels


  • Show legends only when needed (multiple series)
  • Use {{label}} format for dynamic legend names
  • Place legends appropriately (bottom, right, or hidden)
  • Sort by value when showing Top N

Axes and Units


  • Always label axes with units
  • Use appropriate unit formats (seconds, bytes, percent, requests/sec)
  • Set reasonable min/max ranges to avoid misleading scales
  • Use logarithmic scales for wide value ranges

Thresholds and Colors


  • Use thresholds for visual cues (green/yellow/red)
  • Standard threshold pattern:
    • Green: Normal operation
    • Yellow: Warning (action may be needed)
    • Red: Critical (immediate attention required)
  • Examples:
    • Error rate: 0% (green), 1% (yellow), 5% (red)
    • P95 latency: <1s (green), 1-3s (yellow), >3s (red)
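In dashboard JSON these thresholds live under a panel's fieldConfig. A sketch of the error-rate example above (verify field names against your Grafana version's schema):

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    }
  }
}
```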

Links and Drilldowns


  • Link panels to related dashboards
  • Use data links for context (logs, traces, related services)
  • Create drill-down paths: Overview -> Service -> Component -> Details
  • Link to runbooks for alert panels
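Data links are defined per field and can interpolate series labels and the current time range. A hedged sketch (the target dashboard UID app-details is a placeholder):

```json
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "App details",
          "url": "/d/app-details?var-app=${__field.labels.app}&${__url_time_range}"
        }
      ]
    }
  }
}
```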

Dashboard Variables and Templating


Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

Variable Types


Type | Purpose | Example
--- | --- | ---
Query | Populate from data source | Namespaces, services, pods
Custom | Static list of options | Environments (prod/staging/dev)
Interval | Time interval selection | Auto-adjusted query intervals
Datasource | Switch between data sources | Multiple Prometheus instances
Constant | Hidden values for queries | Cluster name, region
Text box | Free-form input | Custom filters

Common Variable Patterns


json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}

Variable Usage in Queries


Variables are referenced with $variable_name or ${variable_name} syntax:

Simple variable reference

rate(http_requests_total{namespace="$namespace"}[5m])

Multi-select with regex match

rate(http_requests_total{namespace=~"$namespace"}[5m])

Variable in legend

sum by (method) (rate(http_requests_total{app="$app"}[5m]))

Legend format: "{{method}}"


Using interval variable for adaptive queries

rate(http_requests_total[$__interval])

Chained variables (app depends on namespace)

rate(http_requests_total{namespace="$namespace", app="$app"}[5m])

Advanced Variable Techniques


Regex filtering:
json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
All option with custom value:
json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
Dependent variables (variable chain):
  1. $datasource (datasource type)
  2. $cluster (query: depends on datasource)
  3. $namespace (query: depends on cluster)
  4. $app (query: depends on namespace)
  5. $pod (query: depends on app)

Annotations


Annotations display events as vertical markers on time series panels:
json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}

Dashboard Performance Optimization


Query Optimization


  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time ranges (avoid queries over months)
  • Leverage $__interval for adaptive sampling
  • Avoid high-cardinality grouping (too many series)
  • Use query caching when available
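High-cardinality grouping is the most common cause of slow panels; capping results with topk keeps both the query and the legend manageable. For example:

```promql
# Unbounded: one series per pod/path combination
sum(rate(http_requests_total[5m])) by (pod, path)

# Bounded: only the ten busiest paths, at adaptive resolution
topk(10, sum(rate(http_requests_total[$__interval])) by (path))
```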

Panel Performance


  • Set max data points to reasonable values
  • Use instant queries for current-state panels
  • Combine related metrics into single queries when possible
  • Disable auto-refresh on heavy dashboards
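The first two bullets map to per-panel JSON fields. A sketch (field names from the dashboard JSON schema; instant applies to Prometheus-style targets):

```json
{
  "maxDataPoints": 300,
  "interval": "1m",
  "targets": [
    { "refId": "A", "expr": "sum(up)", "instant": true }
  ]
}
```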

Dashboard as Code and Provisioning


Dashboard Provisioning


Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

Provisioning Provider Configuration


File: /etc/grafana/provisioning/dashboards/dashboards.yaml
yaml
apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: "application"
    orgId: 1
    folder: "Applications"
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: "infrastructure"
    orgId: 1
    folder: "Infrastructure"
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure

Dashboard JSON Structure


Complete dashboard JSON with metadata and provisioning:
json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}

Kubernetes ConfigMap Provisioning


yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }

Grafana Operator (CRD)


yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }

Data Source Provisioning


Loki Data Source


File: /etc/grafana/provisioning/datasources/loki.yaml
yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false

Tempo Data Source


File: /etc/grafana/provisioning/datasources/tempo.yaml
yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false

Mimir/Prometheus Data Source


File: /etc/grafana/provisioning/datasources/mimir.yaml
yaml
apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false

Alerting


Alert Rule Configuration


Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.

Prometheus/Mimir Alert Rule


File: /etc/grafana/provisioning/alerting/rules.yaml
yaml
apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning

Loki Alert Rule


yaml
apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true

Contact Points and Notification Policies


File: /etc/grafana/provisioning/alerting/contactpoints.yaml
yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h

LogQL Query Patterns


Basic Log Queries


Stream Selection



Simple label matching

{namespace="production", app="api"}

Regex matching

{app=~"api|web|worker"}

Not equal

{env!="staging"}

Multiple conditions

{namespace="production", app="api", level!="debug"}

Line Filters



Contains

{app="api"} |= "error"

Does not contain

{app="api"} != "debug"

Regex match

{app="api"} |~ "error|exception|fatal"

Case insensitive

{app="api"} |~ "(?i)error"

Chaining filters

{app="api"} |= "error" != "timeout"

Parsing and Extraction

JSON Parsing

```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields (renaming them as they are extracted)
{app="api"} | json message="msg", level="severity"

# Filter on an extracted field
{app="api"} | json | level="error"

# Nested JSON fields are flattened with underscores by the json parser
{app="api"} | json | line_format "{{.response_status}}"
```
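For intuition, the `| json | level="error"` pipeline above behaves roughly like this Python sketch (the field names and sample lines are made up):

```python
# Illustrative approximation of Loki's json parser + label filter stage.
import json

def filter_error_lines(lines: list[str]) -> list[dict]:
    """Parse each JSON log line and keep only level == "error" entries."""
    out = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Loki would keep the line and attach a __error__ label instead
            continue
        if record.get("level") == "error":
            out.append(record)
    return out

logs = [
    '{"level": "info", "msg": "request served"}',
    '{"level": "error", "msg": "upstream timeout"}',
    'not json at all',
]
errors = filter_error_lines(logs)
```

Unlike this sketch, Loki never drops unparseable lines; it labels them with `__error__` so they can be filtered explicitly.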

Logfmt Parsing

```logql
# Parse logfmt (key=value) lines
{app="api"} | logfmt

# Extract only specific fields
{app="api"} | logfmt level, caller, msg

# Filter on parsed fields
{app="api"} | logfmt | level="error"
```
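What `| logfmt` extracts per line can be approximated with a small parser. This sketch handles plain and double-quoted values only, not every edge case of real logfmt:

```python
# Minimal logfmt parser sketch; the sample line is illustrative.
import shlex

def parse_logfmt(line: str) -> dict:
    """Split a logfmt line into key/value pairs, honoring double quotes."""
    fields = {}
    for token in shlex.split(line):  # shlex strips the surrounding quotes
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

fields = parse_logfmt('level=error caller=api.go:42 msg="upstream timeout" duration=1.5')
```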

Pattern Parsing

```logql
# Extract fields with a pattern expression (pattern literals use backticks)
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Extract from a structured message
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```

Aggregations and Metrics

Count Queries

```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Per-second rate of logs
rate({app="api"}[5m])

# Errors per second, by namespace
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m])) / sum(rate({app="api"}[5m]))
```
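The error-ratio query divides two `rate()` results over the same window, so the window length cancels out. A worked example with hypothetical counts:

```python
# Worked arithmetic for the error-ratio query; counts are assumptions.
WINDOW_SECONDS = 300  # the [5m] range

error_lines = 12      # lines matching |= "error" in the window
total_lines = 2_400   # all lines in the window

error_rate = error_lines / WINDOW_SECONDS  # per-second error rate
total_rate = total_lines / WINDOW_SECONDS  # per-second log rate
error_ratio = error_rate / total_rate      # window cancels: == error_lines / total_lines
```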

Extracted Metrics

```logql
# Average duration by endpoint
avg_over_time({app="api"} | logfmt | unwrap duration [5m]) by (endpoint)

# P95 latency by method (regexp literals use backticks)
quantile_over_time(0.95,
  {app="api"} | regexp `duration=(?P<duration>[0-9.]+)ms` | unwrap duration [5m]
) by (method)

# Top 10 error messages over the last hour
topk(10, sum by (msg) (count_over_time({app="api"} | json | level="error" [1h])))
```
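`quantile_over_time` computes a quantile over the unwrapped samples in each window using linear interpolation between ranked samples. A small Python sketch with made-up duration values:

```python
# Linear-interpolation quantile over raw samples, as a sketch of what
# quantile_over_time(0.95, ... | unwrap duration [5m]) evaluates per window.
def quantile(q: float, samples: list[float]) -> float:
    """Quantile of raw samples with linear interpolation between ranks."""
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

durations_ms = [12.0, 15.0, 14.0, 250.0, 13.0, 16.0, 11.0, 18.0, 17.0, 900.0]
p95 = quantile(0.95, durations_ms)
```

Note how a single slow outlier (900 ms) dominates the p95, which is why tail quantiles are a better latency signal than averages.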

TraceQL Query Patterns

Basic Trace Queries

```traceql
# Find traces by service
{ .service.name = "api" }

# Filter by HTTP status code
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Filter by span duration
{ duration > 1s }
```
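These queries can also be submitted to Tempo's search API. A hedged sketch assuming a local Tempo on port 3200 and the `/api/search` endpoint with its `q` parameter (available in recent Tempo versions); the base URL and limit are assumptions:

```python
# Hypothetical helper: build a Tempo search URL for a TraceQL query.
from urllib.parse import urlencode

def build_tempo_search_url(base_url: str, traceql: str, limit: int = 20) -> str:
    """Build a GET URL for Tempo's /api/search endpoint."""
    params = urlencode({"q": traceql, "limit": limit})
    return f"{base_url}/api/search?{params}"

url = build_tempo_search_url(
    "http://localhost:3200",
    '{ .service.name = "api" && .http.status_code >= 400 }',
)
```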

Advanced TraceQL

```traceql
# Parent-child relationship: frontend spans with a direct backend child that failed
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans: slow PostgreSQL queries anywhere under the api service
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries under the api service (status is an unquoted intrinsic)
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```

Complete Dashboard Examples

Application Observability Dashboard

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": {
            "selected": false,
            "text": "api",
            "value": "api"
          }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        },
        "yaxes": [
          {
            "format": "reqps",
            "label": "Requests/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        },
        "yaxes": [
          {
            "format": "s",
            "label": "Duration"
          }
        ],
        "thresholds": [
          {
            "value": 1,
            "colorMode": "critical",
            "fill": true,
            "line": true,
            "op": "gt"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 8
        },
        "yaxes": [
          {
            "format": "percentunit",
            "max": 1,
            "min": 0
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.01],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 8
        }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}
```
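A dashboard model like this one can also be pushed through Grafana's HTTP API (`POST /api/dashboards/db`). A minimal sketch; the URL, token, and dashboard body are placeholders:

```python
# Hedged sketch: wrap a dashboard model in the payload Grafana's
# POST /api/dashboards/db endpoint expects. Token and URL are placeholders.
import json
import urllib.request

def make_dashboard_request(base_url: str, api_token: str,
                           dashboard: dict) -> urllib.request.Request:
    payload = {
        "dashboard": {**dashboard, "id": None},  # null id: create, or match by uid
        "overwrite": True,
        "message": "Provisioned via API",
    }
    return urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )

req = make_dashboard_request(
    "http://localhost:3000", "GLSA_PLACEHOLDER",
    {"title": "Application Observability"},
)
# Sending it: urllib.request.urlopen(req) against a live Grafana instance.
```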

LGTM Stack Configuration

Loki Configuration

File: `loki.yaml`

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
```

Tempo Configuration

File: `tempo.yaml`

```yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
```

Production Best Practices

Performance Optimization

Query Optimization

  • Use label filters before line filters
  • Limit time ranges for expensive queries
  • Use `unwrap` instead of parsing when possible
  • Cache query results with the query frontend
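As a concrete example of the first rule, narrowing the stream set with labels before applying line filters reduces how much data Loki must scan (the label values here are illustrative):

```logql
# Slower: the line filter scans every stream in the namespace
{namespace="production"} |= "error"

# Faster: labels narrow the stream set first, then the line filter runs
{namespace="production", app="api"} |= "error"
```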

Dashboard Performance

  • Limit the number of panels (< 15 per dashboard)
  • Use appropriate time intervals
  • Avoid high-cardinality grouping
  • Use `$__interval` for adaptive sampling

Storage Optimization

  • Configure retention policies
  • Use compaction for Loki and Tempo
  • Implement tiered storage (hot/warm/cold)
  • Monitor storage growth

Security Best Practices

Authentication

  • Enable auth (`auth_enabled: true` in Loki/Tempo)
  • Use OAuth/LDAP for Grafana
  • Implement multi-tenancy with org isolation

Authorization

  • Configure RBAC in Grafana
  • Limit datasource access by team
  • Use folder permissions for dashboards

Network Security

  • TLS for all components
  • Network policies in Kubernetes
  • Rate limiting at ingress

Troubleshooting

Common Issues

  1. High Cardinality: Too many unique label combinations
    • Solution: Reduce label dimensions, use log parsing instead
  2. Query Timeouts: Complex queries on large datasets
    • Solution: Reduce time range, use aggregations, add query limits
  3. Storage Growth: Unbounded retention
    • Solution: Configure retention policies, enable compaction
  4. Missing Traces: Incomplete trace data
    • Solution: Check sampling rates, verify instrumentation
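For the high-cardinality case, the usual fix is to move unbounded values (user IDs, request IDs) out of stream labels and into the log body, filtering after parsing instead (labels here are illustrative):

```logql
# High cardinality (bad): user_id as a stream label creates one stream per user
{app="api", user_id="12345"}

# Better: keep user_id in the log body and filter after parsing
{app="api"} | json | user_id="12345"
```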

Resources
