
Grafana and LGTM Stack Skill


Overview


The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:
  • Loki: Log aggregation and querying (LogQL)
  • Grafana: Visualization, dashboarding, alerting, and exploration
  • Tempo: Distributed tracing (TraceQL)
  • Mimir: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

When to Use This Skill


Primary Use Cases


  • Creating or modifying Grafana dashboards
  • Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
  • Writing queries (PromQL, LogQL, TraceQL)
  • Configuring data sources (Prometheus, Loki, Tempo, Mimir)
  • Setting up alerting rules and notification policies
  • Implementing dashboard variables and templates
  • Dashboard provisioning and GitOps workflows
  • Troubleshooting observability queries
  • Analyzing application performance, errors, or system behavior

Who Uses This Skill


  • senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
  • software-engineer: Application dashboards, service metrics visualization

LGTM Stack Components


Loki - Log Aggregation


Architecture - Loki


Horizontally scalable log aggregation inspired by Prometheus
  • Indexes only metadata (labels), not log content
  • Cost-effective storage with object stores (S3, GCS, etc.)
  • LogQL query language similar to PromQL

Key Concepts - Loki


  • Labels for indexing (low cardinality)
  • Log streams identified by unique label sets
  • Parsers: logfmt, JSON, regex, pattern
  • Line filters and label filters
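These concepts compose into a single query pipeline: the stream selector hits the label index, line filters narrow candidates cheaply, and a parser promotes fields for label filtering. A typical combination (assuming JSON-formatted application logs with a numeric status field):

```logql
{namespace="production", app="api"} |= "error" | json | status >= 500
```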

Grafana - Visualization


Features


  • Multi-datasource dashboarding
  • Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
  • Templating and variables for dynamic dashboards
  • Alerting (unified alerting with contact points and notification policies)
  • Dashboard provisioning and GitOps integration
  • Role-based access control (RBAC)
  • Explore mode for ad-hoc queries
  • Annotations for event markers
  • Dashboard folders and organization

Tempo - Distributed Tracing


Architecture - Tempo


Scalable distributed tracing backend
  • Cost-effective trace storage
  • TraceQL for trace querying
  • Integration with logs and metrics (trace-to-logs, trace-to-metrics)
  • OpenTelemetry compatible
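TraceQL is referenced above but not shown here; as a sketch, queries select spans with attribute and intrinsic filters (the attribute names below assume OpenTelemetry semantic conventions):

```traceql
{ resource.service.name = "api" && span.http.status_code >= 500 }
{ resource.service.name = "api" && duration > 2s }
```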

Mimir - Metrics Storage


Architecture - Mimir


Horizontally scalable long-term Prometheus storage
  • Multi-tenancy support
  • Query federation
  • High availability
  • Prometheus remote_write compatible
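An existing Prometheus can ship metrics to Mimir by pointing remote_write at Mimir's push endpoint. A minimal sketch, assuming a default single-binary Mimir listening on port 8080 (adjust the URL and tenant header for your deployment):

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: http://mimir:8080/api/v1/push
    headers:
      # tenant ID; required when Mimir multi-tenancy is enabled
      X-Scope-OrgID: team-platform
```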

Dashboard Design and Best Practices


Dashboard Organization Principles


  1. Hierarchy: Overview -> Service -> Component -> Deep Dive
  2. Golden Signals: Latency, Traffic, Errors, Saturation (RED/USE method)
  3. Variable-driven: Use templates for flexibility across environments
  4. Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
  5. Performance: Limit queries, use query caching, appropriate time intervals
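For an HTTP service exposing the http_requests_total counter and http_request_duration_seconds histogram (the metric names used throughout this skill), the RED signals translate to queries like:

```promql
# Rate: requests per second
sum(rate(http_requests_total{app="$app"}[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{app="$app", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{app="$app"}[5m]))

# Duration: P95 latency in seconds
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="$app"}[5m])) by (le))
```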

Panel Types and When to Use Them


Panel Type | Use Case | Best For
--- | --- | ---
Time Series / Graph | Trends over time | Request rates, latency, resource usage
Stat | Single metric value | Error rates, current values, percentage
Gauge | Progress toward limit | CPU usage, memory, disk space
Bar Gauge | Comparative values | Top N items, distribution
Table | Structured data | Service lists, error details, resource inventory
Pie Chart | Proportions | Traffic distribution, error breakdown
Heatmap | Distribution over time | Latency percentiles, request patterns
Logs | Log streams | Error investigation, debugging
Traces | Distributed tracing | Performance analysis, dependency mapping

Panel Configuration Best Practices


Titles and Descriptions


  • Clear, descriptive titles: Include units and metric context
  • Tooltips: Add description fields for panel documentation
  • Examples:
    • Good: "P95 Latency (seconds) by Endpoint"
    • Bad: "Latency"

Legends and Labels


  • Show legends only when needed (multiple series)
  • Use {{label}} format for dynamic legend names
  • Place legends appropriately (bottom, right, or hidden)
  • Sort by value when showing Top N

Axes and Units


  • Always label axes with units
  • Use appropriate unit formats (seconds, bytes, percent, requests/sec)
  • Set reasonable min/max ranges to avoid misleading scales
  • Use logarithmic scales for wide value ranges

Thresholds and Colors


  • Use thresholds for visual cues (green/yellow/red)
  • Standard threshold pattern:
    • Green: Normal operation
    • Yellow: Warning (action may be needed)
    • Red: Critical (immediate attention required)
  • Examples:
    • Error rate: 0% (green), 1% (yellow), 5% (red)
    • P95 latency: <1s (green), 1-3s (yellow), >3s (red)
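In dashboard JSON these thresholds live under a panel's fieldConfig. A sketch of the error-rate example above (verify field names against your Grafana version's schema):

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    }
  }
}
```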

Links and Drilldowns


  • Link panels to related dashboards
  • Use data links for context (logs, traces, related services)
  • Create drill-down paths: Overview -> Service -> Component -> Details
  • Link to runbooks for alert panels
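Data links are defined per field and can interpolate series labels and the current time range. A hedged sketch (the target dashboard UID app-details is a placeholder):

```json
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "App details",
          "url": "/d/app-details?var-app=${__field.labels.app}&${__url_time_range}"
        }
      ]
    }
  }
}
```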

Dashboard Variables and Templating


Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

Variable Types


Type | Purpose | Example
--- | --- | ---
Query | Populate from data source | Namespaces, services, pods
Custom | Static list of options | Environments (prod/staging/dev)
Interval | Time interval selection | Auto-adjusted query intervals
Datasource | Switch between data sources | Multiple Prometheus instances
Constant | Hidden values for queries | Cluster name, region
Text box | Free-form input | Custom filters

Common Variable Patterns


json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}

Variable Usage in Queries


Variables are referenced with $variable_name or ${variable_name} syntax:

Simple variable reference

rate(http_requests_total{namespace="$namespace"}[5m])

Multi-select with regex match

rate(http_requests_total{namespace=~"$namespace"}[5m])

Variable in legend

sum by (method) (rate(http_requests_total{app="$app"}[5m]))

Legend format: "{{method}}"


Using interval variable for adaptive queries

rate(http_requests_total[$__interval])

Chained variables (app depends on namespace)

rate(http_requests_total{namespace="$namespace", app="$app"}[5m])

Advanced Variable Techniques


Regex filtering:
json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
All option with custom value:
json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
Dependent variables (variable chain):
  1. $datasource (datasource type)
  2. $cluster (query: depends on datasource)
  3. $namespace (query: depends on cluster)
  4. $app (query: depends on namespace)
  5. $pod (query: depends on app)

Annotations


Annotations display events as vertical markers on time series panels:
json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}

Dashboard Performance Optimization


Query Optimization


  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time ranges (avoid queries over months)
  • Leverage $__interval for adaptive sampling
  • Avoid high-cardinality grouping (too many series)
  • Use query caching when available
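High-cardinality grouping is the most common cause of slow panels; capping results with topk keeps both the query and the legend manageable. For example:

```promql
# Unbounded: one series per pod/path combination
sum(rate(http_requests_total[5m])) by (pod, path)

# Bounded: only the ten busiest paths, at adaptive resolution
topk(10, sum(rate(http_requests_total[$__interval])) by (path))
```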

Panel Performance


  • Set max data points to reasonable values
  • Use instant queries for current-state panels
  • Combine related metrics into single queries when possible
  • Disable auto-refresh on heavy dashboards
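The first two bullets map to per-panel JSON fields. A sketch (field names from the dashboard JSON schema; instant applies to Prometheus-style targets):

```json
{
  "maxDataPoints": 300,
  "interval": "1m",
  "targets": [
    { "refId": "A", "expr": "sum(up)", "instant": true }
  ]
}
```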

Dashboard as Code and Provisioning


Dashboard Provisioning


Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

Provisioning Provider Configuration


File: /etc/grafana/provisioning/dashboards/dashboards.yaml
yaml
apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: "application"
    orgId: 1
    folder: "Applications"
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: "infrastructure"
    orgId: 1
    folder: "Infrastructure"
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure

Dashboard JSON Structure


Complete dashboard JSON with metadata and provisioning:
json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}

Kubernetes ConfigMap Provisioning


yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }

Grafana Operator (CRD)


yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }

Data Source Provisioning


Loki Data Source


File: /etc/grafana/provisioning/datasources/loki.yaml
yaml
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false

Tempo Data Source


File: /etc/grafana/provisioning/datasources/tempo.yaml
yaml
apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false

Mimir/Prometheus Data Source


File: /etc/grafana/provisioning/datasources/mimir.yaml
yaml
apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false

Alerting


Alert Rule Configuration


Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.

Prometheus/Mimir Alert Rule


File: /etc/grafana/provisioning/alerting/rules.yaml
yaml
apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning

Loki Alert Rule


yaml
apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true

Contact Points and Notification Policies


File: /etc/grafana/provisioning/alerting/contactpoints.yaml
yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h

LogQL Query Patterns


Basic Log Queries


Stream Selection



Simple label matching

{namespace="production", app="api"}

Regex matching

{app=~"api|web|worker"}

Not equal

{env!="staging"}

Multiple conditions

{namespace="production", app="api", level!="debug"}

Line Filters



Contains

{app="api"} |= "error"

Does not contain

{app="api"} != "debug"

Regex match

{app="api"} |~ "error|exception|fatal"

Case insensitive

{app="api"} |~ "(?i)error"

Chaining filters

{app="api"} |= "error" != "timeout"

Parsing and Extraction

JSON Parsing

```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields (renaming them as they are extracted)
{app="api"} | json message="msg", level="severity"

# Filter on an extracted field
{app="api"} | json | level="error"

# Nested JSON fields are flattened with underscores by the json parser
{app="api"} | json | line_format "{{.response_status}}"
```
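For intuition, the `| json | level="error"` pipeline above behaves roughly like this Python sketch (the field names and sample lines are made up):

```python
# Illustrative approximation of Loki's json parser + label filter stage.
import json

def filter_error_lines(lines: list[str]) -> list[dict]:
    """Parse each JSON log line and keep only level == "error" entries."""
    out = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Loki would keep the line and attach a __error__ label instead
            continue
        if record.get("level") == "error":
            out.append(record)
    return out

logs = [
    '{"level": "info", "msg": "request served"}',
    '{"level": "error", "msg": "upstream timeout"}',
    'not json at all',
]
errors = filter_error_lines(logs)
```

Unlike this sketch, Loki never drops unparseable lines; it labels them with `__error__` so they can be filtered explicitly.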

Logfmt Parsing

```logql
# Parse logfmt (key=value) lines
{app="api"} | logfmt

# Extract only specific fields
{app="api"} | logfmt level, caller, msg

# Filter on parsed fields
{app="api"} | logfmt | level="error"
```
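What `| logfmt` extracts per line can be approximated with a small parser. This sketch handles plain and double-quoted values only, not every edge case of real logfmt:

```python
# Minimal logfmt parser sketch; the sample line is illustrative.
import shlex

def parse_logfmt(line: str) -> dict:
    """Split a logfmt line into key/value pairs, honoring double quotes."""
    fields = {}
    for token in shlex.split(line):  # shlex strips the surrounding quotes
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

fields = parse_logfmt('level=error caller=api.go:42 msg="upstream timeout" duration=1.5')
```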

Pattern Parsing

```logql
# Extract fields with a pattern expression (pattern literals use backticks)
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Extract from a structured message
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```

Aggregations and Metrics

Count Queries

```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Per-second rate of logs
rate({app="api"}[5m])

# Errors per second, by namespace
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m])) / sum(rate({app="api"}[5m]))
```
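The error-ratio query divides two `rate()` results over the same window, so the window length cancels out. A worked example with hypothetical counts:

```python
# Worked arithmetic for the error-ratio query; counts are assumptions.
WINDOW_SECONDS = 300  # the [5m] range

error_lines = 12      # lines matching |= "error" in the window
total_lines = 2_400   # all lines in the window

error_rate = error_lines / WINDOW_SECONDS  # per-second error rate
total_rate = total_lines / WINDOW_SECONDS  # per-second log rate
error_ratio = error_rate / total_rate      # window cancels: == error_lines / total_lines
```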

Extracted Metrics

```logql
# Average duration by endpoint
avg_over_time({app="api"} | logfmt | unwrap duration [5m]) by (endpoint)

# P95 latency by method (regexp literals use backticks)
quantile_over_time(0.95,
  {app="api"} | regexp `duration=(?P<duration>[0-9.]+)ms` | unwrap duration [5m]
) by (method)

# Top 10 error messages over the last hour
topk(10, sum by (msg) (count_over_time({app="api"} | json | level="error" [1h])))
```
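`quantile_over_time` computes a quantile over the unwrapped samples in each window using linear interpolation between ranked samples. A small Python sketch with made-up duration values:

```python
# Linear-interpolation quantile over raw samples, as a sketch of what
# quantile_over_time(0.95, ... | unwrap duration [5m]) evaluates per window.
def quantile(q: float, samples: list[float]) -> float:
    """Quantile of raw samples with linear interpolation between ranks."""
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

durations_ms = [12.0, 15.0, 14.0, 250.0, 13.0, 16.0, 11.0, 18.0, 17.0, 900.0]
p95 = quantile(0.95, durations_ms)
```

Note how a single slow outlier (900 ms) dominates the p95, which is why tail quantiles are a better latency signal than averages.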

TraceQL Query Patterns

Basic Trace Queries

```traceql
# Find traces by service
{ .service.name = "api" }

# Filter by HTTP status code
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Filter by span duration
{ duration > 1s }
```
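These queries can also be submitted to Tempo's search API. A hedged sketch assuming a local Tempo on port 3200 and the `/api/search` endpoint with its `q` parameter (available in recent Tempo versions); the base URL and limit are assumptions:

```python
# Hypothetical helper: build a Tempo search URL for a TraceQL query.
from urllib.parse import urlencode

def build_tempo_search_url(base_url: str, traceql: str, limit: int = 20) -> str:
    """Build a GET URL for Tempo's /api/search endpoint."""
    params = urlencode({"q": traceql, "limit": limit})
    return f"{base_url}/api/search?{params}"

url = build_tempo_search_url(
    "http://localhost:3200",
    '{ .service.name = "api" && .http.status_code >= 400 }',
)
```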

Advanced TraceQL

```traceql
# Parent-child relationship: frontend spans with a direct backend child that failed
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans: slow PostgreSQL queries anywhere under the api service
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries under the api service (status is an unquoted intrinsic)
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```

Complete Dashboard Examples

Application Observability Dashboard

```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": {
            "selected": false,
            "text": "api",
            "value": "api"
          }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        },
        "yaxes": [
          {
            "format": "reqps",
            "label": "Requests/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        },
        "yaxes": [
          {
            "format": "s",
            "label": "Duration"
          }
        ],
        "thresholds": [
          {
            "value": 1,
            "colorMode": "critical",
            "fill": true,
            "line": true,
            "op": "gt"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 8
        },
        "yaxes": [
          {
            "format": "percentunit",
            "max": 1,
            "min": 0
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.01],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 8
        }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}
```
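A dashboard model like this one can also be pushed through Grafana's HTTP API (`POST /api/dashboards/db`). A minimal sketch; the URL, token, and dashboard body are placeholders:

```python
# Hedged sketch: wrap a dashboard model in the payload Grafana's
# POST /api/dashboards/db endpoint expects. Token and URL are placeholders.
import json
import urllib.request

def make_dashboard_request(base_url: str, api_token: str,
                           dashboard: dict) -> urllib.request.Request:
    payload = {
        "dashboard": {**dashboard, "id": None},  # null id: create, or match by uid
        "overwrite": True,
        "message": "Provisioned via API",
    }
    return urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )

req = make_dashboard_request(
    "http://localhost:3000", "GLSA_PLACEHOLDER",
    {"title": "Application Observability"},
)
# Sending it: urllib.request.urlopen(req) against a live Grafana instance.
```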

LGTM Stack Configuration

Loki Configuration

File: `loki.yaml`

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
```

Tempo Configuration

File: `tempo.yaml`

```yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
```

Production Best Practices

Performance Optimization

Query Optimization

  • Use label filters before line filters
  • Limit time ranges for expensive queries
  • Use `unwrap` instead of parsing when possible
  • Cache query results with the query frontend
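As a concrete example of the first rule, narrowing the stream set with labels before applying line filters reduces how much data Loki must scan (the label values here are illustrative):

```logql
# Slower: the line filter scans every stream in the namespace
{namespace="production"} |= "error"

# Faster: labels narrow the stream set first, then the line filter runs
{namespace="production", app="api"} |= "error"
```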

Dashboard Performance

  • Limit the number of panels (< 15 per dashboard)
  • Use appropriate time intervals
  • Avoid high-cardinality grouping
  • Use `$__interval` for adaptive sampling

Storage Optimization

  • Configure retention policies
  • Use compaction for Loki and Tempo
  • Implement tiered storage (hot/warm/cold)
  • Monitor storage growth

Security Best Practices

Authentication

  • Enable auth (`auth_enabled: true` in Loki/Tempo)
  • Use OAuth/LDAP for Grafana
  • Implement multi-tenancy with org isolation

Authorization

  • Configure RBAC in Grafana
  • Limit datasource access by team
  • Use folder permissions for dashboards

Network Security

  • TLS for all components
  • Network policies in Kubernetes
  • Rate limiting at ingress

Troubleshooting

Common Issues

  1. High Cardinality: Too many unique label combinations
    • Solution: Reduce label dimensions, use log parsing instead
  2. Query Timeouts: Complex queries on large datasets
    • Solution: Reduce time range, use aggregations, add query limits
  3. Storage Growth: Unbounded retention
    • Solution: Configure retention policies, enable compaction
  4. Missing Traces: Incomplete trace data
    • Solution: Check sampling rates, verify instrumentation
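For the high-cardinality case, the usual fix is to move unbounded values (user IDs, request IDs) out of stream labels and into the log body, filtering after parsing instead (labels here are illustrative):

```logql
# High cardinality (bad): user_id as a stream label creates one stream per user
{app="api", user_id="12345"}

# Better: keep user_id in the log body and filter after parsing
{app="api"} | json | user_id="12345"
```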

Resources
