Grafana and LGTM Stack Skill
Overview
The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:
- Loki: Log aggregation and querying (LogQL)
- Grafana: Visualization, dashboarding, alerting, and exploration
- Tempo: Distributed tracing (TraceQL)
- Mimir: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.
When to Use This Skill
Primary Use Cases
- Creating or modifying Grafana dashboards
- Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
- Writing queries (PromQL, LogQL, TraceQL)
- Configuring data sources (Prometheus, Loki, Tempo, Mimir)
- Setting up alerting rules and notification policies
- Implementing dashboard variables and templates
- Dashboard provisioning and GitOps workflows
- Troubleshooting observability queries
- Analyzing application performance, errors, or system behavior
Who Uses This Skill
- senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
- software-engineer: Application dashboards, service metrics visualization
LGTM Stack Components
Loki - Log Aggregation
Architecture - Loki
Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL
Key Concepts - Loki
- Labels for indexing (low cardinality)
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters
Grafana - Visualization
Features
- Multi-datasource dashboarding
- Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
- Templating and variables for dynamic dashboards
- Alerting (unified alerting with contact points and notification policies)
- Dashboard provisioning and GitOps integration
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
- Annotations for event markers
- Dashboard folders and organization
Tempo - Distributed Tracing
Architecture - Tempo
Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible
Mimir - Metrics Storage
Architecture - Mimir
Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible
Dashboard Design and Best Practices
Dashboard Organization Principles
- Hierarchy: Overview -> Service -> Component -> Deep Dive
- Golden Signals: Latency, Traffic, Errors, Saturation (RED/USE method)
- Variable-driven: Use templates for flexibility across environments
- Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
- Performance: Limit queries, use query caching, appropriate time intervals
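The 24-column grid mentioned above can be made concrete with a short Python sketch. It assumes equal-width panels placed left to right with a fixed height; the helper name `layout_panels` and its parameters are hypothetical and not part of any Grafana API. The dictionaries it returns follow the `gridPos` shape (`h`, `w`, `x`, `y`) used in dashboard JSON.

```python
def layout_panels(count, per_row=2, height=8, grid_columns=24):
    """Return gridPos dicts that place `count` equal-width panels row by row.

    Hypothetical helper for illustration; Grafana itself only consumes the
    resulting h/w/x/y values.
    """
    if grid_columns % per_row != 0:
        raise ValueError("per_row must divide the grid width evenly")
    width = grid_columns // per_row
    positions = []
    for index in range(count):
        # divmod gives the zero-based row and column of this panel
        row, col = divmod(index, per_row)
        positions.append({"h": height, "w": width, "x": col * width, "y": row * height})
    return positions


if __name__ == "__main__":
    for pos in layout_panels(4):
        print(pos)
```

Generating positions this way keeps rows aligned and avoids overlapping panels when a dashboard is assembled from code.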
Panel Types and When to Use Them
| Panel Type | Use Case | Best For |
|---|---|---|
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentage |
| Gauge | Progress toward limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distribution |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |
Panel Configuration Best Practices
Titles and Descriptions
- Clear, descriptive titles: Include units and metric context
- Tooltips: Add description fields for panel documentation
- Examples:
  - Good: "P95 Latency (seconds) by Endpoint"
  - Bad: "Latency"
Legends and Labels
- Show legends only when needed (multiple series)
- Use `{{label}}` format for dynamic legend names
- Place legends appropriately (bottom, right, or hidden)
- Sort by value when showing Top N
Axes and Units
- Always label axes with units
- Use appropriate unit formats (seconds, bytes, percent, requests/sec)
- Set reasonable min/max ranges to avoid misleading scales
- Use logarithmic scales for wide value ranges
Thresholds and Colors
- Use thresholds for visual cues (green/yellow/red)
- Standard threshold pattern:
  - Green: Normal operation
  - Yellow: Warning (action may be needed)
  - Red: Critical (immediate attention required)
- Examples:
  - Error rate: 0% (green), 1% (yellow), 5% (red)
  - P95 latency: <1s (green), 1-3s (yellow), >3s (red)
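A minimal Python sketch of how a threshold list maps a value to a color, assuming Grafana-style steps where each step applies from its lower bound upward and the base step has no bound. `threshold_color` and `ERROR_RATE_STEPS` are illustrative names, and the numbers mirror the error-rate example above.

```python
def threshold_color(value, steps):
    """Return the color of the last step whose lower bound is <= value.

    `steps` is a list of (lower_bound, color) pairs sorted ascending; the
    first pair uses None as its bound and acts as the base color.
    """
    color = steps[0][1]
    for bound, step_color in steps:
        if bound is None or value >= bound:
            color = step_color
    return color


# Error rate in percent: green from the base, yellow from 1%, red from 5%
ERROR_RATE_STEPS = [(None, "green"), (1.0, "yellow"), (5.0, "red")]

if __name__ == "__main__":
    for rate in (0.2, 1.0, 7.5):
        print(rate, threshold_color(rate, ERROR_RATE_STEPS))
```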
Links and Drilldowns
- Link panels to related dashboards
- Use data links for context (logs, traces, related services)
- Create drill-down paths: Overview -> Service -> Component -> Details
- Link to runbooks for alert panels
Dashboard Variables and Templating
Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.
Variable Types
| Type | Purpose | Example |
|---|---|---|
| Query | Populate from data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |
Common Variable Patterns
```json
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}
```
Variable Usage in Queries
Variables are referenced with `$variable_name` or `${variable_name}` syntax:
```promql
# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in legend (aggregate with sum by; rate() alone takes no by clause)
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"

# Using interval variable for adaptive queries
# ($__rate_interval is usually the safer choice inside rate())
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])
```
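To see why multi-select variables need the `=~` matcher, here is a simplified Python sketch of variable interpolation. It assumes multi-value selections are joined into a regex alternation, which approximates what Grafana does for Prometheus-style data sources; the function `interpolate` is hypothetical and covers only the `$name` and `${name}` forms.

```python
import re

# Matches ${name} (group 1) or $name (group 2)
_VARIABLE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def interpolate(query, variables):
    """Substitute dashboard variables into a query string (illustrative only)."""

    def replace(match):
        name = match.group(1) or match.group(2)
        if name not in variables:
            # Leave unknown names such as $__interval untouched
            return match.group(0)
        value = variables[name]
        if isinstance(value, (list, tuple)):
            # Multi-value selection becomes a regex alternation
            return "(" + "|".join(re.escape(item) for item in value) + ")"
        return str(value)

    return _VARIABLE.sub(replace, query)


if __name__ == "__main__":
    query = 'rate(http_requests_total{namespace=~"$namespace", app="${app}"}[5m])'
    print(interpolate(query, {"namespace": ["prod", "staging"], "app": "api"}))
```

With an exact-match `=` the alternation would be compared as a literal string and match nothing, which is why multi-select variables pair with `=~`.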
Advanced Variable Techniques
Regex filtering:
```json
{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}
```
All option with custom value:
```json
{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}
```
Dependent variables (variable chain):
- `$datasource` (datasource type)
- `$cluster` (query: depends on datasource)
- `$namespace` (query: depends on cluster)
- `$app` (query: depends on namespace)
- `$pod` (query: depends on app)
Annotations
Annotations display events as vertical markers on time series panels:
```json
{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}
```
Dashboard Performance Optimization
Query Optimization
- Limit number of panels (< 15 per dashboard)
- Use appropriate time ranges (avoid queries over months)
- Leverage `$__interval` for adaptive sampling
- Avoid high-cardinality grouping (too many series)
- Use query caching when available
Panel Performance
- Set max data points to reasonable values
- Use instant queries for current-state panels
- Combine related metrics into single queries when possible
- Disable auto-refresh on heavy dashboards
Dashboard as Code and Provisioning
Dashboard Provisioning
Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.
Provisioning Provider Configuration
File: `/etc/grafana/provisioning/dashboards/dashboards.yaml`
```yaml
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
  - name: "application"
    orgId: 1
    folder: "Applications"
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application
  - name: "infrastructure"
    orgId: 1
    folder: "Infrastructure"
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure
```
Dashboard JSON Structure
Complete dashboard JSON with metadata. The outer `dashboard` / `overwrite` / `folderUid` wrapper is the payload shape of the HTTP API (`POST /api/dashboards/db`); file-based provisioning reads the inner `dashboard` object on its own:
```json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}
```
Kubernetes ConfigMap Provisioning
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }
```
Grafana Operator (CRD)
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }
```
Data Source Provisioning
Loki Data Source
File: `/etc/grafana/provisioning/datasources/loki.yaml`
```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    uid: loki_uid # referenced as datasourceUid by the Tempo tracesToLogs settings
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          # "$$" escapes "$" so provisioning does not treat it as an env variable
          url: "$${__value.raw}"
    editable: false
```
Tempo Data Source
File: `/etc/grafana/provisioning/datasources/tempo.yaml`
```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false
```
Mimir/Prometheus Data Source
File: `/etc/grafana/provisioning/datasources/mimir.yaml`
```yaml
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false
```
Alerting
Alert Rule Configuration
Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.
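As a rough illustration of how evaluation with a pending period behaves, the Python sketch below walks a series of evaluation results through Normal, Pending, and Alerting states. It is a simplified model, not Grafana's scheduler: `evaluate_states` is a hypothetical name, and `pending_evaluations` stands for the `for` duration divided by the group interval (for example 5m / 1m = 5).

```python
def evaluate_states(samples, threshold, pending_evaluations):
    """Return the alert state after each evaluation.

    A breach starts a Pending streak; the rule moves to Alerting only once
    the breach has persisted for `pending_evaluations` further evaluations.
    Any non-breaching sample resets the streak to Normal.
    """
    states = []
    consecutive_breaches = 0
    for value in samples:
        if value > threshold:
            consecutive_breaches += 1
            if consecutive_breaches > pending_evaluations:
                states.append("Alerting")
            else:
                states.append("Pending")
        else:
            consecutive_breaches = 0
            states.append("Normal")
    return states


if __name__ == "__main__":
    # Error ratios sampled once per evaluation interval, threshold 5%
    ratios = [0.01, 0.06, 0.07, 0.08, 0.02]
    print(evaluate_states(ratios, threshold=0.05, pending_evaluations=2))
```

The pending period trades detection speed for fewer notifications on short spikes, which is why a paging rule may use a short or zero duration while warnings use a longer one.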
Prometheus/Mimir Alert Rule
File: `/etc/grafana/provisioning/alerting/rules.yaml`
```yaml
apiVersion: 1
groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          # The query returns a ratio (0-1), so report it as a ratio, not a percentage
          description: 'Error ratio is {{ printf "%.2f" $values.A.Value }} (threshold: 0.05)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false
      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning
```
Loki Alert Rule
```yaml
apiVersion: 1
groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning
      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true
```
Contact Points and Notification Policies
File: `/etc/grafana/provisioning/alerting/contactpoints.yaml`
```yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false
  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure
  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true
# Notification policy tree; file provisioning reads it from the "policies" key
policies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true
      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h
```
LogQL Query Patterns
Basic Log Queries
Stream Selection
```logql
# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}
```
Line Filters
```logql
# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"
```
Parsing and Extraction
JSON Parsing
```logql
# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on extracted field
{app="api"} | json | level="error"

# Nested JSON (the json parser flattens nested keys with "_", e.g. response.status -> response_status)
{app="api"} | json | line_format "{{.response_status}}"
```
Logfmt Parsing
```logql
# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"
```
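To make the `| logfmt | level="error"` pipeline concrete, here is a small Python sketch that parses key=value lines and filters on an extracted field. It is an approximation built on the standard-library `shlex` module, not Loki's parser, and it assumes quoted values use double quotes and that lines contain no stray quote characters; `parse_logfmt` is an illustrative name.

```python
import shlex


def parse_logfmt(line):
    """Parse a logfmt line such as 'level=error msg="timed out"' into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:
            # Tokens without "=" are skipped rather than treated as flags
            fields[key] = value
    return fields


if __name__ == "__main__":
    lines = [
        'level=info msg="request served" duration=12ms',
        'level=error msg="connection refused" duration=120ms',
    ]
    # Equivalent in spirit to: {app="api"} | logfmt | level="error"
    errors = [f for f in map(parse_logfmt, lines) if f.get("level") == "error"]
    print(errors)
```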
Pattern Parsing
```logql
# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
```
Aggregations and Metrics
Count Queries
```logql
# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
```
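The error-ratio query divides two rates taken over the same window, so the window length cancels out and the result equals matching lines divided by total lines. A short Python sketch of that arithmetic, using a hypothetical `error_ratio` helper over an in-memory list of log lines:

```python
def error_ratio(lines, needle="error"):
    """Return the fraction of lines containing `needle`, or None when empty.

    Mirrors the |= "error" line filter: a case-sensitive substring match.
    """
    total = len(lines)
    if total == 0:
        # No logs in the window: the ratio is undefined rather than zero
        return None
    matching = sum(1 for line in lines if needle in line)
    return matching / total


if __name__ == "__main__":
    window = [
        "level=error msg=timeout",
        "level=info msg=ok",
        "level=info msg=ok",
        "level=error msg=refused",
    ]
    print(error_ratio(window))
```

Returning `None` for an empty window reflects that a query over no log lines yields no data point, which an alert rule may need to handle separately from a ratio of zero.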
Extracted Metrics
```logql
# Average duration
avg_over_time({app="api"}
  | logfmt
  | unwrap duration [5m]) by (endpoint)

# P95 latency
quantile_over_time(0.95, {app="api"}
  | regexp `duration=(?P<duration>[0-9.]+)ms`
  | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({app="api"}
      | json
      | level="error" [1h]
    )
  )
)
```
TraceQL Query Patterns
Basic Trace Queries
```traceql
# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }
```
Advanced TraceQL
```traceql
# Parent-child relationship (">" matches direct children)
{ .service.name = "frontend" } > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans (">>" matches spans at any depth below)
{ .service.name = "api" } >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries (matched at any depth below the api span)
{ .service.name = "api" } >> { .db.system = "postgresql" && status = error }
```
Complete Dashboard Examples
Application Observability Dashboard
json
{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": { "from": "now-1h", "to": "now" },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": { "selected": false, "text": "api", "value": "api" }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "yaxes": [
          { "format": "reqps", "label": "Requests/sec" }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "yaxes": [
          { "format": "s", "label": "Duration" }
        ],
        "thresholds": [
          { "value": 1, "colorMode": "critical", "fill": true, "line": true, "op": "gt" }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
        "yaxes": [
          { "format": "percentunit", "max": 1, "min": 0 }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.01], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "type": "avg" },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}
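Hand-edited dashboard JSON like the example above is easy to break with a wrong `gridPos`. The following is a rough sketch of a pre-commit check, assuming only that Grafana lays panels out on a 24-column grid; the function names `overlaps` and `layout_problems` are hypothetical and the `example` dict copies just the `id` and `gridPos` values from the dashboard above.

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def overlaps(a: dict, b: dict) -> bool:
    """Return True if two gridPos rectangles share any grid cell."""
    return (
        a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
    )


def layout_problems(dashboard: dict) -> list:
    """Collect human-readable layout problems for a dashboard dict."""
    problems = []
    panels = dashboard.get("panels", [])
    ids = [p["id"] for p in panels]
    if len(ids) != len(set(ids)):
        problems.append("duplicate panel ids")
    for p in panels:
        g = p["gridPos"]
        if g["x"] + g["w"] > GRID_COLUMNS:
            problems.append(f"panel {p['id']} exceeds the {GRID_COLUMNS}-column grid")
    for i, a in enumerate(panels):
        for b in panels[i + 1:]:
            if overlaps(a["gridPos"], b["gridPos"]):
                problems.append(f"panels {a['id']} and {b['id']} overlap")
    return problems


# Only the layout-relevant fields of the four panels in the example dashboard.
example = {"panels": [
    {"id": 1, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
    {"id": 2, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},
    {"id": 3, "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}},
    {"id": 4, "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}},
]}
print(layout_problems(example))
```

For the example dashboard the list comes back empty, since the four panels tile a 2x2 layout without gaps or overlap.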
LGTM Stack Configuration
Loki Configuration
File: loki.yaml
yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3 # removed in Loki 3.0; omit on 3.x
limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h # 7 days
compactor:
  working_directory: /loki/compactor
  shared_store: s3 # Loki 3.x uses delete_request_store instead
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
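The limits in this configuration mix hour and day units, so it helps to check that they are consistent with each other. The snippet below is a small sketch of that arithmetic only; the `to_hours` helper is hypothetical, handles just the `h`, `d` and `w` suffixes, and does not attempt to parse every duration format Loki accepts.

```python
import re

# Hours per supported unit suffix.
UNIT_HOURS = {"h": 1, "d": 24, "w": 168}


def to_hours(duration: str) -> int:
    """Convert a simple duration such as '744h' or '30d' to whole hours."""
    match = re.fullmatch(r"(\d+)([hdw])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNIT_HOURS[match.group(2)]


retention = to_hours("744h")   # retention_period
lookback = to_hours("30d")     # max_query_lookback
reject_age = to_hours("168h")  # reject_old_samples_max_age

# Querying further back than the retention window would only hit deleted data.
assert lookback <= retention
print(retention // 24, lookback // 24, reject_age // 24)
```

With the values above, retention is 31 days, lookback is 30 days, and samples older than 7 days are rejected at ingestion.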
Tempo Configuration
File: tempo.yaml
yaml
server:
  http_listen_port: 3200
  grpc_listen_port: 9096
distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:
ingester:
  max_block_duration: 5m
compactor:
  compaction:
    block_retention: 720h # 30 days
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
Production Best Practices
Performance Optimization
Query Optimization
- Use label filters before line filters
- Limit time ranges for expensive queries
- Use unwrap instead of parsing when possible
- Cache query results with query frontend
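The ordering advice in the first bullet can be made concrete with a small query builder: narrow the streams with the label selector, then apply the cheap line filter, and only then parse. This is a sketch under stated assumptions; `build_logql` is a hypothetical helper, the label names and values are illustrative, and it does not escape quotes inside values.

```python
from typing import Optional


def build_logql(
    selector: dict,
    line_contains: Optional[str] = None,
    json_field_filter: Optional[str] = None,
) -> str:
    """Assemble a LogQL query in cheapest-first order (hypothetical helper)."""
    # 1. Stream selector: label matchers decide which streams are read at all.
    matchers = ", ".join(f'{key}="{value}"' for key, value in selector.items())
    query = "{" + matchers + "}"
    # 2. Line filter: plain substring match, applied before any parsing.
    if line_contains:
        query += f' |= "{line_contains}"'
    # 3. Parser and field filter: the most expensive stage goes last.
    if json_field_filter:
        query += f" | json | {json_field_filter}"
    return query


query = build_logql(
    {"app": "api", "namespace": "prod"},
    line_contains="timeout",
    json_field_filter='level="error"',
)
print(query)
```

The result reads left to right in the same order Loki evaluates the pipeline, so each stage sees fewer lines than the one before it.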
Dashboard Performance
- Limit number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use $__interval for adaptive sampling
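To see why `$__interval` adapts to the selected time range, it helps to approximate how it is derived: roughly the time range divided by the number of data points the panel can show, with a lower bound. The function below is only an approximation for intuition. Its name, the 15-second floor and the 1000-point default are assumptions, and Grafana additionally rounds the result to a convenient step.

```python
def approx_interval_seconds(
    range_seconds: float,
    max_data_points: int = 1000,
    min_interval_seconds: float = 15,
) -> float:
    """Approximate $__interval: range / data points, never below the floor."""
    raw = range_seconds / max_data_points
    return max(raw, min_interval_seconds)


one_hour = approx_interval_seconds(3600)
one_week = approx_interval_seconds(7 * 24 * 3600)
print(one_hour, one_week)
```

A one-hour view stays at the floor, while a one-week view moves to a step of about ten minutes, so the same panel returns a similar number of points at either zoom level.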
Storage Optimization
- Configure retention policies
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth
Security Best Practices
Authentication
- Enable multi-tenant auth (auth_enabled: true in Loki, multitenancy_enabled: true in Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation
Authorization
- Configure RBAC in Grafana
- Limit datasource access by team
- Use folder permissions for dashboards
Network Security
- TLS for all components
- Network policies in Kubernetes
- Rate limiting at ingress
Troubleshooting
Common Issues
- High Cardinality: Too many unique label combinations
  - Solution: Reduce label dimensions; extract high-cardinality values at query time with log parsing instead of indexing them as labels
- Query Timeouts: Complex queries on large datasets
  - Solution: Reduce time range, use aggregations, add query limits
- Storage Growth: Unbounded retention
  - Solution: Configure retention policies, enable compaction
- Missing Traces: Incomplete trace data
  - Solution: Check sampling rates, verify instrumentation
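The high-cardinality item above is easier to reason about with numbers: the count of possible streams or series is bounded by the product of the distinct values of each label. The sketch below illustrates that bound; the `max_series` helper and all label names and counts are made up for the example.

```python
from math import prod


def max_series(label_value_counts: dict) -> int:
    """Upper bound on series: product of distinct values per label."""
    return prod(label_value_counts.values())


# Illustrative counts only; user_id is the kind of label that should not be indexed.
with_user_id = {"app": 20, "namespace": 5, "pod": 200, "user_id": 100_000}
without_user_id = {k: v for k, v in with_user_id.items() if k != "user_id"}

print(max_series(with_user_id))
print(max_series(without_user_id))
```

Dropping the single unbounded label reduces the bound by five orders of magnitude, which is why such values belong in the log line or span attributes and are better filtered at query time.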