Grafana Cloud AI & ML

Grafana Assistant

Context-aware LLM sidebar agent (GA). Integrates with your Grafana Cloud stack.
Capabilities:
  • Convert natural language to PromQL/LogQL/TraceQL
  • Explain existing queries in plain English
  • Build and edit dashboards from descriptions
  • Investigate incidents (correlate metrics, logs, traces)
  • MCP server integration — connect external tools to Assistant
  • RBAC controls per organization
  • Slack integration for on-call workflows
Assistant Investigations (public preview): Multi-agent autonomous incident analysis mode — launches multiple specialized agents in parallel to investigate different signals.
Enable: Grafana Cloud → Administration → AI & LLM → Enable Grafana Assistant
In panel editor: Click the magic wand / "Assistant" icon to get query suggestions and explanations.

Dynamic Alerting

ML-based alerting without static thresholds.

Forecasting (Prophet model)

Trained on 90 days of history; learns daily and weekly seasonality patterns.
```bash
# Create forecast job
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/forecast \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cpu-forecast",
    "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
    "datasourceId": 1,
    "interval": 300,
    "trainingWindow": "90d",
    "forecastWindow": "7d",
    "algorithm": {
      "name": "prophet",
      "config": {}
    }
  }'
```

Generated metric pairs for alert rules:
```promql
# Predicted value
ml_forecast{job="cpu-forecast"}

# Confidence bounds
ml_forecast_lower{job="cpu-forecast"}
ml_forecast_upper{job="cpu-forecast"}

# Alert: actual > upper bound plus a 10% margin (anomaly above forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
  > ml_forecast_upper{job="cpu-forecast"} * 1.1
```
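The alert condition above compares the observed value with the upper confidence bound scaled by 1.1. A minimal Python sketch of that comparison follows; it is an illustration only, not how Grafana evaluates the rule. The function name `is_above_forecast`, the `margin` parameter, and the sample values are all hypothetical.

```python
def is_above_forecast(actual: float, upper: float, margin: float = 1.1) -> bool:
    """Return True when the observed value exceeds the upper confidence
    bound scaled by `margin` (mirrors: actual > ml_forecast_upper * 1.1)."""
    return actual > upper * margin


# Hypothetical samples: (observed CPU rate, forecast upper bound).
samples = [(0.40, 0.50), (0.54, 0.50), (0.56, 0.50)]
breaches = [is_above_forecast(actual, upper) for actual, upper in samples]
print(breaches)
```

With these samples only the last one breaches: 0.54 is above the raw upper bound of 0.50 but still inside the 10% margin, which is the point of multiplying by 1.1.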

Outlier Detection

Detects when one series in a group deviates from its peers.
```bash
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/outlier \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "service-error-outliers",
    "metric": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
    "datasourceId": 1,
    "interval": 300,
    "algorithm": {
      "name": "dbscan",
      "sensitivity": 0.5,
      "config": { "epsilon": 0.5 }
    }
  }'
```

```promql
# Score > 0: series is an outlier (use in alert rule)
ml_outlier_score{job="service-error-outliers", service="checkout"}
```
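The job above names `dbscan` as its algorithm with an `epsilon` setting. The source does not describe how `ml_outlier_score` is derived, so the following is only a sketch of the general DBSCAN idea applied to one value per series: a series is treated as an outlier when it is neither in a dense neighborhood nor within `epsilon` of one. The function name, the `min_samples` default, and the sample error rates are hypothetical.

```python
from typing import Dict, List


def dbscan_outliers(values: Dict[str, float], epsilon: float, min_samples: int = 3) -> List[str]:
    """Return the names of series that DBSCAN would label as noise.

    A point is a core point when at least `min_samples` points (itself
    included) lie within `epsilon` of it. A point is noise when it is
    neither a core point nor within `epsilon` of a core point.
    """
    names = list(values)

    def neighbors(name: str) -> List[str]:
        return [other for other in names if abs(values[other] - values[name]) <= epsilon]

    core = {name for name in names if len(neighbors(name)) >= min_samples}
    outliers = []
    for name in names:
        if name in core:
            continue
        # Border points sit within epsilon of at least one core point.
        if not any(other in core for other in neighbors(name)):
            outliers.append(name)
    return outliers


# Hypothetical 5xx rates per service at a single evaluation time.
error_rates = {"checkout": 4.2, "cart": 0.3, "search": 0.4, "auth": 0.35, "payments": 0.5}
outliers = dbscan_outliers(error_rates, epsilon=0.5)
print(outliers)
```

Here four services cluster tightly and `checkout` sits far from all of them, so it is the only series reported.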

Alert Rules using ML

```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
          > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage significantly above forecast"

      - alert: ServiceErrorRateAnomaly
        expr: ml_outlier_score{job="service-error-outliers"} > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Anomalous error rate on {{ $labels.service }}"
```
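Both rules use `for: 5m`, which in Prometheus-style rules means the expression must stay true across evaluations for five minutes before the alert fires. The sketch below models that behavior in simplified form; the function name, the 60-second evaluation interval, and the sample evaluations are hypothetical, and a real rule evaluator may differ in details.

```python
from typing import List, Tuple


def alert_states(evaluations: List[Tuple[int, bool]], for_seconds: int) -> List[str]:
    """Map (timestamp, condition) evaluations to inactive/pending/firing."""
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 60 s; the condition drops once at t=120.
evals = [(0, True), (60, True), (120, False), (180, True), (240, True),
         (300, True), (360, True), (420, True), (480, True)]
states = alert_states(evals, for_seconds=300)
print(states)
```

The single false evaluation at t=120 restarts the timer, so the alert only fires at t=480, five minutes after the condition became true again at t=180.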

Sift (Automated Root Cause Analysis)

Free for all Grafana Cloud accounts. Automatically investigates incidents by correlating signals.
8 Analysis Types:

| Analysis | What it checks |
| --- | --- |
| Error Pattern Logs | Clusters log errors by pattern, ranks by frequency/recency |
| HTTP Error Series | Finds HTTP 4xx/5xx spikes correlated with incident window |
| Kube Crashes | OOMKills, pod restarts, evictions in K8s |
| Log Query | Custom LogQL query results correlated to incident time |
| Metric Query | Custom PromQL anomalies around incident window |
| Noisy Neighbors | Detects resource contention from co-located services |
| Recent Deployments | Correlates recent Helm/K8s deployments with incident start |
| Resource Contention | CPU throttling, memory pressure, disk I/O saturation |

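To make the Error Pattern Logs analysis listed above more concrete, here is a rough sketch of clustering log lines by pattern: variable tokens are masked so that similar lines collapse into one pattern, and patterns are ranked by how often they occur. This is not Sift's algorithm, which the source does not describe; the masking rules, function names, and log lines are hypothetical, and recency ranking is left out.

```python
import re
from collections import Counter
from typing import List, Tuple

# Replace variable-looking tokens with placeholders so similar lines share a pattern.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<hex>"),  # long hex IDs such as trace IDs
    (re.compile(r"\d+"), "<num>"),               # any remaining run of digits
]


def to_pattern(line: str) -> str:
    """Mask variable tokens in a log line."""
    for regex, placeholder in _MASKS:
        line = regex.sub(placeholder, line)
    return line


def rank_patterns(lines: List[str]) -> List[Tuple[str, int]]:
    """Group lines by masked pattern and rank by frequency, most common first."""
    return Counter(to_pattern(line) for line in lines).most_common()


# Hypothetical error log lines from an incident window.
logs = [
    "timeout after 3000 ms calling payments",
    "timeout after 4500 ms calling payments",
    "user 42 not found",
    "timeout after 120 ms calling payments",
]
ranked = rank_patterns(logs)
for pattern, count in ranked:
    print(count, pattern)
```

The three timeout lines differ only in their duration, so they collapse into a single pattern that ranks first.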
Trigger Sift from:
  • Explore → "Run Sift Investigation"
  • Dashboard panel → "Investigate with Sift"
  • Grafana Incident → "Run Sift" button
  • Command palette (Cmd+K) → "Start Sift investigation"
  • OnCall escalation chains → automatic trigger
```bash
# Trigger via API
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-latency-spike",
    "start": "2024-02-01T10:00:00Z",
    "end": "2024-02-01T10:30:00Z",
    "filters": {
      "service": "checkout",
      "namespace": "production"
    }
  }'
```
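The same request can be assembled from a script. The sketch below builds the POST request with the Python standard library and checks that the time window is ordered before anything is sent. The endpoint and body fields are taken from the curl call above; the helper name `build_investigation_request` and the window check are assumptions added for illustration, and the response format is not covered.

```python
import json
from datetime import datetime
from urllib.request import Request

SIFT_URL = (
    "https://yourstack.grafana.net/api/plugins/grafana-sift-app"
    "/resources/sift/v1/investigations"
)


def build_investigation_request(name, start, end, filters, token):
    """Build (but do not send) the POST request for a Sift investigation."""
    # Parse ISO-8601 timestamps; swap the 'Z' suffix for an explicit UTC offset
    # so that older Python versions can parse it too.
    start_dt = datetime.fromisoformat(start.replace("Z", "+00:00"))
    end_dt = datetime.fromisoformat(end.replace("Z", "+00:00"))
    if end_dt <= start_dt:
        raise ValueError("end must be after start")
    body = {"name": name, "start": start, "end": end, "filters": filters}
    return Request(
        SIFT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_investigation_request(
    "checkout-latency-spike",
    "2024-02-01T10:00:00Z",
    "2024-02-01T10:30:00Z",
    {"service": "checkout", "namespace": "production"},
    token="<token>",
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```

Building the body with `json.dumps` avoids hand-escaping quotes inside the payload, which is easy to get wrong in a shell one-liner.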

Knowledge Graph

Auto-discovers services, pods, nodes, and namespaces from metric labels and trace data. Updates every minute.
Access: Observability → Entity graph
Search syntax:
Show Service api-server
Show all services in namespace production
Show Pod frontend-abc123
RCA Workbench: Structured troubleshooting interface built on the knowledge graph — traces relationships between entities to identify blast radius and upstream causes.

LLM Plugin

Acts as an authenticated proxy for LLM provider API calls from Grafana panels and plugins.
Supported providers: OpenAI, Anthropic (Claude), Azure OpenAI, vLLM, Ollama, LiteLLM
Powered features: Flame graph interpretation, incident auto-summary, panel title generation, Sift log explanations, natural language panel descriptions.
Enable: Administration → Plugins → LLM Plugin → "Enable OpenAI/LLM access via Grafana"
```yaml
# provisioning/plugins/llm.yaml
apiVersion: 1
apps:
  - type: grafana-llm-app
    jsonData:
      # OpenAI
      openAIUrl: https://api.openai.com
      openAIModel: gpt-4o

      # Or Anthropic:
      # provider: anthropic
      # anthropicModel: claude-sonnet-4-6

      # Or Azure OpenAI:
      # openAIUrl: https://your-resource.openai.azure.com
      # azureModelMapping: '[["gpt-4o","your-deployment-name"]]'

    secureJsonData:
      openAIKey: sk-your-openai-key
```
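The `azureModelMapping` value in the config above is a string that reads as a JSON array of `[model, deployment]` pairs. Assuming that interpretation holds, a small sketch like the following can sanity-check the string before it goes into provisioning. The helper name `parse_model_mapping` is hypothetical and is not part of the plugin.

```python
import json
from typing import Dict


def parse_model_mapping(raw: str) -> Dict[str, str]:
    """Parse a JSON array of [model, deployment] pairs into a dict."""
    pairs = json.loads(raw)
    mapping = {}
    for pair in pairs:
        # Each entry is expected to be exactly [model, deployment].
        if not isinstance(pair, list) or len(pair) != 2:
            raise ValueError(f"expected [model, deployment], got {pair!r}")
        model, deployment = pair
        mapping[model] = deployment
    return mapping


mapping = parse_model_mapping('[["gpt-4o","your-deployment-name"]]')
print(mapping)
```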

Adaptive Metrics

Identifies unused metrics to reduce cardinality and storage costs.

Get aggregation recommendations


Aggregation rule (drops high-cardinality labels):
```yaml
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
  # Drops: pod, container, instance
```
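To see why dropping labels such as `pod` reduces cardinality, the sketch below groups series by the labels that are kept and sums their samples. It illustrates the idea only and is not how Adaptive Metrics is implemented. Summation is an assumption that suits counters, and the function name and sample series are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], float]  # (labels, sample value)


def aggregate(series: List[Series], keep_labels: List[str]) -> Dict[tuple, float]:
    """Sum samples that share the same values for `keep_labels`."""
    totals = defaultdict(float)
    for labels, value in series:
        # Labels outside keep_labels (for example pod) are ignored in the key.
        key = tuple((name, labels.get(name, "")) for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical samples: two pods reporting two status codes each.
series = [
    ({"method": "GET", "status": "200", "service": "checkout", "pod": "checkout-a"}, 3.0),
    ({"method": "GET", "status": "200", "service": "checkout", "pod": "checkout-b"}, 5.0),
    ({"method": "GET", "status": "500", "service": "checkout", "pod": "checkout-a"}, 1.0),
    ({"method": "GET", "status": "500", "service": "checkout", "pod": "checkout-b"}, 2.0),
]
aggregated = aggregate(series, ["method", "status", "service"])
print(len(series), "series ->", len(aggregated), "series")
```

Four input series collapse into two, one per status code, because the per-pod dimension no longer distinguishes them.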