ml-ai

Grafana Cloud AI & ML

Docs: https://grafana.com/docs/grafana-cloud/alerting-and-irm/machine-learning/


Grafana Assistant

Context-aware LLM sidebar agent (GA). Integrates with your Grafana Cloud stack.

Capabilities:

Convert natural language to PromQL/LogQL/TraceQL
Explain existing queries in plain English
Build and edit dashboards from descriptions
Investigate incidents (correlate metrics, logs, traces)
MCP server integration — connect external tools to Assistant
RBAC controls per organization
Slack integration for on-call workflows

Assistant Investigations (public preview): Multi-agent autonomous incident analysis mode — launches multiple specialized agents in parallel to investigate different signals.

Enable: Grafana Cloud → Administration → AI & LLM → Enable Grafana Assistant

In panel editor: Click the magic wand / "Assistant" icon to get query suggestions and explanations.


Dynamic Alerting

ML-based alerting without static thresholds.


Forecasting (Prophet model)


Trained on 90 days of history; learns daily and weekly seasonality patterns.

```bash
# Create forecast job
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/forecast \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cpu-forecast",
    "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
    "datasourceId": 1,
    "interval": 300,
    "trainingWindow": "90d",
    "forecastWindow": "7d",
    "algorithm": { "name": "prophet", "config": {} }
  }'
```
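The inner quotes in the PromQL selector have to be escaped inside the JSON body, which is easy to get wrong by hand. A minimal Python sketch, assuming the same field names and values as the request above, lets `json.dumps` do the escaping. The helper name `forecast_payload` is hypothetical, and the snippet only builds the body; it does not call the API.

```python
import json

def forecast_payload(name: str, metric: str, datasource_id: int = 1) -> str:
    """Serialize a forecast job body; json.dumps escapes the quotes in `metric`."""
    body = {
        # Field names and values are copied from the curl example.
        "name": name,
        "metric": metric,
        "datasourceId": datasource_id,
        "interval": 300,
        "trainingWindow": "90d",
        "forecastWindow": "7d",
        "algorithm": {"name": "prophet", "config": {}},
    }
    return json.dumps(body)

payload = forecast_payload(
    "cpu-forecast",
    'avg(rate(node_cpu_seconds_total{mode="user"}[5m]))',
)
print(payload)
```

The printed string can be passed to `curl -d` or to any HTTP client as the request body.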


Generated metric pairs for alert rules:

```promql
# Predicted value
ml_forecast{job="cpu-forecast"}

# Confidence bounds
ml_forecast_lower{job="cpu-forecast"}
ml_forecast_upper{job="cpu-forecast"}

# Alert: actual > upper bound (anomaly above forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
  > ml_forecast_upper{job="cpu-forecast"} * 1.1
```
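The alert expression compares the observed value with the forecast's upper bound scaled by 1.1, so a value has to exceed the bound by more than 10% before it counts as anomalous. A small Python sketch of that comparison, using made-up sample values and a hypothetical helper name `above_forecast`:

```python
def above_forecast(actual: float, upper: float, margin: float = 1.1) -> bool:
    """True when the observed value exceeds the upper bound scaled by `margin`."""
    return actual > upper * margin

# Hypothetical samples: observed CPU rate vs. forecast upper bound of 0.60.
print(above_forecast(0.72, 0.60))  # 0.72 is above 0.60 * 1.1
print(above_forecast(0.64, 0.60))  # 0.64 is above the bound but inside the 10% margin
```

Raising `margin` trades sensitivity for fewer alerts on values that only slightly overshoot the forecast band.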

Outlier Detection


Detects when one series in a group deviates from its peers.

```bash
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/outlier \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "service-error-outliers",
    "metric": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
    "datasourceId": 1,
    "interval": 300,
    "algorithm": {
      "name": "dbscan",
      "sensitivity": 0.5,
      "config": { "epsilon": 0.5 }
    }
  }'
```

```promql
# Score > 0: series is an outlier (use in alert rule)
ml_outlier_score{job="service-error-outliers", service="checkout"}
```
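To make "deviates from its peers" concrete, here is a simplified one-dimensional DBSCAN-style check in Python. It is an illustration of the general technique, not Grafana's implementation: the `min_samples` parameter, the helper name `dbscan_noise`, and the sample error rates are all assumptions. A series counts as noise when it is neither a core point nor within `eps` of one.

```python
def dbscan_noise(values, eps=0.5, min_samples=3):
    """Return the names DBSCAN would label as noise for 1-D values."""
    def neighbors(name):
        # Neighbors include the point itself, as in standard DBSCAN.
        return [other for other in values if abs(values[other] - values[name]) <= eps]

    core = {name for name in values if len(neighbors(name)) >= min_samples}
    return sorted(
        name for name in values
        if name not in core and not any(n in core for n in neighbors(name))
    )

# Hypothetical per-service 5xx rates at one evaluation step.
error_rates = {"cart": 0.10, "search": 0.12, "auth": 0.09, "checkout": 2.40}
print(dbscan_noise(error_rates))
```

With these numbers the three low-rate services form one dense cluster and `checkout` is the only series left outside it.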

Alert Rules using ML


```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
          > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage significantly above forecast"

      - alert: ServiceErrorRateAnomaly
        expr: ml_outlier_score{job="service-error-outliers"} > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Anomalous error rate on {{ $labels.service }}"
```
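Both rules set `for: 5m`, which in Prometheus-style rules means the expression must stay true for five minutes before the alert fires. The Python sketch below models that hold period in a simplified way. The function name `fires` and the sample timestamps are hypothetical, and a real rule evaluator tracks more state than this.

```python
from datetime import datetime, timedelta

def fires(samples, hold=timedelta(minutes=5)):
    """Return True once the condition has held without interruption for `hold`.

    `samples` is a time-ordered list of (timestamp, condition_is_true) pairs.
    """
    pending_since = None
    for ts, ok in samples:
        if not ok:
            pending_since = None      # any false evaluation resets the timer
        elif pending_since is None:
            pending_since = ts        # condition just became true
        if pending_since is not None and ts - pending_since >= hold:
            return True
    return False

t0 = datetime(2024, 2, 1, 10, 0)
minute = timedelta(minutes=1)
steady = [(t0 + i * minute, True) for i in range(6)]      # true from 10:00 to 10:05
flappy = [(t0 + i * minute, i != 3) for i in range(6)]    # dips once at 10:03
print(fires(steady), fires(flappy))
```

The single dip in `flappy` resets the timer, which is why a short hold period helps suppress alerts on brief spikes.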

Sift (Automated Root Cause Analysis)


Free for all Grafana Cloud accounts. Automatically investigates incidents by correlating signals.

8 Analysis Types:

| Analysis | What it checks |
| --- | --- |
| Error Pattern Logs | Clusters log errors by pattern, ranks by frequency/recency |
| HTTP Error Series | Finds HTTP 4xx/5xx spikes correlated with incident window |
| Kube Crashes | OOMKills, pod restarts, evictions in K8s |
| Log Query | Custom LogQL query results correlated to incident time |
| Metric Query | Custom PromQL anomalies around incident window |
| Noisy Neighbors | Detects resource contention from co-located services |
| Recent Deployments | Correlates recent Helm/K8s deployments with incident start |
| Resource Contention | CPU throttling, memory pressure, disk I/O saturation |
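As an illustration of what "clusters log errors by pattern" can mean, the Python sketch below masks variable tokens (numbers and hex values) so that similar lines collapse into one pattern, then ranks patterns by frequency. This is a generic technique, not Sift's actual algorithm. The function names and log lines are hypothetical, and ranking by recency is left out.

```python
import re
from collections import Counter

def log_pattern(line: str) -> str:
    """Collapse variable tokens so similar errors share one pattern."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    return re.sub(r"\d+", "<num>", line)

def rank_patterns(lines):
    """Return (pattern, count) pairs, most frequent first."""
    return Counter(log_pattern(line) for line in lines).most_common()

# Hypothetical error lines from an incident window.
logs = [
    "timeout after 3000 ms calling payments",
    "timeout after 4500 ms calling payments",
    "connection refused on port 5432",
]
for pattern, count in rank_patterns(logs):
    print(count, pattern)
```

The two timeout lines differ only in their duration, so they land in the same pattern and rank first.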

Trigger Sift from:

Explore → "Run Sift Investigation"
Dashboard panel → "Investigate with Sift"
Grafana Incident → "Run Sift" button
Command palette (`Cmd+K`) → "Start Sift investigation"
OnCall escalation chains → automatic trigger

```bash
# Trigger via API
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-latency-spike",
    "start": "2024-02-01T10:00:00Z",
    "end": "2024-02-01T10:30:00Z",
    "filters": { "service": "checkout", "namespace": "production" }
  }'
```
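The investigation body needs a start and end timestamp in UTC. A minimal Python sketch, assuming the same field names as the request above, derives the window from an incident start time and a duration. The helper name `investigation_body` is hypothetical, and the snippet only builds the body; it does not call the API.

```python
import json
from datetime import datetime, timedelta, timezone

def investigation_body(name, incident_start, minutes, filters):
    """Build the request body with an ISO 8601 UTC window of `minutes` length."""
    start = incident_start.astimezone(timezone.utc)
    end = start + timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "name": name,
        "start": start.strftime(fmt),
        "end": end.strftime(fmt),
        "filters": filters,
    }

body = investigation_body(
    "checkout-latency-spike",
    datetime(2024, 2, 1, 10, 0, tzinfo=timezone.utc),
    30,
    {"service": "checkout", "namespace": "production"},
)
print(json.dumps(body, indent=2))
```

Passing a timezone-aware datetime keeps the conversion explicit, so a local incident time is normalized to UTC before formatting.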

Knowledge Graph

Auto-discovers services, pods, nodes, and namespaces from metric labels and trace data. Updates every minute.

Access: Observability → Entity graph

Search syntax:

Show Service api-server
Show all services in namespace production
Show Pod frontend-abc123

RCA Workbench: Structured troubleshooting interface built on the knowledge graph — traces relationships between entities to identify blast radius and upstream causes.
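The idea of tracing relationships to find blast radius and likely causes can be sketched as a graph walk. The Python example below uses a made-up service graph and a hypothetical helper `reachable`. It does not reflect how the knowledge graph is stored or queried. Following the edges from callee to caller gives the services affected by a failure, and following them from caller to callee gives the dependencies that might have caused it.

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first walk returning every entity reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical "calls" edges: caller -> callees.
calls = {
    "frontend": ["api-server"],
    "api-server": ["checkout", "auth"],
    "checkout": ["payments-db"],
}
# Reverse edges: callee -> callers.
called_by = {}
for caller, callees in calls.items():
    for callee in callees:
        called_by.setdefault(callee, []).append(caller)

print(sorted(reachable(called_by, "checkout")))  # services affected if checkout fails
print(sorted(reachable(calls, "checkout")))      # dependencies that could be the cause
```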


LLM Plugin


Acts as an authenticated proxy for LLM provider API calls from Grafana panels and plugins.

Supported providers: OpenAI, Anthropic (Claude), Azure OpenAI, vLLM, Ollama, LiteLLM

Powered features: Flame graph interpretation, incident auto-summary, panel title generation, Sift log explanations, natural language panel descriptions.

Enable: Administration → Plugins → LLM Plugin → "Enable OpenAI/LLM access via Grafana"

```yaml
# provisioning/plugins/llm.yaml
apiVersion: 1
apps:
  - type: grafana-llm-app
    jsonData:
      # OpenAI
      openAIUrl: https://api.openai.com
      openAIModel: gpt-4o
      # Or Anthropic:
      # provider: anthropic
      # anthropicModel: claude-sonnet-4-6
      # Or Azure OpenAI:
      # openAIUrl: https://your-resource.openai.azure.com
      # azureModelMapping: '[["gpt-4o","your-deployment-name"]]'
    secureJsonData:
      openAIKey: sk-your-openai-key
```

Adaptive Metrics

Identifies unused metrics to reduce cardinality and storage costs.

```bash
# Get aggregation recommendations
curl https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/recommendations \
  -H "Authorization: Bearer <token>"
```

Aggregation rule (drops high-cardinality labels):

```yaml
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
  # Drops: pod, container, instance
```
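To see why dropping labels reduces cardinality, the Python sketch below groups series by the kept labels (`method`, `status`, `service`) and sums their values. The sample series are invented, the helper name `aggregate` is hypothetical, and summing is only one possible aggregation, chosen here for illustration.

```python
from collections import defaultdict

KEEP = ("method", "status", "service")

def aggregate(series, keep=KEEP):
    """Sum sample values across series that share the kept labels."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((k, labels[k]) for k in keep)
        totals[key] += value
    return dict(totals)

# Hypothetical samples: two pods serve the same method/status/service.
series = [
    ({"method": "GET", "status": "200", "service": "checkout", "pod": "checkout-a"}, 12.0),
    ({"method": "GET", "status": "200", "service": "checkout", "pod": "checkout-b"}, 8.0),
    ({"method": "POST", "status": "500", "service": "checkout", "pod": "checkout-a"}, 1.0),
]
result = aggregate(series)
print(len(series), "->", len(result))
```

Each extra pod multiplies the number of stored series, while the aggregated view grows only with distinct combinations of the kept labels.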
