ml-ai
Grafana Cloud AI & ML
Grafana Assistant
Context-aware LLM sidebar agent (GA). Integrates with your Grafana Cloud stack.
Capabilities:
- Convert natural language to PromQL/LogQL/TraceQL
- Explain existing queries in plain English
- Build and edit dashboards from descriptions
- Investigate incidents (correlate metrics, logs, traces)
- MCP server integration — connect external tools to Assistant
- RBAC controls per organization
- Slack integration for on-call workflows
Assistant Investigations (public preview): Multi-agent autonomous incident analysis mode — launches multiple specialized agents in parallel to investigate different signals.
Enable: Grafana Cloud → Administration → AI & LLM → Enable Grafana Assistant
In panel editor: Click the magic wand / "Assistant" icon to get query suggestions and explanations.
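
As an illustration of the natural-language-to-query capability, a prompt like "p99 latency of the checkout service" might be translated into PromQL along these lines (the metric and label names are hypothetical, chosen only for the example — actual output depends on the datasources in your stack):

```promql
# Hypothetical Assistant output for: "p99 latency of the checkout service"
# (metric/label names are illustrative, not from this document)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
)
```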
Dynamic Alerting
ML-based alerting without static thresholds.
Forecasting (Prophet model)
Trained on 90 days of history; learns daily and weekly seasonality patterns.
```bash
# Create forecast job
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/forecast \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "cpu-forecast",
    "metric": "avg(rate(node_cpu_seconds_total{mode=\"user\"}[5m]))",
    "datasourceId": 1,
    "interval": 300,
    "trainingWindow": "90d",
    "forecastWindow": "7d",
    "algorithm": { "name": "prophet", "config": {} }
  }'
```
Generated metric pairs for alert rules:
```promql
# Predicted value
ml_forecast{job="cpu-forecast"}

# Confidence bounds
ml_forecast_lower{job="cpu-forecast"}
ml_forecast_upper{job="cpu-forecast"}

# Alert: actual > upper bound (anomaly above forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
  > ml_forecast_upper{job="cpu-forecast"} * 1.1
```
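
The lower bound works symmetrically. A sketch of an alert expression for values dropping below the forecast band (the 0.9 margin here is illustrative, mirroring the 1.1 margin above):

```promql
# Alert: actual < lower bound (anomaly below forecast)
avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
  < ml_forecast_lower{job="cpu-forecast"} * 0.9
```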
Outlier Detection
Detects when one series in a group deviates from its peers.
```bash
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-ml-app/resources/ml/v1/outlier \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "service-error-outliers",
    "metric": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
    "datasourceId": 1,
    "interval": 300,
    "algorithm": {
      "name": "dbscan",
      "sensitivity": 0.5,
      "config": { "epsilon": 0.5 }
    }
  }'
```
```promql
# Score > 0: series is an outlier (use in alert rule)
ml_outlier_score{job="service-error-outliers", service="checkout"}
```
Alert Rules using ML
```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: CPUAboveForecast
        expr: |
          avg(rate(node_cpu_seconds_total{mode="user"}[5m]))
            > ml_forecast_upper{job="cpu-forecast"} * 1.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage significantly above forecast"
      - alert: ServiceErrorRateAnomaly
        expr: ml_outlier_score{job="service-error-outliers"} > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Anomalous error rate on {{ $labels.service }}"
```
Sift (Automated Root Cause Analysis)
Free for all Grafana Cloud accounts. Automatically investigates incidents by correlating signals.
8 Analysis Types:
| Analysis | What it checks |
|---|---|
| Error Pattern Logs | Clusters log errors by pattern, ranks by frequency/recency |
| HTTP Error Series | Finds HTTP 4xx/5xx spikes correlated with incident window |
| Kube Crashes | OOMKills, pod restarts, evictions in K8s |
| Log Query | Custom LogQL query results correlated to incident time |
| Metric Query | Custom PromQL anomalies around incident window |
| Noisy Neighbors | Detects resource contention from co-located services |
| Recent Deployments | Correlates recent Helm/K8s deployments with incident start |
| Resource Contention | CPU throttling, memory pressure, disk I/O saturation |
Trigger Sift from:
- Explore → "Run Sift Investigation"
- Dashboard panel → "Investigate with Sift"
- Grafana Incident → "Run Sift" button
- Command palette (Cmd+K) → "Start Sift investigation"
- OnCall escalation chains → automatic trigger
```bash
# Trigger via API
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-latency-spike",
    "start": "2024-02-01T10:00:00Z",
    "end": "2024-02-01T10:30:00Z",
    "filters": { "service": "checkout", "namespace": "production" }
  }'
```
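
Assuming the API follows REST conventions, a created investigation could later be fetched by ID from the same collection path. This GET endpoint is an assumption, not confirmed by this document — check the Sift API reference for the exact path:

```bash
# Hypothetical: fetch investigation results by ID (endpoint path assumed)
curl https://yourstack.grafana.net/api/plugins/grafana-sift-app/resources/sift/v1/investigations/<investigation-id> \
  -H "Authorization: Bearer <token>"
```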
Knowledge Graph
Auto-discovers services, pods, nodes, and namespaces from metric labels and trace data. Updates every minute.
Access: Observability → Entity graph
Search syntax:
Show Service api-server
Show all services in namespace production
Show Pod frontend-abc123

RCA Workbench: Structured troubleshooting interface built on the knowledge graph — traces relationships between entities to identify blast radius and upstream causes.
LLM Plugin
Acts as an authenticated proxy for LLM provider API calls from Grafana panels and plugins.
Supported providers: OpenAI, Anthropic (Claude), Azure OpenAI, vLLM, Ollama, LiteLLM
Powered features: Flame graph interpretation, incident auto-summary, panel title generation, Sift log explanations, natural language panel descriptions.
Enable: Administration → Plugins → LLM Plugin → "Enable OpenAI/LLM access via Grafana"
```yaml
# provisioning/plugins/llm.yaml
apiVersion: 1
apps:
  - type: grafana-llm-app
    jsonData:
      # OpenAI
      openAIUrl: https://api.openai.com
      openAIModel: gpt-4o
      # Or Anthropic:
      # provider: anthropic
      # anthropicModel: claude-sonnet-4-6
      # Or Azure OpenAI:
      # openAIUrl: https://your-resource.openai.azure.com
      # azureModelMapping: '[["gpt-4o","your-deployment-name"]]'
    secureJsonData:
      openAIKey: sk-your-openai-key
```
Adaptive Metrics
Identifies unused metrics to reduce cardinality and storage costs.
```bash
# Get aggregation recommendations
curl https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/recommendations \
  -H "Authorization: Bearer <token>"
```
Aggregation rule (drops high-cardinality labels):
```yaml
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels: [method, status, service]
  # Drops: pod, container, instance
```
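
Recommendations can then be applied as aggregation rules. A hypothetical upload call, assuming a sibling rules endpoint under the same plugin path (the endpoint name is an assumption — consult the Adaptive Metrics API reference for the exact path and method):

```bash
# Hypothetical: apply aggregation rules (endpoint path assumed, not confirmed here)
curl -X POST https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/rules \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d @rules.json
```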