cost-management
Grafana Cloud Cost Management
Cost Management & Billing Application
Access: My Account → Cost Management (or within your Grafana Cloud stack)
FOCUS-compliant (FinOps Open Cost and Usage Specification) billing dashboards showing:
- Spending by signal type (metrics, logs, traces, profiles)
- Month-over-month trends
- Usage vs. quota tracking
- Invoice download
Cost Attribution by Label
Tag your telemetry at ingestion to enable per-team cost reporting:
```alloy
// Add cost attribution labels in Alloy
prometheus.remote_write "cloud" {
  endpoint {
    url = sys.env("PROMETHEUS_URL")
    basic_auth {
      username = sys.env("PROM_USER")
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
  external_labels = {
    team    = "platform",
    project = "checkout-service",
    env     = "production",
  }
}

loki.write "cloud" {
  endpoint {
    url = sys.env("LOKI_URL")
    basic_auth {
      username = sys.env("LOKI_USER")
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
  external_labels = {
    team    = "platform",
    project = "checkout-service",
  }
}
```
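Once a label like `team` rides along with every series, per-team attribution is just a group-by over usage records. A minimal Python sketch of the idea; the usage records and the `series_by_team` helper are invented for illustration, not Grafana Cloud API output:

```python
from collections import defaultdict

# Hypothetical usage records: (labels, active_series) pairs, as if exported
# from a billing/usage endpoint. Real Grafana Cloud data will look different.
usage = [
    ({"team": "platform", "project": "checkout-service"}, 12_000),
    ({"team": "platform", "project": "cart-service"}, 8_000),
    ({"team": "data", "project": "etl"}, 30_000),
]

def series_by_team(records):
    """Sum active series per `team` label, mirroring a count by (team) query."""
    totals = defaultdict(int)
    for labels, series in records:
        totals[labels.get("team", "unattributed")] += series
    return dict(totals)

print(series_by_team(usage))  # {'platform': 20000, 'data': 30000}
```

Series without a `team` label fall into an "unattributed" bucket, which is itself a useful signal that some pipeline is missing its labels.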
Usage Alerts
Set alerts before you hit quota or budget thresholds:
```yaml
# Alert when approaching metrics quota
groups:
  - name: grafana-cloud-usage
    rules:
      - alert: MetricsUsageHigh
        expr: grafana_cloud_metrics_active_series / grafana_cloud_metrics_limit > 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Grafana Cloud metrics usage >80% of quota"
      - alert: LogsIngestionHigh
        expr: increase(grafana_cloud_logs_bytes_ingested_total[24h]) > 50e9  # 50GB/day
        labels:
          severity: warning
        annotations:
          summary: "Grafana Cloud log ingestion >50GB today"
```
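The 0.8 threshold in the first rule is plain arithmetic over two gauges. A small Python sketch of the same decision (the series counts are made-up numbers, and the rule's sustained-for-1h behavior is omitted):

```python
def usage_alert(active_series: int, limit: int, threshold: float = 0.8) -> bool:
    """Fire when usage exceeds the given fraction of quota,
    like the MetricsUsageHigh rule above."""
    if limit <= 0:
        raise ValueError("quota limit must be positive")
    return active_series / limit > threshold

print(usage_alert(850_000, 1_000_000))  # True: at 85% of quota
print(usage_alert(500_000, 1_000_000))  # False: comfortably under
```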
Adaptive Metrics (Reduce Cardinality)
Automatically identifies unused or high-cardinality metrics and generates aggregation rules.
```bash
# View recommendations
curl https://yourstack.grafana.net/api/plugins/grafana-adaptive-metrics-app/resources/v1/recommendations \
  -H "Authorization: Bearer <token>"
```
```yaml
# Apply aggregation rule: drops high-cardinality labels from a metric
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels:
    - method
    - status_code
    - service
```

Drops `pod`, `container`, `instance`, and `node`, reducing series from 10k to 50.
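Why dropping labels shrinks the bill: an active series is one unique label combination, so the series count is the product of the kept labels' cardinalities. A toy Python illustration with invented label values:

```python
def series_count(label_values, keep=None):
    """Count unique label combinations, optionally after dropping labels.

    Active series = product of each kept label's number of distinct values.
    """
    kept = {k: v for k, v in label_values.items() if keep is None or k in keep}
    n = 1
    for values in kept.values():
        n *= len(values)
    return n

# Invented label values for one metric; every combination is a billed series.
labels = {
    "method": ["GET", "POST"],
    "status_code": ["200", "500"],
    "service": ["checkout"],
    "pod": [f"pod-{i}" for i in range(50)],
    "instance": [f"10.0.0.{i}" for i in range(10)],
}

print(series_count(labels))  # 2000 series before aggregation
print(series_count(labels, keep={"method", "status_code", "service"}))  # 4 after
```

The 2 pod-scoped labels contribute a factor of 500; dropping them is where almost all of the reduction comes from.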
**Workflow:**
1. Go to **Grafana Cloud → Adaptive Metrics**
2. Review recommended aggregation rules (sorted by series reduction impact)
3. Test rules in "Preview" mode before applying
4. Apply rules (takes effect within 5 minutes)

Adaptive Logs (Reduce Log Volume)
Drop or sample log lines before ingestion using Loki's pipeline stages in Alloy:
```alloy
loki.process "filter_logs" {
  forward_to = [loki.write.cloud.receiver]

  // Drop health check logs (high volume, low value)
  stage.drop {
    expression = ".*GET /health.*"
  }

  // Drop debug logs in production
  stage.drop {
    source     = "level"
    expression = "debug"
  }

  // Sample verbose info logs (keep 10%)
  stage.match {
    selector = "{level=\"info\"}"
    stage.sampling {
      rate = 0.1
    }
  }
}
```
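The same drop-then-sample pipeline can be sketched as a plain filter. This is a toy Python model of the logic, not the Alloy implementation; the `level=` log format and the seed are arbitrary choices for the demo:

```python
import random
import re

def filter_logs(lines, sample_rate=0.1, seed=42):
    """Drop health checks and debug lines; keep ~sample_rate of info lines."""
    rng = random.Random(seed)  # seeded so the demo is reproducible
    kept = []
    for line in lines:
        if re.search(r"GET /health", line):
            continue  # drop health checks outright
        if "level=debug" in line:
            continue  # drop debug noise
        if "level=info" in line and rng.random() >= sample_rate:
            continue  # probabilistic sampling of info lines
        kept.append(line)
    return kept

logs = (["level=info GET /health 200"] * 5
        + ["level=debug cache miss"] * 5
        + ["level=error payment failed"] * 2
        + ["level=info order placed"] * 100)
out = filter_logs(logs)
print(len(out))  # both errors survive, plus roughly 10 of the 100 info lines
```

Note the ordering matters: health checks are dropped before sampling, so they never consume part of the sampled budget.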
Adaptive Traces (Reduce Trace Volume)
Use Alloy tail-based sampling to keep only important traces:
```alloy
otelcol.processor.tail_sampling "cost_control" {
  decision_wait = "10s"

  policy {
    name = "keep-errors"
    type = "status_code"
    status_code { status_codes = ["ERROR"] }
  }
  policy {
    name = "keep-slow"
    type = "latency"
    latency { threshold_ms = 1000 }
  }
  policy {
    name = "sample-rest"
    type = "probabilistic"
    probabilistic { sampling_percentage = 5 }
  }

  output {
    traces = [otelcol.exporter.otlp.cloud.input]
  }
}
```
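The three policies compose into a simple per-trace decision: keep every error, keep every slow trace, and roll a die for the rest. A Python sketch of that decision; the `Trace` shape is invented, and only the 1000 ms threshold and 5% rate come from the config above:

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:
    status: str        # "OK" or "ERROR"
    duration_ms: float

def keep_trace(trace: Trace, rng: random.Random,
               latency_threshold_ms: float = 1000,
               sample_pct: float = 5.0) -> bool:
    """Tail-sampling decision: errors and slow traces are always kept,
    everything else is sampled probabilistically."""
    if trace.status == "ERROR":
        return True
    if trace.duration_ms >= latency_threshold_ms:
        return True
    return rng.random() < sample_pct / 100.0

rng = random.Random(0)  # seeded for a reproducible demo
print(keep_trace(Trace("ERROR", 12), rng))   # True: errors are always kept
print(keep_trace(Trace("OK", 2500), rng))    # True: slower than 1000 ms
```

Because the decision needs the whole trace (final status, total latency), it must run after all spans arrive, which is what `decision_wait` buys in the real processor.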
Key Metrics for Cost Monitoring
```promql
# Active metric series (billed unit for metrics)
grafana_cloud_metrics_active_series

# Series by metric name (find high-cardinality sources)
topk(20, count by (__name__) ({__name__=~".+"}))

# Log bytes ingested per stream
sum(increase(loki_ingester_chunk_size_bytes_sum[24h])) by (namespace, app)

# Trace spans ingested
rate(tempo_distributor_spans_received_total[5m])
```
Optimization Checklist
- Run Adaptive Metrics recommendations (typically reduces series 40-60%)
- Drop health/readiness probe logs in the Alloy pipeline
- Set a sampling rate for traces (5-10% is typical for most workloads)
- Review top-N high-cardinality metrics: `topk(20, count by (__name__) ({__name__=~".+"}))`
- Add cost attribution labels (`team`, `project`) to all Alloy configs
- Set usage alerts at 80% of quota
- Review and clean up unused dashboards and data sources (they don't reduce cost but indicate stale collection)
- Use recording rules to pre-aggregate expensive PromQL queries
Understanding Grafana Cloud Pricing
| Signal | Billing Unit |
|---|---|
| Metrics | Active series (unique label combinations) |
| Logs | Bytes ingested |
| Traces | Spans ingested |
| Profiles | Bytes ingested |
| Synthetic Monitoring | Check executions |
| k6 | VUh (Virtual User hours) |
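To reason about a bill, multiply each billable unit in the table by its rate. A back-of-the-envelope estimator; the rates in `PRICES` are placeholder numbers invented for this sketch, not Grafana Cloud's actual pricing:

```python
# Placeholder per-unit monthly rates: invented numbers, NOT real Grafana Cloud
# pricing. Substitute the rates from your contract or the current pricing page.
PRICES = {
    "metrics_per_1k_series": 8.00,    # per 1,000 active series
    "logs_per_gb": 0.50,              # per GB ingested
    "traces_per_million_spans": 0.30, # per million spans ingested
}

def estimate_monthly_cost(active_series: int, log_gb: float, spans: int) -> float:
    """Rough monthly estimate from the billing units in the table above."""
    cost = (active_series / 1_000) * PRICES["metrics_per_1k_series"]
    cost += log_gb * PRICES["logs_per_gb"]
    cost += (spans / 1_000_000) * PRICES["traces_per_million_spans"]
    return round(cost, 2)

print(estimate_monthly_cost(active_series=50_000, log_gb=200, spans=40_000_000))
# 50 * 8.00 + 200 * 0.50 + 40 * 0.30 = 400 + 100 + 12 = 512.0
```

An estimator like this makes the optimization checklist concrete: a 50% series reduction from Adaptive Metrics cuts the metrics term in half before touching logs or traces.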