cost-management


Grafana Cloud Cost Management

Cost Management & Billing Application

Access: My Account → Cost Management (or within your Grafana Cloud stack)
FOCUS-compliant (FinOps Open Cost and Usage Specification) billing dashboards showing:
  • Spending by signal type (metrics, logs, traces, profiles)
  • Month-over-month trends
  • Usage vs. quota tracking
  • Invoice download

Cost Attribution by Label

Tag your telemetry at ingestion to enable per-team cost reporting:
```alloy
// Add cost attribution labels in Alloy
prometheus.remote_write "cloud" {
  endpoint {
    url = sys.env("PROMETHEUS_URL")
    basic_auth {
      username = sys.env("PROM_USER")
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
  external_labels = {
    team    = "platform",
    project = "checkout-service",
    env     = "production",
  }
}

loki.write "cloud" {
  endpoint {
    url = sys.env("LOKI_URL")
    basic_auth {
      username = sys.env("LOKI_USER")
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
  external_labels = {
    team    = "platform",
    project = "checkout-service",
  }
}
```
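With `team` and `project` applied as external labels, per-team usage can be queried directly for chargeback reports. A minimal sketch (the label names match the config above; authoritative billing splits come from Cost Management itself):

```promql
# Active series per team (assumes the `team` external label above)
count by (team) ({__name__=~".+"})

# Drill into one project (label value from the example config above)
count by (team) ({project="checkout-service"})
```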

Usage Alerts

Set alerts before you hit quota or budget thresholds:
```yaml
# Alert when approaching metrics quota
groups:
  - name: grafana-cloud-usage
    rules:
      - alert: MetricsUsageHigh
        expr: grafana_cloud_metrics_active_series / grafana_cloud_metrics_limit > 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Grafana Cloud metrics usage >80% of quota"
      - alert: LogsIngestionHigh
        expr: increase(grafana_cloud_logs_bytes_ingested_total[24h]) > 50e9  # 50GB/day
        labels:
          severity: warning
        annotations:
          summary: "Grafana Cloud log ingestion >50GB today"
```

Adaptive Metrics (Reduce Cardinality)

Automatically identifies unused or high-cardinality metrics and generates aggregation rules.
View the recommendations, then apply aggregation rules. An example rule that drops high-cardinality labels from a metric:

```yaml
- match: "^http_request_duration_seconds.*"
  action: keep
  match_labels:
    - method
    - status_code
    - service
```

Drops `pod`, `container`, `instance`, `node`, reducing the series count from 10k → 50.


**Workflow:**
1. Go to **Grafana Cloud → Adaptive Metrics**
2. Review recommended aggregation rules (sorted by series reduction impact)
3. Test rules in "Preview" mode before applying
4. Apply rules — takes effect within 5 minutes
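A rule's impact can be estimated before applying it by comparing the current series count with the number of unique combinations of the kept labels. A sketch against the example metric above:

```promql
# Series currently produced by the metric family
count({__name__=~"http_request_duration_seconds.*"})

# Series that would remain after aggregating to the kept labels
count(count by (method, status_code, service) ({__name__=~"http_request_duration_seconds.*"}))
```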

Adaptive Logs (Reduce Log Volume)

Drop or sample log lines before ingestion using Loki's pipeline stages in Alloy:
```alloy
loki.process "filter_logs" {
  forward_to = [loki.write.cloud.receiver]

  // Drop health check logs (high volume, low value)
  stage.drop {
    expression = ".*GET /health.*"
  }

  // Drop debug logs in production
  stage.drop {
    source     = "level"
    expression = "debug"
  }

  // Sample verbose info logs (keep 10%). stage.sampling only takes a
  // rate, so scope it with stage.match (assumes a `level` label exists)
  stage.match {
    selector = "{level=\"info\"}"
    stage.sampling {
      rate = 0.1
    }
  }
}
```
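Before writing drop rules, it helps to know which streams dominate ingest volume. A LogQL sketch for finding candidates (the `namespace` and `app` label names are assumptions, matching the cost-monitoring queries later in this doc):

```logql
# Top 10 log producers by bytes over the last hour
topk(10, sum by (namespace, app) (bytes_over_time({namespace=~".+"}[1h])))
```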

Adaptive Traces (Reduce Trace Volume)

Use Alloy tail-based sampling to keep only important traces:
```alloy
otelcol.processor.tail_sampling "cost_control" {
  decision_wait = "10s"
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code { status_codes = ["ERROR"] }
  }
  policy {
    name = "keep-slow"
    type = "latency"
    latency { threshold_ms = 1000 }
  }
  policy {
    name = "sample-rest"
    type = "probabilistic"
    probabilistic { sampling_percentage = 5 }
  }
  output {
    traces = [otelcol.exporter.otlp.cloud.input]
  }
}
```

Key Metrics for Cost Monitoring

```promql
# Active metric series (the billed unit for metrics)
grafana_cloud_metrics_active_series

# Series count by metric name (find high-cardinality sources)
topk(20, count by (__name__) ({__name__=~".+"}))

# Log bytes ingested per stream
sum(increase(loki_ingester_chunk_size_bytes_sum[24h])) by (namespace, app)

# Trace spans ingested
rate(tempo_distributor_spans_received_total[5m])
```

Optimization Checklist

  • Run Adaptive Metrics recommendations (typically reduces series 40-60%)
  • Drop health/readiness probe logs in the Alloy pipeline
  • Set a sampling rate for traces (5-10% is typical for most workloads)
  • Review the top-N high-cardinality metrics: `topk(20, count by (__name__) ({__name__=~".+"}))`
  • Add cost attribution labels (`team`, `project`) to all Alloy configs
  • Set usage alerts at 80% of quota
  • Review and clean up unused dashboards and data sources (they don't reduce cost but indicate stale collection)
  • Use recording rules to pre-aggregate expensive PromQL queries
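The recording-rule item can look like the following. A minimal sketch; the group, rule, and metric names are hypothetical examples:

```yaml
groups:
  - name: cost-precompute
    rules:
      # Evaluate the expensive rate() once per interval instead of on
      # every dashboard load (metric name is a hypothetical example)
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_request_duration_seconds_count[5m]))
```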

Understanding Grafana Cloud Pricing

| Signal | Billing Unit |
| --- | --- |
| Metrics | Active series (unique label combinations) |
| Logs | Bytes ingested |
| Traces | Spans ingested |
| Profiles | Bytes ingested |
| Synthetic Monitoring | Check executions |
| k6 | VUh (Virtual User hours) |