# dpm-finder

A Grafana Professional Services tool for identifying which Prometheus metrics drive high Data Points per Minute (DPM). Analyzes metric-level DPM with per-label breakdown to help optimize Grafana Cloud costs.

## Quick Start


### Prerequisites


- Python 3.9+
- Access to a Grafana Cloud Prometheus endpoint (or any Prometheus-compatible API)

### Setup

1. Clone the repo and create a virtual environment:

   ```bash
   git clone https://github.com/grafana-ps/dpm-finder.git
   cd dpm-finder
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Configure credentials by copying `.env_example` to `.env` and filling in values:
   - `PROMETHEUS_ENDPOINT` -- the Prometheus endpoint URL (must end in `.net`, nothing after)
   - `PROMETHEUS_USERNAME` -- tenant ID / stack ID (numeric)
   - `PROMETHEUS_API_KEY` -- Grafana Cloud API key (`glc_...` format)
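A filled-in `.env` might look like this (all values are placeholders, not real credentials; the slug, stack ID, and key shown are made up):

```bash
PROMETHEUS_ENDPOINT=https://prometheus-prod-01-example.grafana.net
PROMETHEUS_USERNAME=123456
PROMETHEUS_API_KEY=glc_example_key_not_real
```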

## Stack Discovery with gcx

If gcx is available, use it to find stack details:

```bash
gcx config check              # Show active stack context
gcx config list-contexts      # List all configured stacks
gcx config view               # Full config with endpoints
```

The Prometheus endpoint follows the pattern `https://prometheus-{cluster_slug}.grafana.net`. The username is the numeric stack ID. gcx auto-discovers service URLs from the stack slug via GCOM.

## Stack Discovery without gcx

Look up the stack in the Grafana Cloud portal, or query the usage datasource:

```promql
grafanacloud_instance_info{name=~"STACK_NAME.*"}
```

Extract `cluster_slug` for the endpoint URL and `id` for the username.
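The mapping above can be sketched in a few lines of Python (illustrative only; `build_env` is a made-up helper, not part of dpm-finder, and the slug/ID values are invented):

```python
def build_env(cluster_slug: str, stack_id: int) -> dict:
    """Derive the two .env values from a stack's cluster_slug and id."""
    return {
        "PROMETHEUS_ENDPOINT": f"https://prometheus-{cluster_slug}.grafana.net",
        "PROMETHEUS_USERNAME": str(stack_id),
    }

env = build_env("prod-01-example", 123456)
print(env["PROMETHEUS_ENDPOINT"])
# → https://prometheus-prod-01-example.grafana.net
```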

## Running the Tool

### One-Shot Analysis (primary use case)

```bash
./dpm-finder.py -f json -m 2.0 -t 8 --timeout 120 -l 10
```

### CLI Flags Reference

| Flag | Default | Description |
|------|---------|-------------|
| `-f`, `--format` | `csv` | Output format: `csv`, `text`, `txt`, `json`, `prom` |
| `-m`, `--min-dpm` | `1.0` | Minimum DPM threshold to include a metric |
| `-t`, `--threads` | `10` | Concurrent processing threads |
| `-l`, `--lookback` | `10` | Lookback window in minutes for DPM calculation |
| `--timeout` | `60` | API request timeout in seconds |
| `--cost-per-1000-series` | (none) | Dollar cost per 1000 series; adds `estimated_cost` column |
| `-q`, `--quiet` | `false` | Suppress progress output |
| `-v`, `--verbose` | `false` | Enable debug logging |
| `-e`, `--exporter` | `false` | Run as Prometheus exporter instead of one-shot |
| `-p`, `--port` | `9966` | Exporter server port |
| `-u`, `--update-interval` | `86400` | Exporter metric refresh interval in seconds |

## Output Formats

Output files are written to the current working directory.

### JSON (`-f json`) -> `metric_rates.json`

Best for programmatic analysis. Includes per-series DPM breakdown:
- `metrics[].metric_name` -- the metric name
- `metrics[].dpm` -- data points per minute (maximum across this metric's individual series)
- `metrics[].series_count` -- number of active time series
- `metrics[].series_detail[]` -- per-label-set DPM breakdown (sorted by DPM descending)
- `total_metrics_above_threshold` -- count of metrics above threshold
- `performance_metrics.total_runtime_seconds` -- total processing time
- `performance_metrics.average_metric_processing_seconds` -- avg time per metric
- `performance_metrics.total_metrics_processed` -- total metrics analyzed
- `performance_metrics.metrics_per_second` -- processing throughput
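A minimal sketch of consuming this output, using the field names documented above (the metric names and values in the sample payload are made up):

```python
import json

# Sample payload mirroring the documented structure; values are invented.
sample = json.loads("""
{
  "metrics": [
    {"metric_name": "api_requests_total", "dpm": 4.0, "series_count": 1200},
    {"metric_name": "node_cpu_seconds_total", "dpm": 6.0, "series_count": 800}
  ],
  "total_metrics_above_threshold": 2
}
""")

# Sort by DPM descending to surface the noisiest metrics first.
top = sorted(sample["metrics"], key=lambda m: m["dpm"], reverse=True)
for m in top:
    print(f'{m["metric_name"]}: {m["dpm"]} DPM across {m["series_count"]} series')
```

In practice you would `json.load` the `metric_rates.json` file instead of an inline string.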

### CSV (`-f csv`) -> `metric_rates.csv`

Columns: `metric_name`, `dpm`, `series_count` (plus `estimated_cost` if `--cost-per-1000-series` is set).

### Text (`-f text`) -> `metric_rates.txt`

Human-readable format with per-series breakdown and performance statistics.

### Prometheus (`-f prom`) -> `metric_rates.prom`

Prometheus exposition format suitable for Alloy's `prometheus.exporter.unix` textfile collector.

## Interpreting Results

- DPM = data points per minute (maximum across this metric's individual series)
- `series_count` = number of active time series for that metric
- `series_detail` (JSON/text only) = per-label-combination DPM breakdown
- Sort by DPM descending to find the noisiest metrics
- For top metrics, examine `series_detail` to identify which label combinations drive the highest DPM
- If `--cost-per-1000-series` is set, use `estimated_cost` to prioritize by spend
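As a worked example of the cost column, assuming it is a simple pro-rata of the per-1000-series rate (an assumption; the tool's exact formula is documented in the README):

```python
def estimated_cost(series_count: int, cost_per_1000_series: float) -> float:
    """Pro-rata estimate: (series / 1000) * dollar rate per 1000 series."""
    return series_count / 1000 * cost_per_1000_series

# e.g. a metric with 2500 active series at $8 per 1000 series
print(estimated_cost(2500, 8.0))  # → 20.0
```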

## Rate Limiting

When running dpm-finder against multiple stacks, limit to max 3 concurrent runs. Batch the stacks and wait for each batch to complete before starting the next.
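The batching rule above can be sketched like this (`run_dpm_finder` is a hypothetical stand-in for launching the tool against one stack's credentials):

```python
from concurrent.futures import ThreadPoolExecutor

def run_dpm_finder(stack: str) -> str:
    # Placeholder: in practice this would invoke ./dpm-finder.py
    # with the given stack's .env credentials.
    return f"done: {stack}"

def run_in_batches(stacks, batch_size=3):
    """Run at most `batch_size` stacks concurrently; wait for each batch to finish."""
    results = []
    for i in range(0, len(stacks), batch_size):
        batch = stacks[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            # Exiting the with-block waits for the whole batch
            # before the next batch starts.
            results.extend(pool.map(run_dpm_finder, batch))
    return results

print(run_in_batches(["stack-a", "stack-b", "stack-c", "stack-d"]))
```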

## Metric Filtering

The tool automatically excludes:
- Histogram/summary components: `*_count`, `*_bucket`, `*_sum` suffixes
- Grafana internal metrics: `grafana_*` prefix
- Metrics with aggregation rules defined in the cluster (fetched from `/aggregations/rules`)
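A minimal sketch of these exclusion rules (the suffix/prefix checks mirror the list above; the `aggregated_rules` set is a made-up example of what `/aggregations/rules` might return, not real output):

```python
EXCLUDED_SUFFIXES = ("_count", "_bucket", "_sum")
EXCLUDED_PREFIXES = ("grafana_",)

def is_excluded(metric: str, aggregated: set) -> bool:
    """True if a metric matches any of the documented exclusion rules."""
    return (
        metric.endswith(EXCLUDED_SUFFIXES)
        or metric.startswith(EXCLUDED_PREFIXES)
        or metric in aggregated
    )

aggregated_rules = {"http_requests_total"}  # hypothetical aggregation-rule metrics
for name in ["api_latency_bucket", "grafana_build_info", "http_requests_total", "node_load1"]:
    print(name, "excluded" if is_excluded(name, aggregated_rules) else "kept")
```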

## Exporter Mode

Run as a long-lived Prometheus exporter instead of one-shot analysis:

```bash
./dpm-finder.py -e -p 9966 -u 86400
```

Serves metrics at `http://localhost:PORT/metrics`. Recalculates at the configured interval (default: daily). See `README.md` for full exporter and Docker documentation.
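If you point a Prometheus-compatible scraper at the exporter, a scrape config might look like this (a sketch assuming the default port from the flags above; the job name and scrape interval are arbitrary choices):

```yaml
scrape_configs:
  - job_name: dpm-finder
    scrape_interval: 5m            # the exporter itself only refreshes daily by default
    static_configs:
      - targets: ["localhost:9966"]
```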

## Docker

Alternative to local Python setup:

```bash
docker build -t dpm-finder:latest .
docker run --rm --env-file .env -v $(pwd)/output:/app/output \
  dpm-finder:latest --format json --min-dpm 2.0
```

See `README.md` for full Docker Compose, production deployment, and monitoring integration docs.

## Troubleshooting

### Common Errors

- Authentication failures (401/403): Verify the API key is valid and has the `metrics:read` scope. Confirm `PROMETHEUS_USERNAME` matches the numeric stack ID.
- Timeouts: Increase `--timeout` for large metric sets. The default is 60s; use 120s or higher for stacks with thousands of metrics.
- HTTP 422 errors: Usually means the metric has aggregation rules. The tool logs a warning and skips these automatically.
- Empty results: Lower the `--min-dpm` threshold. Check that `PROMETHEUS_ENDPOINT` does not have a trailing path after `.net`.
- Connection errors: Verify network connectivity to the Prometheus endpoint. The tool retries with exponential backoff (up to 10 retries).

### Retry Behavior

The tool retries failed API requests with exponential backoff (up to 10 retries). Rate-limited responses (HTTP 429) are backed off automatically. HTTP 4xx errors other than 429 are not retried.
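The retry policy described above can be sketched as follows (a simplified illustration, not the tool's actual code; `send_request` is a hypothetical callable returning a status code and body):

```python
import time

MAX_RETRIES = 10

def request_with_backoff(send_request, base_delay=0.5):
    """Retry with exponential backoff; back off on 429, give up on other 4xx."""
    for attempt in range(MAX_RETRIES + 1):
        status, body = send_request()
        if status < 400:
            return body
        if 400 <= status < 500 and status != 429:
            raise RuntimeError(f"non-retryable client error: {status}")
        if attempt == MAX_RETRIES:
            raise RuntimeError("retries exhausted")
        time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Example: rate-limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, "ok")])
print(request_with_backoff(lambda: next(responses), base_delay=0.0))  # → ok
```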

## Project Structure

```
dpm-finder.py          # Main CLI tool (one-shot + exporter modes)
requirements.txt       # Python dependencies
.env_example           # Template for credential configuration
Dockerfile             # Multi-stage Docker build
docker-compose.yml     # Docker Compose orchestration
README.md              # Full project documentation
```