# dpm-finder
A Grafana Professional Services tool for identifying which Prometheus metrics
drive high Data Points per Minute (DPM). Analyzes metric-level DPM with
per-label breakdown to help optimize Grafana Cloud costs.
## Quick Start
### Prerequisites
- Python 3.9+
- Access to a Grafana Cloud Prometheus endpoint (or any Prometheus-compatible API)
### Setup
- Clone the repo and create a virtual environment:

  ```bash
  git clone https://github.com/grafana-ps/dpm-finder.git
  cd dpm-finder
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Configure credentials by copying `.env_example` to `.env` and filling in values:
  - `PROMETHEUS_ENDPOINT` -- The Prometheus endpoint URL (must end in `.net`, nothing after)
  - `PROMETHEUS_USERNAME` -- Tenant ID / stack ID (numeric)
  - `PROMETHEUS_API_KEY` -- Grafana Cloud API key (`glc_...` format)
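A filled-in `.env` might look like the following (the cluster slug, stack ID, and key are made-up placeholders, not real values):

```
PROMETHEUS_ENDPOINT=https://prometheus-prod-13-prod-us-east-0.grafana.net
PROMETHEUS_USERNAME=123456
PROMETHEUS_API_KEY=glc_...
```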
## Stack Discovery with gcx
If gcx is available, use it to find stack details:

```bash
gcx config check          # Show active stack context
gcx config list-contexts  # List all configured stacks
gcx config view           # Full config with endpoints
```

The Prometheus endpoint follows the pattern `https://prometheus-{cluster_slug}.grafana.net`. The username is the numeric stack ID. gcx auto-discovers service URLs from the stack slug via GCOM.
## Stack Discovery without gcx
Look up the stack in the Grafana Cloud portal, or query the usage datasource:

```
grafanacloud_instance_info{name=~"STACK_NAME.*"}
```

Extract `cluster_slug` for the endpoint URL and `id` for the username.

## Running the Tool
### One-Shot Analysis (primary use case)
```bash
./dpm-finder.py -f json -m 2.0 -t 8 --timeout 120 -l 10
```

### CLI Flags Reference
| Flag | Default | Description |
|---|---|---|
| `-f` | | Output format: `json`, `csv`, `text`, or `prom` |
| `-m` | | Minimum DPM threshold to include a metric |
| `-t` | | Concurrent processing threads |
| `-l` | | Lookback window in minutes for DPM calculation |
| `--timeout` | 60 | API request timeout in seconds |
| `--cost-per-1000-series` | (none) | Dollar cost per 1000 series; adds `estimated_cost` column |
| | | Suppress progress output |
| | | Enable debug logging |
| `-e` | | Run as Prometheus exporter instead of one-shot |
| `-p` | | Exporter server port |
| `-u` | 86400 | Exporter metric refresh interval in seconds |
## Output Formats
Output files are written to the current working directory.
### JSON (`-f json`) -> `metric_rates.json`

Best for programmatic analysis. Includes per-series DPM breakdown:
- `metrics[].metric_name` -- the metric name
- `metrics[].dpm` -- data points per minute (maximum across this metric's individual series)
- `metrics[].series_count` -- number of active time series
- `metrics[].series_detail[]` -- per-label-set DPM breakdown (sorted by DPM descending)
- `total_metrics_above_threshold` -- count of metrics above threshold
- `performance_metrics.total_runtime_seconds` -- total processing time
- `performance_metrics.average_metric_processing_seconds` -- avg time per metric
- `performance_metrics.total_metrics_processed` -- total metrics analyzed
- `performance_metrics.metrics_per_second` -- processing throughput
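A trimmed example following the fields above might look like this (metric names and values are illustrative, and the exact shape of `series_detail` entries is an assumption):

```json
{
  "metrics": [
    {
      "metric_name": "http_requests_total",
      "dpm": 4.0,
      "series_count": 120,
      "series_detail": [
        {"labels": {"job": "api", "instance": "10.0.0.1:8080"}, "dpm": 4.0},
        {"labels": {"job": "api", "instance": "10.0.0.2:8080"}, "dpm": 2.0}
      ]
    }
  ],
  "total_metrics_above_threshold": 1,
  "performance_metrics": {
    "total_runtime_seconds": 42.5,
    "average_metric_processing_seconds": 0.21,
    "total_metrics_processed": 200,
    "metrics_per_second": 4.7
  }
}
```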
### CSV (`-f csv`) -> `metric_rates.csv`

Columns: `metric_name`, `dpm`, `series_count` (plus `estimated_cost` if `--cost-per-1000-series` is set).

### Text (`-f text`) -> `metric_rates.txt`

Human-readable format with per-series breakdown and performance statistics.
### Prometheus (`-f prom`) -> `metric_rates.prom`

Prometheus exposition format suitable for Alloy's textfile collector (`prometheus.exporter.unix`).

## Interpreting Results
- DPM = data points per minute (maximum across this metric's individual series)
- series_count = number of active time series for that metric
- series_detail (JSON/text only) = per-label-combination DPM breakdown
- Sort by DPM descending to find the noisiest metrics
- For top metrics, examine `series_detail` to identify which label combinations drive the highest DPM
- If `--cost-per-1000-series` is set, use `estimated_cost` to prioritize by spend
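The prioritization steps above can be sketched in a few lines of Python. This is a minimal sketch, not the tool's own code: the sample `report` dict stands in for a loaded `metric_rates.json`, and the `estimated_cost` formula (`series_count / 1000 * rate`) is an assumption about how the cost column is derived.

```python
import json  # a real run would do: report = json.load(open("metric_rates.json"))

# Hypothetical report contents mirroring the JSON format above.
report = {
    "metrics": [
        {"metric_name": "api_latency_seconds", "dpm": 4.0, "series_count": 500},
        {"metric_name": "queue_depth", "dpm": 12.0, "series_count": 50},
        {"metric_name": "cache_hits_total", "dpm": 1.0, "series_count": 2000},
    ]
}

def top_offenders(report, cost_per_1000_series=None, limit=10):
    """Sort metrics by DPM descending; optionally attach an estimated cost."""
    metrics = sorted(report["metrics"], key=lambda m: m["dpm"], reverse=True)
    if cost_per_1000_series is not None:
        for m in metrics:
            # Assumed formula: dollars per 1000 series, scaled by series count.
            m["estimated_cost"] = m["series_count"] / 1000 * cost_per_1000_series
    return metrics[:limit]

for m in top_offenders(report, cost_per_1000_series=8.0):
    print(f'{m["metric_name"]}: dpm={m["dpm"]}, series={m["series_count"]}')
```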
## Rate Limiting
When running dpm-finder against multiple stacks, limit to max 3 concurrent runs. Batch the stacks and wait for each batch to complete before starting the next.
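One way to enforce the 3-run cap is a thread pool with three workers, which starts a queued stack only when a slot frees up. The sketch below simulates the per-stack runs with a sleep so the concurrency cap is visible; in practice each `analyze` call would launch `./dpm-finder.py` with that stack's credentials.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

stacks = [f"stack-{i}" for i in range(7)]  # hypothetical stack names

peak = 0      # highest number of simultaneously running analyses observed
active = 0
lock = threading.Lock()

def analyze(stack):
    """Stand-in for one dpm-finder run against `stack`."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.05)  # simulate the API-heavy analysis
    with lock:
        active -= 1
    return stack

# max_workers=3 is the recommended concurrency limit; pool.map drains
# the stack list in batches of at most three concurrent runs.
with ThreadPoolExecutor(max_workers=3) as pool:
    done = list(pool.map(analyze, stacks))

print(f"analyzed {len(done)} stacks, peak concurrency {peak}")
```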
## Metric Filtering
The tool automatically excludes:

- Histogram/summary components: `*_count`, `*_sum`, `*_bucket` suffixes
- Grafana internal metrics: `grafana_*` prefix
- Metrics with aggregation rules defined in the cluster (fetched from `/aggregations/rules`)
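The exclusion rules above can be expressed as a small predicate. This is a reading of the rules, not the tool's actual code; `aggregated` stands in for the metric names returned by the cluster's `/aggregations/rules` endpoint.

```python
EXCLUDED_SUFFIXES = ("_count", "_sum", "_bucket")  # histogram/summary parts
EXCLUDED_PREFIXES = ("grafana_",)                  # Grafana internal metrics

def should_exclude(name: str, aggregated: set) -> bool:
    """True if `name` matches any of the three exclusion rules."""
    return (
        name.endswith(EXCLUDED_SUFFIXES)
        or name.startswith(EXCLUDED_PREFIXES)
        or name in aggregated
    )

aggregated = {"requests_total"}  # hypothetical aggregation-rule metric
for name in ["http_request_duration_seconds_bucket",
             "grafana_http_request_total",
             "requests_total",
             "queue_depth"]:
    print(name, "excluded" if should_exclude(name, aggregated) else "kept")
```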
## Exporter Mode
Run as a long-lived Prometheus exporter instead of one-shot analysis:

```bash
./dpm-finder.py -e -p 9966 -u 86400
```

Serves metrics at `http://localhost:PORT/metrics`. Recalculates at the configured interval (default: daily). See `README.md` for full exporter and Docker documentation.
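A Prometheus-compatible collector could then scrape the exporter with a minimal config like this (job name and interval are illustrative; the port matches the `-p 9966` example above):

```yaml
scrape_configs:
  - job_name: "dpm-finder"
    scrape_interval: 5m   # results only change once per refresh interval
    static_configs:
      - targets: ["localhost:9966"]
```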
## Docker
Alternative to local Python setup:

```bash
docker build -t dpm-finder:latest .
docker run --rm --env-file .env -v $(pwd)/output:/app/output \
  dpm-finder:latest --format json --min-dpm 2.0
```

See `README.md` for full Docker Compose, production deployment, and monitoring integration docs.

## Troubleshooting
### Common Errors
- Authentication failures (401/403): Verify the API key is valid and has the `metrics:read` scope. Confirm `PROMETHEUS_USERNAME` matches the numeric stack ID.
- Timeouts: Increase `--timeout` for large metric sets. The default is 60s; use 120s or higher for stacks with thousands of metrics.
- HTTP 422 errors: Usually means the metric has aggregation rules. The tool logs a warning and skips these automatically.
- Empty results: Lower the `--min-dpm` threshold. Check that `PROMETHEUS_ENDPOINT` does not have a trailing path after `.net`.
- Connection errors: Verify network connectivity to the Prometheus endpoint. The tool retries with exponential backoff (up to 10 retries).
### Retry Behavior
The tool retries failed API requests with exponential backoff (up to 10 retries). Rate-limited responses (HTTP 429) are backed off automatically. HTTP 4xx errors other than 429 are not retried.
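That policy can be sketched as follows. Only the 10-retry cap, the 429 backoff, and the no-retry on other 4xx come from the description above; the delay constants, jitter, and the `send` callable are assumptions for illustration.

```python
import random
import time

def request_with_backoff(send, max_retries=10, base_delay=1.0):
    """Retry `send()` with exponential backoff.

    `send` is any callable returning an object with a .status_code
    attribute (e.g. a wrapped HTTP request). 429 and transient failures
    are retried; other 4xx errors fail immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            resp = send()
        except ConnectionError:
            resp = None  # treat connection errors as retryable
        if resp is not None:
            if resp.status_code < 400:
                return resp
            # A 4xx other than 429 will not succeed on retry.
            if 400 <= resp.status_code < 500 and resp.status_code != 429:
                raise RuntimeError(f"client error {resp.status_code}")
        if attempt == max_retries:
            raise RuntimeError("retries exhausted")
        # Exponential backoff with a little jitter to avoid thundering herds.
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```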
## Project Structure
```
dpm-finder.py        # Main CLI tool (one-shot + exporter modes)
requirements.txt     # Python dependencies
.env_example         # Template for credential configuration
Dockerfile           # Multi-stage Docker build
docker-compose.yml   # Docker Compose orchestration
README.md            # Full project documentation
```