# dpm-finder
A Grafana Professional Services tool for identifying which Prometheus metrics
drive high Data Points per Minute (DPM). Analyzes metric-level DPM with
per-label breakdown to help optimize Grafana Cloud costs.
## Quick Start
### Prerequisites
- Python 3.9+
- Access to a Grafana Cloud Prometheus endpoint (or any Prometheus-compatible API)
### Setup
- Clone the repo and create a virtual environment:

  ```bash
  git clone https://github.com/grafana-ps/dpm-finder.git
  cd dpm-finder
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Configure credentials by copying `.env_example` to `.env` and filling in values:
  - `PROMETHEUS_ENDPOINT` -- The Prometheus endpoint URL (must end in `.net`, nothing after)
  - `PROMETHEUS_USERNAME` -- Tenant ID / stack ID (numeric)
  - `PROMETHEUS_API_KEY` -- Grafana Cloud API key (`glc_...` format)
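A filled-in `.env` might look like the following (the cluster slug, stack ID, and key are made-up placeholders, not real values):

```
PROMETHEUS_ENDPOINT=https://prometheus-prod-13-prod-us-east-0.grafana.net
PROMETHEUS_USERNAME=123456
PROMETHEUS_API_KEY=glc_...
```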
## Stack Discovery with gcx
If gcx is available, use it to find stack details:

```bash
gcx config check          # Show active stack context
gcx config list-contexts  # List all configured stacks
gcx config view           # Full config with endpoints
```

The Prometheus endpoint follows the pattern `https://prometheus-{cluster_slug}.grafana.net`. The username is the numeric stack ID. gcx auto-discovers service URLs from the stack slug via GCOM.
## Stack Discovery without gcx
Look up the stack in the Grafana Cloud portal, or query the usage datasource:

```
grafanacloud_instance_info{name=~"STACK_NAME.*"}
```

Extract `cluster_slug` for the endpoint URL and `id` for the username.

## Running the Tool
### One-Shot Analysis (primary use case)
```bash
./dpm-finder.py -f json -m 2.0 -t 8 --timeout 120 -l 10
```

### CLI Flags Reference
| Flag | Default | Description |
|---|---|---|
| `-f` | | Output format: `json`, `csv`, `text`, or `prom` |
| `-m` | | Minimum DPM threshold to include a metric |
| `-t` | | Concurrent processing threads |
| `-l` | | Lookback window in minutes for DPM calculation |
| `--timeout` | 60 | API request timeout in seconds |
| `--cost-per-1000-series` | (none) | Dollar cost per 1000 series; adds `estimated_cost` column |
| | | Suppress progress output |
| | | Enable debug logging |
| `-e` | | Run as Prometheus exporter instead of one-shot |
| `-p` | | Exporter server port |
| `-u` | 86400 | Exporter metric refresh interval in seconds |
## Output Formats
Output files are written to the current working directory.
### JSON (`-f json`) -> `metric_rates.json`

Best for programmatic analysis. Includes per-series DPM breakdown:
- `metrics[].metric_name` -- the metric name
- `metrics[].dpm` -- data points per minute (maximum across this metric's individual series)
- `metrics[].series_count` -- number of active time series
- `metrics[].series_detail[]` -- per-label-set DPM breakdown (sorted by DPM descending)
- `total_metrics_above_threshold` -- count of metrics above threshold
- `performance_metrics.total_runtime_seconds` -- total processing time
- `performance_metrics.average_metric_processing_seconds` -- avg time per metric
- `performance_metrics.total_metrics_processed` -- total metrics analyzed
- `performance_metrics.metrics_per_second` -- processing throughput
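A trimmed example following the fields above might look like this (metric names and values are illustrative, and the exact shape of `series_detail` entries is an assumption):

```json
{
  "metrics": [
    {
      "metric_name": "http_requests_total",
      "dpm": 4.0,
      "series_count": 120,
      "series_detail": [
        {"labels": {"job": "api", "instance": "10.0.0.1:8080"}, "dpm": 4.0},
        {"labels": {"job": "api", "instance": "10.0.0.2:8080"}, "dpm": 2.0}
      ]
    }
  ],
  "total_metrics_above_threshold": 1,
  "performance_metrics": {
    "total_runtime_seconds": 42.5,
    "average_metric_processing_seconds": 0.21,
    "total_metrics_processed": 200,
    "metrics_per_second": 4.7
  }
}
```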
### CSV (`-f csv`) -> `metric_rates.csv`

Columns: `metric_name`, `dpm`, `series_count` (plus `estimated_cost` if `--cost-per-1000-series` is set).

### Text (`-f text`) -> `metric_rates.txt`

Human-readable format with per-series breakdown and performance statistics.
### Prometheus (`-f prom`) -> `metric_rates.prom`

Prometheus exposition format suitable for Alloy's textfile collector (`prometheus.exporter.unix`).

## Interpreting Results
- DPM = data points per minute (maximum across this metric's individual series)
- series_count = number of active time series for that metric
- series_detail (JSON/text only) = per-label-combination DPM breakdown
- Sort by DPM descending to find the noisiest metrics
- For top metrics, examine `series_detail` to identify which label combinations drive the highest DPM
- If `--cost-per-1000-series` is set, use `estimated_cost` to prioritize by spend
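The prioritization steps above can be sketched in a few lines of Python. This is a minimal sketch, not the tool's own code: the sample `report` dict stands in for a loaded `metric_rates.json`, and the `estimated_cost` formula (`series_count / 1000 * rate`) is an assumption about how the cost column is derived.

```python
import json  # a real run would do: report = json.load(open("metric_rates.json"))

# Hypothetical report contents mirroring the JSON format above.
report = {
    "metrics": [
        {"metric_name": "api_latency_seconds", "dpm": 4.0, "series_count": 500},
        {"metric_name": "queue_depth", "dpm": 12.0, "series_count": 50},
        {"metric_name": "cache_hits_total", "dpm": 1.0, "series_count": 2000},
    ]
}

def top_offenders(report, cost_per_1000_series=None, limit=10):
    """Sort metrics by DPM descending; optionally attach an estimated cost."""
    metrics = sorted(report["metrics"], key=lambda m: m["dpm"], reverse=True)
    if cost_per_1000_series is not None:
        for m in metrics:
            # Assumed formula: dollars per 1000 series, scaled by series count.
            m["estimated_cost"] = m["series_count"] / 1000 * cost_per_1000_series
    return metrics[:limit]

for m in top_offenders(report, cost_per_1000_series=8.0):
    print(f'{m["metric_name"]}: dpm={m["dpm"]}, series={m["series_count"]}')
```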
## Rate Limiting
When running dpm-finder against multiple stacks, limit to max 3 concurrent runs. Batch the stacks and wait for each batch to complete before starting the next.
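One way to enforce the 3-run cap is a thread pool with three workers, which starts a queued stack only when a slot frees up. The sketch below simulates the per-stack runs with a sleep so the concurrency cap is visible; in practice each `analyze` call would launch `./dpm-finder.py` with that stack's credentials.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

stacks = [f"stack-{i}" for i in range(7)]  # hypothetical stack names

peak = 0      # highest number of simultaneously running analyses observed
active = 0
lock = threading.Lock()

def analyze(stack):
    """Stand-in for one dpm-finder run against `stack`."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.05)  # simulate the API-heavy analysis
    with lock:
        active -= 1
    return stack

# max_workers=3 is the recommended concurrency limit; pool.map drains
# the stack list in batches of at most three concurrent runs.
with ThreadPoolExecutor(max_workers=3) as pool:
    done = list(pool.map(analyze, stacks))

print(f"analyzed {len(done)} stacks, peak concurrency {peak}")
```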
## Metric Filtering
The tool automatically excludes:

- Histogram/summary components: `*_count`, `*_sum`, `*_bucket` suffixes
- Grafana internal metrics: `grafana_*` prefix
- Metrics with aggregation rules defined in the cluster (fetched from `/aggregations/rules`)
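The exclusion rules above can be expressed as a small predicate. This is a reading of the rules, not the tool's actual code; `aggregated` stands in for the metric names returned by the cluster's `/aggregations/rules` endpoint.

```python
EXCLUDED_SUFFIXES = ("_count", "_sum", "_bucket")  # histogram/summary parts
EXCLUDED_PREFIXES = ("grafana_",)                  # Grafana internal metrics

def should_exclude(name: str, aggregated: set) -> bool:
    """True if `name` matches any of the three exclusion rules."""
    return (
        name.endswith(EXCLUDED_SUFFIXES)
        or name.startswith(EXCLUDED_PREFIXES)
        or name in aggregated
    )

aggregated = {"requests_total"}  # hypothetical aggregation-rule metric
for name in ["http_request_duration_seconds_bucket",
             "grafana_http_request_total",
             "requests_total",
             "queue_depth"]:
    print(name, "excluded" if should_exclude(name, aggregated) else "kept")
```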
## Exporter Mode
Run as a long-lived Prometheus exporter instead of one-shot analysis:

```bash
./dpm-finder.py -e -p 9966 -u 86400
```

Serves metrics at `http://localhost:PORT/metrics`. Recalculates at the configured interval (default: daily). See `README.md` for full exporter and Docker documentation.
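A Prometheus-compatible collector could then scrape the exporter with a minimal config like this (job name and interval are illustrative; the port matches the `-p 9966` example above):

```yaml
scrape_configs:
  - job_name: "dpm-finder"
    scrape_interval: 5m   # results only change once per refresh interval
    static_configs:
      - targets: ["localhost:9966"]
```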
## Docker
Alternative to local Python setup:

```bash
docker build -t dpm-finder:latest .
docker run --rm --env-file .env -v $(pwd)/output:/app/output \
  dpm-finder:latest --format json --min-dpm 2.0
```

See `README.md` for full Docker Compose, production deployment, and monitoring integration docs.

## Troubleshooting
### Common Errors
- Authentication failures (401/403): Verify the API key is valid and has the `metrics:read` scope. Confirm `PROMETHEUS_USERNAME` matches the numeric stack ID.
- Timeouts: Increase `--timeout` for large metric sets. The default is 60s; use 120s or higher for stacks with thousands of metrics.
- HTTP 422 errors: Usually means the metric has aggregation rules. The tool logs a warning and skips these automatically.
- Empty results: Lower the `--min-dpm` threshold. Check that `PROMETHEUS_ENDPOINT` does not have a trailing path after `.net`.
- Connection errors: Verify network connectivity to the Prometheus endpoint. The tool retries with exponential backoff (up to 10 retries).
### Retry Behavior
The tool retries failed API requests with exponential backoff (up to 10 retries). Rate-limited responses (HTTP 429) are backed off automatically. HTTP 4xx errors other than 429 are not retried.
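That policy can be sketched as follows. Only the 10-retry cap, the 429 backoff, and the no-retry on other 4xx come from the description above; the delay constants, jitter, and the `send` callable are assumptions for illustration.

```python
import random
import time

def request_with_backoff(send, max_retries=10, base_delay=1.0):
    """Retry `send()` with exponential backoff.

    `send` is any callable returning an object with a .status_code
    attribute (e.g. a wrapped HTTP request). 429 and transient failures
    are retried; other 4xx errors fail immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            resp = send()
        except ConnectionError:
            resp = None  # treat connection errors as retryable
        if resp is not None:
            if resp.status_code < 400:
                return resp
            # A 4xx other than 429 will not succeed on retry.
            if 400 <= resp.status_code < 500 and resp.status_code != 429:
                raise RuntimeError(f"client error {resp.status_code}")
        if attempt == max_retries:
            raise RuntimeError("retries exhausted")
        # Exponential backoff with a little jitter to avoid thundering herds.
        time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```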
## Project Structure
```
dpm-finder.py        # Main CLI tool (one-shot + exporter modes)
requirements.txt     # Python dependencies
.env_example         # Template for credential configuration
Dockerfile           # Multi-stage Docker build
docker-compose.yml   # Docker Compose orchestration
README.md            # Full project documentation
```