# Axiom Cost Control

Dashboards, monitors, and waste identification for Axiom usage optimization.
## Before You Start
1. **Load required skills:**

   ```
   skill: axiom-sre
   skill: building-dashboards
   ```

   building-dashboards provides: `dashboard-list`, `dashboard-get`, `dashboard-create`, `dashboard-update`, `dashboard-delete`

2. **Find the audit dataset.** Try `axiom-audit` first:

   ```apl
   ['axiom-audit'] | where _time > ago(1h) | summarize count() by action | where action in ('usageCalculated', 'runAPLQueryCost')
   ```

   - If not found → ask user. Common names: `axiom-audit-logs-view`, `audit-logs`
   - If found but no `usageCalculated` events → wrong dataset, ask user

3. **Verify `axiom-history` access** (required for Phase 4):

   ```apl
   ['axiom-history'] | where _time > ago(1h) | take 1
   ```

   If not found, Phase 4 optimization will not work.

4. **Confirm with user:**
   - Deployment name?
   - Audit dataset name?
   - Contract limit in TB/day? (required for Phase 3 monitors)

5. **Replace `<deployment>` and `<audit-dataset>`** in all commands below.

Tips:

- Run any script with `-h` for full usage
- Do NOT pipe script output to `head` or `tail` — causes SIGPIPE errors
- Requires `jq` for JSON parsing
- Use axiom-sre's `axiom-query` for ad-hoc APL, not the direct CLI
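Step 5's placeholder substitution can be sketched with `sed`; the deployment and dataset names below are hypothetical examples, not values from this runbook:

```shell
# Hypothetical example values; use your real deployment and audit dataset names.
deployment="prod-eu"
audit_dataset="axiom-audit"

# Substitute the <deployment> and <audit-dataset> placeholders in a command template.
template='scripts/baseline-stats -d <deployment> -a <audit-dataset>'
cmd=$(printf '%s' "$template" | sed -e "s/<deployment>/$deployment/" -e "s/<audit-dataset>/$audit_dataset/")
echo "$cmd"
```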
## Which Phases to Run
| User request | Run these phases |
|---|---|
| "reduce costs" / "find waste" | 0 → 1 → 4 |
| "set up cost control" | 0 → 1 → 2 → 3 |
| "deploy dashboard" | 0 → 2 |
| "create monitors" | 0 → 3 |
| "check for drift" | 0 only |
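The routing table above can be encoded as a small dispatcher (a sketch; the request strings are illustrative shorthand for the user intents in the table):

```shell
# Map a user request to the phases to run, per the routing table above.
phases_for() {
  case "$1" in
    "reduce costs"|"find waste")  echo "0 1 4" ;;
    "set up cost control")        echo "0 1 2 3" ;;
    "deploy dashboard")           echo "0 2" ;;
    "create monitors")            echo "0 3" ;;
    "check for drift")            echo "0" ;;
    *)                            echo "unknown request" >&2; return 1 ;;
  esac
}

phases_for "deploy dashboard"   # prints: 0 2
```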
## Phase 0: Check Existing Setup
**Existing dashboard?**

```bash
dashboard-list <deployment> | grep -i cost
```

**Existing monitors?**

```bash
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
```

If found, fetch with `dashboard-get` and compare to `templates/dashboard.json` for drift.

---

## Phase 1: Discovery
```bash
scripts/baseline-stats -d <deployment> -a <audit-dataset>
```

Captures daily ingest stats and produces the Analysis Queue (needed for Phase 4).
## Phase 2: Dashboard
```bash
scripts/deploy-dashboard -d <deployment> -a <audit-dataset>
```

Creates a dashboard with: ingest trends, burn rate, projections, waste candidates, top users. See `reference/dashboard-panels.md` for details.

## Phase 3: Monitors
**The contract limit is required.** You must have it from preflight step 4.
### Step 1: List available notifiers
```bash
scripts/list-notifiers -d <deployment>
```

Present the list to the user and ask which notifier they want for cost alerts. If they don't want notifications, proceed without `-n`.

### Step 2: Create monitors
```bash
scripts/create-monitors -d <deployment> -a <audit-dataset> -c <contract_tb> [-n <notifier_id>]
```

Creates 3 monitors:

- **Total Ingest Guard** — alerts when daily ingest >1.2x contract OR 7-day avg grows >15% vs baseline
- **Per-Dataset Spike** — robust z-score detection, alerts per dataset with attribution
- **Query Cost Spike** — same z-score approach for query costs (GB·ms)

The spike monitors use `notifyByGroup: true` so each dataset triggers a separate alert. See `reference/monitor-strategy.md` for threshold derivation.
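As an illustration of the Total Ingest Guard conditions, with hypothetical figures (the real thresholds are derived by `scripts/create-monitors`, per `reference/monitor-strategy.md`):

```shell
contract_tb=167    # hypothetical: ~5 PB/month contract in TB/day
daily_tb=210       # hypothetical: today's ingest
baseline_avg=150   # hypothetical: baseline 7-day average
week_avg=180       # hypothetical: current 7-day average

# Check the two alert conditions: daily ingest >1.2x contract,
# and 7-day average growth >15% vs baseline.
awk -v d="$daily_tb" -v c="$contract_tb" -v b="$baseline_avg" -v w="$week_avg" 'BEGIN {
  if (d > 1.2 * c)        print "ALERT: daily ingest above 1.2x contract"
  if ((w - b) / b > 0.15) print "ALERT: 7-day avg grew >15% vs baseline"
}'
```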
## Phase 4: Optimization
### Get the Analysis Queue
Run `scripts/baseline-stats` if not already done. It outputs a prioritized list:

| Priority | Meaning |
|---|---|
| P0⛔ | Top 3 by ingest OR >10% of total — MANDATORY |
| P1 | Never queried — strong drop candidate |
| P2 | Rarely queried (Work/GB < 100) — likely waste |

Work/GB = query cost (GB·ms) / ingest (GB). Lower = less value from the data.
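The Work/GB ratio and its priority bucket can be computed directly; the ingest and query-cost figures below are hypothetical:

```shell
ingest_gb=500          # hypothetical daily ingest (GB)
query_cost_gbms=20000  # hypothetical daily query cost (GB·ms)

# Work/GB = query cost / ingest; 0 means never queried (P1), <100 rarely queried (P2).
awk -v cost="$query_cost_gbms" -v ingest="$ingest_gb" 'BEGIN {
  w = cost / ingest
  printf "Work/GB = %g\n", w
  if (w == 0)       print "P1: never queried"
  else if (w < 100) print "P2: rarely queried"
}'
```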
### Analyze datasets in order
Work top-to-bottom. For each dataset:

**Step 1: Column analysis**

```bash
scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset>
```

If 0 queries → recommend DROP, move to next.

**Step 2: Field value analysis**

Pick a field from the suggested list (usually `app`, `service`, or `kubernetes.labels.app`):

```bash
scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset> -f <field>
```

Note values with high volume but never queried (⚠️ markers).

**Step 3: Handle empty values**

If `(empty)` has >5% of volume, you MUST drill down with an alternative field (e.g., `kubernetes.namespace_name`).

**Step 4: Record recommendation**

For each dataset, note: name, ingest volume, Work/GB, top unqueried values, action (DROP/SAMPLE/KEEP), estimated savings.
### Done when
All P0⛔ and P1 datasets analyzed. Then compile the report using `reference/analysis-report-template.md`.

## Cleanup
**Delete monitors**

```bash
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
axiom-api <deployment> DELETE "/v2/monitors/<id>"
```

**Delete dashboard**

```bash
dashboard-list <deployment> | grep -i cost
dashboard-delete <deployment> <id>
```

**Note:** Running `create-monitors` twice creates duplicates. Delete existing monitors first if re-deploying.

---

## Reference
### Audit Dataset Fields
| Field | Description |
|---|---|
| | Hourly ingest in bytes |
| | Hourly query cost |
| | Dataset name |
| | Org ID |
| | User email |
### Common Fields for Value Analysis
| Dataset type | Primary field | Alternatives |
|---|---|---|
| Kubernetes logs | | |
| Application logs | | |
| Infrastructure | | |
| Traces | | |
### Units & Conversions
- Scripts use TB/day
- Dashboard filter uses GB/month
| Contract | TB/day | GB/month |
|---|---|---|
| 5 PB/month | 167 | 5,000,000 |
| 10 PB/month | 333 | 10,000,000 |
| 15 PB/month | 500 | 15,000,000 |
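The table's conversions follow from 1 PB = 1,000 TB = 1,000,000 GB and a 30-day month; a quick sketch for an arbitrary contract size:

```shell
pb_per_month=5   # contract size in PB/month (example: 5 PB/month row above)

# PB/month -> TB/day (30-day month) and GB/month.
awk -v pb="$pb_per_month" 'BEGIN {
  printf "%.0f TB/day\n", pb * 1000 / 30
  printf "%d GB/month\n", pb * 1000 * 1000
}'
```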
### Optimization Actions
| Signal | Action |
|---|---|
| Work/GB = 0 | Drop or stop ingesting |
| High-volume unqueried values | Sample or reduce log level |
| Empty values from system namespaces | Filter at ingest or accept |
| WoW spike | Check recent deploys |
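The week-over-week (WoW) spike signal can be sketched as a percent-change check; the ingest figures below are hypothetical:

```shell
last_week_avg=180  # hypothetical avg daily ingest, previous week (TB)
this_week_avg=240  # hypothetical avg daily ingest, current week (TB)

# Flag a week-over-week jump above 15% as a spike worth correlating with deploys.
awk -v last="$last_week_avg" -v cur="$this_week_avg" 'BEGIN {
  pct = (cur - last) / last * 100
  printf "WoW change: %+.1f%%\n", pct
  if (pct > 15) print "spike detected: check recent deploys"
}'
```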