
# Axiom Cost Control


Dashboards, monitors, and waste identification for Axiom usage optimization.

## Before You Start


1. Load required skills:
   - `skill: axiom-sre`
   - `skill: building-dashboards`

   building-dashboards provides: `dashboard-list`, `dashboard-get`, `dashboard-create`, `dashboard-update`, `dashboard-delete`.

2. Find the audit dataset. Try `axiom-audit` first:

   ```apl
   ['axiom-audit']
   | where _time > ago(1h)
   | summarize count() by action
   | where action in ('usageCalculated', 'runAPLQueryCost')
   ```

   - If not found → ask the user. Common names: `axiom-audit-logs-view`, `audit-logs`
   - If found but no `usageCalculated` events → wrong dataset, ask the user

3. Verify `axiom-history` access (required for Phase 4):

   ```apl
   ['axiom-history'] | where _time > ago(1h) | take 1
   ```

   If not found, Phase 4 optimization will not work.

4. Confirm with the user:
   - Deployment name?
   - Audit dataset name?
   - Contract limit in TB/day? (required for Phase 3 monitors)

5. Replace `<deployment>` and `<audit-dataset>` in all commands below.

Tips:

- Run any script with `-h` for full usage
- Do NOT pipe script output to `head` or `tail` — causes SIGPIPE errors
- Requires `jq` for JSON parsing
- Use axiom-sre's `axiom-query` for ad-hoc APL, not the CLI directly

## Which Phases to Run

| User request | Run these phases |
| --- | --- |
| "reduce costs" / "find waste" | 0 → 1 → 4 |
| "set up cost control" | 0 → 1 → 2 → 3 |
| "deploy dashboard" | 0 → 2 |
| "create monitors" | 0 → 3 |
| "check for drift" | 0 only |

## Phase 0: Check Existing Setup

Existing dashboard?

```bash
dashboard-list <deployment> | grep -i cost
```

Existing monitors?

```bash
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
```

If found, fetch with `dashboard-get` and compare to `templates/dashboard.json` for drift.
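The drift comparison can be sketched with `jq` and `diff`. The JSON payloads and the volatile fields stripped below (`id`, `updated`) are illustrative stand-ins; adjust them to whatever `dashboard-get` actually returns:

```shell
# Stand-ins for `dashboard-get` output and templates/dashboard.json
# (field names are assumptions for illustration):
cat > live.json <<'EOF'
{"id":"abc123","name":"Cost Control","refreshTime":30,"updated":"2024-01-01"}
EOF
cat > template.json <<'EOF'
{"name":"Cost Control","refreshTime":60}
EOF

# Sort keys and drop volatile fields so only real config drift shows up.
normalize() { jq -S 'del(.id, .updated)' "$1"; }

normalize template.json > template.norm
normalize live.json > live.norm

if diff -u template.norm live.norm; then
  echo "no drift"
else
  echo "drift detected"
fi
```

Sorting keys with `jq -S` before diffing keeps key-order differences from being reported as drift.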

---

## Phase 1: Discovery

```bash
scripts/baseline-stats -d <deployment> -a <audit-dataset>
```

Captures daily ingest stats and produces the Analysis Queue (needed for Phase 4).

## Phase 2: Dashboard

```bash
scripts/deploy-dashboard -d <deployment> -a <audit-dataset>
```

Creates a dashboard with: ingest trends, burn rate, projections, waste candidates, top users. See `reference/dashboard-panels.md` for details.

## Phase 3: Monitors

The contract limit is required: you must have it from preflight step 4.

### Step 1: List available notifiers

```bash
scripts/list-notifiers -d <deployment>
```

Present the list to the user and ask which notifier they want for cost alerts. If they don't want notifications, proceed without `-n`.

### Step 2: Create monitors

```bash
scripts/create-monitors -d <deployment> -a <audit-dataset> -c <contract_tb> [-n <notifier_id>]
```

Creates 3 monitors:

1. Total Ingest Guard — alerts when daily ingest >1.2x contract OR the 7-day average grows >15% vs baseline
2. Per-Dataset Spike — robust z-score detection, alerts per dataset with attribution
3. Query Cost Spike — same z-score approach for query costs (GB·ms)

The spike monitors use `notifyByGroup: true` so each dataset triggers a separate alert. See `reference/monitor-strategy.md` for threshold derivation.
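The robust z-score the spike monitors rely on can be sketched in shell (an illustration of the statistic only, not the monitors' actual APL; 1.4826 scales the MAD to match the standard deviation for normally distributed data):

```shell
# z = (x - median) / (1.4826 * MAD). Median and MAD resist outliers, so a
# spike cannot inflate its own baseline the way it would with mean/stddev.
median() { sort -n | awk '{v[NR]=$1} END{ if (NR%2) print v[(NR+1)/2]; else print (v[NR/2]+v[NR/2+1])/2 }'; }

robust_z() {  # robust_z <observation> <file with one baseline value per line>
  med=$(median < "$2")
  mad=$(awk -v m="$med" '{d=$1-m; print (d<0 ? -d : d)}' "$2" | median)
  awk -v x="$1" -v m="$med" -v mad="$mad" \
    'BEGIN { if (mad == 0) mad = 1e-9; printf "%.2f\n", (x-m) / (1.4826*mad) }'
}

# Hypothetical daily ingest (GB) for one dataset, then a 160 GB day:
printf '%s\n' 100 102 98 101 99 103 97 > baseline.txt
robust_z 160 baseline.txt   # prints 20.23 — far above any reasonable alert threshold
```

Because the baseline here has median 100 and MAD 2, a normal 103 GB day scores about z = 1, while the 160 GB spike scores ~20 and clearly triggers.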

## Phase 4: Optimization

### Get the Analysis Queue

Run `scripts/baseline-stats` if not already done. It outputs a prioritized list:

| Priority | Meaning |
| --- | --- |
| P0⛔ | Top 3 by ingest OR >10% of total — MANDATORY |
| P1 | Never queried — strong drop candidate |
| P2 | Rarely queried (Work/GB < 100) — likely waste |

Work/GB = query cost (GB·ms) / ingest (GB). Lower = less value from the data.
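As a worked example of the Work/GB arithmetic (the numbers are hypothetical; P0 depends on ingest share and is assigned separately):

```shell
classify() {  # classify <query_cost_gbms> <ingest_gb>
  awk -v q="$1" -v g="$2" 'BEGIN {
    w = q / g                      # Work/GB
    if (w == 0)       p = "P1"    # never queried
    else if (w < 100) p = "P2"    # rarely queried, likely waste
    else              p = "keep/review"
    printf "Work/GB=%.0f %s\n", w, p
  }'
}

classify 20000 500   # 20,000 GB·ms of query work on 500 GB ingest -> Work/GB=40 P2
classify 0 1200      # never queried -> Work/GB=0 P1
```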

### Analyze datasets in order

Work top-to-bottom. For each dataset:

**Step 1: Column analysis**

```bash
scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset>
```

If 0 queries → recommend DROP, move to the next dataset.

**Step 2: Field value analysis**

Pick a field from the suggested list (usually `app`, `service`, or `kubernetes.labels.app`):

```bash
scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset> -f <field>
```

Note values with high volume but never queried (⚠️ markers).

**Step 3: Handle empty values**

If `(empty)` has >5% of volume, you MUST drill down with an alternative field (e.g., `kubernetes.namespace_name`).

**Step 4: Record recommendation**

For each dataset, note: name, ingest volume, Work/GB, top unqueried values, action (DROP/SAMPLE/KEEP), estimated savings.

### Done when

All P0⛔ and P1 datasets analyzed. Then compile the report using `reference/analysis-report-template.md`.


## Cleanup

Delete monitors (list IDs, then delete each):

```bash
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
axiom-api <deployment> DELETE "/v2/monitors/<id>"
```

Delete dashboard:

```bash
dashboard-list <deployment> | grep -i cost
dashboard-delete <deployment> <id>
```

**Note:** Running `create-monitors` twice creates duplicates. Delete existing monitors first if re-deploying.

---

## Reference

### Audit Dataset Fields

| Field | Description |
| --- | --- |
| `action` | `usageCalculated` or `runAPLQueryCost` |
| `properties.hourly_ingest_bytes` | Hourly ingest in bytes |
| `properties.hourly_billable_query_gbms` | Hourly billable query cost (GB·ms) |
| `properties.dataset` | Dataset name |
| `resource.id` | Org ID |
| `actor.email` | User email |
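As a sketch of how these fields combine, a daily ingest-by-dataset query might look like the following (illustrative only; verify the field names and types against your audit dataset before relying on it):

```apl
['<audit-dataset>']
| where action == 'usageCalculated'
| extend ingest_gb = todouble(['properties.hourly_ingest_bytes']) / 1000000000
| summarize daily_gb = sum(ingest_gb) by bin(_time, 1d), dataset = tostring(['properties.dataset'])
| order by daily_gb desc
```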

### Common Fields for Value Analysis

| Dataset type | Primary field | Alternatives |
| --- | --- | --- |
| Kubernetes logs | `kubernetes.labels.app` | `kubernetes.namespace_name`, `kubernetes.container_name` |
| Application logs | `app` or `service` | `level`, `logger`, `component` |
| Infrastructure | `host` | `region`, `instance` |
| Traces | `service.name` | `span.kind`, `http.route` |

### Units & Conversions

- Scripts use TB/day
- The dashboard filter uses GB/month

| Contract | TB/day | GB/month |
| --- | --- | --- |
| 5 PB/month | 167 | 5,000,000 |
| 10 PB/month | 333 | 10,000,000 |
| 15 PB/month | 500 | 15,000,000 |
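The table's arithmetic can be checked with a one-liner (decimal units and a 30-day month, which is what produces the rounded TB/day figures above):

```shell
# PB/month -> TB/day (decimal units, 30-day month) and GB/month.
pb_month() {
  awk -v pb="$1" 'BEGIN { printf "%.0f TB/day, %d GB/month\n", pb*1000/30, pb*1000000 }'
}

pb_month 5    # prints: 167 TB/day, 5000000 GB/month
```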

### Optimization Actions

| Signal | Action |
| --- | --- |
| Work/GB = 0 | Drop or stop ingesting |
| High-volume unqueried values | Sample or reduce log level |
| Empty values from system namespaces | Filter at ingest or accept |
| Week-over-week (WoW) spike | Check recent deploys |