
# Axiom Cost Control


Dashboards, monitors, and waste identification for Axiom usage optimization.

## Before You Start


1. Load required skills:
   - `skill: axiom-sre`
   - `skill: building-dashboards`

   building-dashboards provides: `dashboard-list`, `dashboard-get`, `dashboard-create`, `dashboard-update`, `dashboard-delete`.

2. Find the audit dataset. Try `axiom-audit` first:

   ```apl
   ['axiom-audit']
   | where _time > ago(1h)
   | summarize count() by action
   | where action in ('usageCalculated', 'runAPLQueryCost')
   ```

   - If not found → ask the user. Common names: `axiom-audit-logs-view`, `audit-logs`
   - If found but no `usageCalculated` events → wrong dataset, ask the user

3. Verify `axiom-history` access (required for Phase 4):

   ```apl
   ['axiom-history'] | where _time > ago(1h) | take 1
   ```

   If not found, Phase 4 optimization will not work.

4. Confirm with the user:
   - Deployment name?
   - Audit dataset name?
   - Contract limit in TB/day? (required for Phase 3 monitors)

5. Replace `<deployment>` and `<audit-dataset>` in all commands below.

Tips:

- Run any script with `-h` for full usage
- Do NOT pipe script output to `head` or `tail` — causes SIGPIPE errors
- Requires `jq` for JSON parsing
- Use axiom-sre's `axiom-query` for ad-hoc APL, not the CLI directly

## Which Phases to Run

| User request | Run these phases |
| --- | --- |
| "reduce costs" / "find waste" | 0 → 1 → 4 |
| "set up cost control" | 0 → 1 → 2 → 3 |
| "deploy dashboard" | 0 → 2 |
| "create monitors" | 0 → 3 |
| "check for drift" | 0 only |

## Phase 0: Check Existing Setup

Existing dashboard?

```bash
dashboard-list <deployment> | grep -i cost
```

Existing monitors?

```bash
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
```

If found, fetch with `dashboard-get` and compare to `templates/dashboard.json` for drift.
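The drift comparison can be sketched with `jq` and `diff`. The JSON payloads and the volatile fields stripped below (`id`, `updated`) are illustrative stand-ins; adjust them to whatever `dashboard-get` actually returns:

```shell
# Stand-ins for `dashboard-get` output and templates/dashboard.json
# (field names are assumptions for illustration):
cat > live.json <<'EOF'
{"id":"abc123","name":"Cost Control","refreshTime":30,"updated":"2024-01-01"}
EOF
cat > template.json <<'EOF'
{"name":"Cost Control","refreshTime":60}
EOF

# Sort keys and drop volatile fields so only real config drift shows up.
normalize() { jq -S 'del(.id, .updated)' "$1"; }

normalize template.json > template.norm
normalize live.json > live.norm

if diff -u template.norm live.norm; then
  echo "no drift"
else
  echo "drift detected"
fi
```

Sorting keys with `jq -S` before diffing keeps key-order differences from being reported as drift.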

---

## Phase 1: Discovery

```bash
scripts/baseline-stats -d <deployment> -a <audit-dataset>
```

Captures daily ingest stats and produces the Analysis Queue (needed for Phase 4).

## Phase 2: Dashboard

```bash
scripts/deploy-dashboard -d <deployment> -a <audit-dataset>
```

Creates a dashboard with: ingest trends, burn rate, projections, waste candidates, top users. See `reference/dashboard-panels.md` for details.

## Phase 3: Monitors

The contract limit is required: you must have it from preflight step 4.

### Step 1: List available notifiers

```bash
scripts/list-notifiers -d <deployment>
```

Present the list to the user and ask which notifier they want for cost alerts. If they don't want notifications, proceed without `-n`.

### Step 2: Create monitors

```bash
scripts/create-monitors -d <deployment> -a <audit-dataset> -c <contract_tb> [-n <notifier_id>]
```

Creates 3 monitors:

1. Total Ingest Guard — alerts when daily ingest >1.2x contract OR the 7-day average grows >15% vs baseline
2. Per-Dataset Spike — robust z-score detection, alerts per dataset with attribution
3. Query Cost Spike — same z-score approach for query costs (GB·ms)

The spike monitors use `notifyByGroup: true` so each dataset triggers a separate alert. See `reference/monitor-strategy.md` for threshold derivation.
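The robust z-score the spike monitors rely on can be sketched in shell (an illustration of the statistic only, not the monitors' actual APL; 1.4826 scales the MAD to match the standard deviation for normally distributed data):

```shell
# z = (x - median) / (1.4826 * MAD). Median and MAD resist outliers, so a
# spike cannot inflate its own baseline the way it would with mean/stddev.
median() { sort -n | awk '{v[NR]=$1} END{ if (NR%2) print v[(NR+1)/2]; else print (v[NR/2]+v[NR/2+1])/2 }'; }

robust_z() {  # robust_z <observation> <file with one baseline value per line>
  med=$(median < "$2")
  mad=$(awk -v m="$med" '{d=$1-m; print (d<0 ? -d : d)}' "$2" | median)
  awk -v x="$1" -v m="$med" -v mad="$mad" \
    'BEGIN { if (mad == 0) mad = 1e-9; printf "%.2f\n", (x-m) / (1.4826*mad) }'
}

# Hypothetical daily ingest (GB) for one dataset, then a 160 GB day:
printf '%s\n' 100 102 98 101 99 103 97 > baseline.txt
robust_z 160 baseline.txt   # prints 20.23 — far above any reasonable alert threshold
```

Because the baseline here has median 100 and MAD 2, a normal 103 GB day scores about z = 1, while the 160 GB spike scores ~20 and clearly triggers.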

## Phase 4: Optimization

### Get the Analysis Queue

Run `scripts/baseline-stats` if not already done. It outputs a prioritized list:

| Priority | Meaning |
| --- | --- |
| P0⛔ | Top 3 by ingest OR >10% of total — MANDATORY |
| P1 | Never queried — strong drop candidate |
| P2 | Rarely queried (Work/GB < 100) — likely waste |

Work/GB = query cost (GB·ms) / ingest (GB). Lower = less value from the data.
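As a worked example of the Work/GB arithmetic (the numbers are hypothetical; P0 depends on ingest share and is assigned separately):

```shell
classify() {  # classify <query_cost_gbms> <ingest_gb>
  awk -v q="$1" -v g="$2" 'BEGIN {
    w = q / g                      # Work/GB
    if (w == 0)       p = "P1"    # never queried
    else if (w < 100) p = "P2"    # rarely queried, likely waste
    else              p = "keep/review"
    printf "Work/GB=%.0f %s\n", w, p
  }'
}

classify 20000 500   # 20,000 GB·ms of query work on 500 GB ingest -> Work/GB=40 P2
classify 0 1200      # never queried -> Work/GB=0 P1
```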

### Analyze datasets in order

Work top-to-bottom. For each dataset:

**Step 1: Column analysis**

```bash
scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset>
```

If 0 queries → recommend DROP, move to the next dataset.

**Step 2: Field value analysis**

Pick a field from the suggested list (usually `app`, `service`, or `kubernetes.labels.app`):

```bash
scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset> -f <field>
```

Note values with high volume but never queried (⚠️ markers).

**Step 3: Handle empty values**

If `(empty)` has >5% of volume, you MUST drill down with an alternative field (e.g., `kubernetes.namespace_name`).

**Step 4: Record recommendation**

For each dataset, note: name, ingest volume, Work/GB, top unqueried values, action (DROP/SAMPLE/KEEP), estimated savings.

### Done when

All P0⛔ and P1 datasets analyzed. Then compile the report using `reference/analysis-report-template.md`.


## Cleanup

Delete monitors (list IDs, then delete each):

```bash
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
axiom-api <deployment> DELETE "/v2/monitors/<id>"
```

Delete dashboard:

```bash
dashboard-list <deployment> | grep -i cost
dashboard-delete <deployment> <id>
```

**Note:** Running `create-monitors` twice creates duplicates. Delete existing monitors first if re-deploying.

---

## Reference

### Audit Dataset Fields

| Field | Description |
| --- | --- |
| `action` | `usageCalculated` or `runAPLQueryCost` |
| `properties.hourly_ingest_bytes` | Hourly ingest in bytes |
| `properties.hourly_billable_query_gbms` | Hourly billable query cost (GB·ms) |
| `properties.dataset` | Dataset name |
| `resource.id` | Org ID |
| `actor.email` | User email |
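As a sketch of how these fields combine, a daily ingest-by-dataset query might look like the following (illustrative only; verify the field names and types against your audit dataset before relying on it):

```apl
['<audit-dataset>']
| where action == 'usageCalculated'
| extend ingest_gb = todouble(['properties.hourly_ingest_bytes']) / 1000000000
| summarize daily_gb = sum(ingest_gb) by bin(_time, 1d), dataset = tostring(['properties.dataset'])
| order by daily_gb desc
```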

### Common Fields for Value Analysis

| Dataset type | Primary field | Alternatives |
| --- | --- | --- |
| Kubernetes logs | `kubernetes.labels.app` | `kubernetes.namespace_name`, `kubernetes.container_name` |
| Application logs | `app` or `service` | `level`, `logger`, `component` |
| Infrastructure | `host` | `region`, `instance` |
| Traces | `service.name` | `span.kind`, `http.route` |

### Units & Conversions

- Scripts use TB/day
- The dashboard filter uses GB/month

| Contract | TB/day | GB/month |
| --- | --- | --- |
| 5 PB/month | 167 | 5,000,000 |
| 10 PB/month | 333 | 10,000,000 |
| 15 PB/month | 500 | 15,000,000 |
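The table's arithmetic can be checked with a one-liner (decimal units and a 30-day month, which is what produces the rounded TB/day figures above):

```shell
# PB/month -> TB/day (decimal units, 30-day month) and GB/month.
pb_month() {
  awk -v pb="$1" 'BEGIN { printf "%.0f TB/day, %d GB/month\n", pb*1000/30, pb*1000000 }'
}

pb_month 5    # prints: 167 TB/day, 5000000 GB/month
```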

### Optimization Actions

| Signal | Action |
| --- | --- |
| Work/GB = 0 | Drop or stop ingesting |
| High-volume unqueried values | Sample or reduce log level |
| Empty values from system namespaces | Filter at ingest or accept |
| Week-over-week (WoW) spike | Check recent deploys |