axiom-alerting

Axiom Alerting

You manage alerting in Axiom end-to-end: notifiers for routing and monitors for detection.

API Overview

Base URL: https://api.axiom.co/v2/ with Bearer token auth from .axiom.toml (project root or ~/.axiom.toml).
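
As a rough illustration of the Bearer token auth described above, the sketch below builds (but does not send) a request against the base URL. The helper name build_request is hypothetical and is not part of the bundled scripts; only the base URL and the Authorization scheme come from this document.

```python
import urllib.request

BASE_URL = "https://api.axiom.co/v2/"


def build_request(token, path, method="GET"):
    """Build (but do not send) an API request that carries Bearer token auth.

    build_request is a hypothetical helper used only for illustration.
    """
    url = BASE_URL + path.lstrip("/")
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method=method)


# Placeholder token taken from the sample .axiom.toml; nothing is sent here.
req = build_request("xaat-your-token", "monitors")
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose, so the sketch can be run without a valid token.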

Monitors (/v2/monitors)

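| Operation | Method | Path                      |
|-----------|--------|---------------------------|
| List      | GET    | /v2/monitors              |
| Get       | GET    | /v2/monitors/{id}         |
| History   | GET    | /v2/monitors/{id}/history |
| Create    | POST   | /v2/monitors              |
| Update    | PUT    | /v2/monitors/{id}         |
| Delete    | DELETE | /v2/monitors/{id}         |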

Notifiers (/v2/notifiers)

| Operation | Method | Path               |
|-----------|--------|--------------------|
| List      | GET    | /v2/notifiers      |
| Get       | GET    | /v2/notifiers/{id} |
| Create    | POST   | /v2/notifiers      |
| Update    | PUT    | /v2/notifiers/{id} |
| Delete    | DELETE | /v2/notifiers/{id} |
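
The monitor and notifier endpoints listed above follow the same pattern, with history available for monitors only. One way to capture that in code is a small lookup; the names ROUTES and route are hypothetical and exist only for this sketch.

```python
# Method and path template per operation, mirroring the endpoint tables.
ROUTES = {
    "list": ("GET", "/v2/{kind}"),
    "get": ("GET", "/v2/{kind}/{id}"),
    "create": ("POST", "/v2/{kind}"),
    "update": ("PUT", "/v2/{kind}/{id}"),
    "delete": ("DELETE", "/v2/{kind}/{id}"),
    "history": ("GET", "/v2/monitors/{id}/history"),
}


def route(kind, operation, object_id=""):
    """Return (method, path) for an operation on 'monitors' or 'notifiers'."""
    if kind not in ("monitors", "notifiers"):
        raise ValueError(f"unknown kind: {kind}")
    if operation not in ROUTES:
        raise ValueError(f"unknown operation: {operation}")
    if operation == "history" and kind != "monitors":
        raise ValueError("history is only listed for monitors")
    method, template = ROUTES[operation]
    if "{id}" in template and not object_id:
        raise ValueError(f"{operation} needs an id")
    return method, template.format(kind=kind, id=object_id)


print(route("monitors", "update", "abc123"))
print(route("notifiers", "list"))
```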

Prerequisites

  1. Run scripts/setup
  2. Ensure .axiom.toml has a deployment:
toml
[deployments.prod]
url = "https://api.axiom.co"
token = "xaat-your-token"
org_id = "your-org-id"

Scripts

Core:
  • scripts/axiom-api <deploy> <method> <path> [body]
Monitor scripts:
  • scripts/monitor-list <deployment> [--json]
  • scripts/monitor-get <deployment> <id>
  • scripts/monitor-history <deployment> <id> <startTime> <endTime>
  • scripts/monitor-create <deployment> <json-file>
  • scripts/monitor-update <deployment> <id> <json-file>
  • scripts/monitor-delete <deployment> <id>
Notifier scripts:
  • scripts/notifier-list <deployment> [--json]
  • scripts/notifier-get <deployment> <id>
  • scripts/notifier-create <deployment> <json-file>
  • scripts/notifier-update <deployment> <id> <json-file>
  • scripts/notifier-delete <deployment> <id>

Recommended Workflow

  1. Create notifier first.
  2. Create monitor and set notifierIds.
  3. Validate monitor behavior with monitor-history.
  4. Iterate monitor thresholds and schedule.

Workflow: End-To-End Alerting

  1. Run scripts/setup.
  2. List existing notifiers with scripts/notifier-list <deployment> and reuse one if appropriate.
  3. If no suitable notifier exists, create one with scripts/notifier-create.
  4. Create or update the monitor with notifierIds attached.
  5. Validate with scripts/monitor-history <deployment> <id> <startTime> <endTime>.
  6. If behavior is noisy or silent, tune threshold, rangeMinutes, intervalMinutes, and the N-of-M trigger fields (triggerAfterNPositiveResults, triggerFromNRuns).
  7. Re-check history after each change.
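
Steps 2 to 4 above (reuse a notifier if one fits, then attach it to the monitor) can be sketched as two small helpers. This assumes the notifier list (for example from notifier-list --json) is an array of objects with id and name fields; the id field name and both helper names are assumptions, not confirmed by this document.

```python
def pick_notifier_id(existing, wanted_name):
    """Return the id of an existing notifier with this name, or None.

    Assumption: each listed notifier is a dict with "id" and "name" keys.
    """
    for notifier in existing:
        if notifier.get("name") == wanted_name:
            return notifier.get("id")
    return None


def attach_notifier(monitor, notifier_id):
    """Return a copy of the monitor payload with notifier_id in notifierIds."""
    ids = list(monitor.get("notifierIds", []))
    if notifier_id not in ids:
        ids.append(notifier_id)
    return {**monitor, "notifierIds": ids}


# Made-up data standing in for the output of a notifier listing.
existing = [{"id": "n-1", "name": "Oncall Email"}]
found = pick_notifier_id(existing, "Oncall Email")
monitor = attach_notifier({"name": "High Error Count", "type": "Threshold"}, found)
print(monitor["notifierIds"])
```

When pick_notifier_id returns None, the workflow falls through to step 3 and a new notifier is created before the monitor is written.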

Best Practices

  • Configure one channel per notifier.
  • Use emails (not recipients) for email notifier payloads.
  • Prefer triggerAfterNPositiveResults / triggerFromNRuns for noisy signals.
  • Use explicit bin() in monitor queries; avoid bin_auto() for alert logic.
  • For metrics-backed monitors, prefer mplQuery for definitions; API responses may include both aplQuery and mplQuery.
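
The bin() guidance above is easy to check mechanically before a monitor is created. The sketch below flags bin_auto() in a query string; check_monitor_query is a hypothetical helper, and the sample queries (including the _time field and the 1m bin size) are illustrative assumptions rather than queries taken from this document.

```python
import re


def check_monitor_query(query):
    """Return a list of problems found in a monitor query string."""
    problems = []
    # bin_auto() picks the bucket size for you, which the best practices
    # above advise against for alert logic.
    if re.search(r"\bbin_auto\s*\(", query):
        problems.append("uses bin_auto(); prefer an explicit bin()")
    return problems


explicit = "['logs'] | where status >= 500 | summarize count() by bin(_time, 1m)"
automatic = "['logs'] | where status >= 500 | summarize count() by bin_auto(_time)"
print(check_monitor_query(explicit))
print(check_monitor_query(automatic))
```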

Monitor Types And Operators

Monitor types:
  • Threshold
  • MatchEvent
  • AnomalyDetection
Operators:
  • Above
  • Below
  • AboveOrEqual
  • BelowOrEqual
  • AboveOrBelow

Monitor Field Reference

Core fields:
  • name: Human-readable monitor name.
  • type: Threshold, MatchEvent, or AnomalyDetection.
  • aplQuery / mplQuery: Query evaluated by the monitor.
  • notifierIds: Array of notifier IDs to notify.
  • disabled: Whether monitor is disabled.
  • disabledUntil: Optional timestamp for temporary disable/snooze.
  • description: Optional monitor description.
Threshold and evaluation fields:
  • operator: Threshold comparison operator.
  • threshold: Numeric threshold value.
  • rangeMinutes: Query evaluation window in minutes.
  • intervalMinutes: Evaluation cadence in minutes.
  • alertOnNoData: Whether no-data should trigger alerting.
  • triggerAfterNPositiveResults: Positive evaluations required before firing.
  • triggerFromNRuns: Total evaluation runs considered for N-of-M logic.
Advanced behavior fields:
  • resolvable: Whether alerts can resolve automatically.
  • notifyByGroup: Notify per group key/value result.
  • notifyEveryRun: Notify on every positive evaluation.
  • skipResolved: Skip sending resolved notifications.
  • secondDelay: Delay (seconds) to tolerate late-arriving data.
Type-specific fields:
  • columnName: Field used by some anomaly/value-anomaly monitors.
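
The N-of-M fields are easier to reason about with a small simulation. The sketch below assumes the monitor fires when at least triggerAfterNPositiveResults of the most recent triggerFromNRuns evaluations were positive; that reading follows the field descriptions above but is an interpretation, not confirmed service behavior, and should_fire is a hypothetical name.

```python
def should_fire(results, n, m):
    """Return True when at least n of the most recent m runs were positive.

    results: evaluation outcomes, oldest first (True means positive).
    n: stands in for triggerAfterNPositiveResults.
    m: stands in for triggerFromNRuns.
    """
    if not 1 <= n <= m:
        raise ValueError("expected 1 <= n <= m")
    window = results[-m:]
    return sum(1 for positive in window if positive) >= n


# With n=2 and m=3, a single spike does not fire but two hits in three runs do.
print(should_fire([False, False, True], 2, 3))
print(should_fire([False, True, False, True], 2, 3))
```

Under this reading, raising n (or n relative to m) trades detection speed for fewer alerts on one-off spikes.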

Minimal Valid Monitor Examples

Threshold:
json
{
  "name": "High Error Count",
  "type": "Threshold",
  "aplQuery": "['logs'] | where status >= 500 | summarize count()",
  "operator": "Above",
  "threshold": 100,
  "rangeMinutes": 5,
  "intervalMinutes": 5,
  "notifierIds": ["notifier-id"],
  "triggerAfterNPositiveResults": 2,
  "triggerFromNRuns": 3,
  "disabled": false
}
MatchEvent:
json
{
  "name": "Error Event Match",
  "type": "MatchEvent",
  "aplQuery": "['logs'] | where level == 'error'",
  "rangeMinutes": 5,
  "intervalMinutes": 5,
  "notifierIds": ["notifier-id"],
  "disabled": false
}
AnomalyDetection:
json
{
  "name": "CPU Anomaly",
  "type": "AnomalyDetection",
  "aplQuery": "['metrics'] | summarize avg(cpu_usage)",
  "columnName": "cpu_usage",
  "operator": "AboveOrBelow",
  "rangeMinutes": 5,
  "intervalMinutes": 5,
  "notifierIds": ["notifier-id"],
  "disabled": false
}

Minimal Valid Notifier Examples

Email:
json
{
  "name": "Oncall Email",
  "properties": {
    "email": {
      "emails": ["oncall@example.com"]
    }
  }
}
Slack:
json
{
  "name": "Oncall Slack",
  "properties": {
    "slack": {
      "slackUrl": "https://hooks.slack.com/services/T.../B.../XXX"
    }
  }
}
Custom webhook:
json
{
  "name": "Oncall Custom Webhook",
  "properties": {
    "customWebhook": {
      "url": "https://api.example.com/alerts",
      "body": "{\"action\":\"{{.Action}}\",\"monitorID\":\"{{.MonitorID}}\"}"
    }
  }
}
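
Two details in the notifier examples above are easy to get wrong by hand: properties should hold a single channel, and the custom webhook body is a JSON document stored inside a JSON string. A minimal sketch of both follows; check_notifier is a hypothetical helper, and the checks reflect only the rules stated in this document, not the full API schema.

```python
import json


def check_notifier(payload):
    """Return a list of problems with a notifier payload (empty if none found)."""
    problems = []
    if not payload.get("name"):
        problems.append("missing name")
    properties = payload.get("properties")
    if not isinstance(properties, dict) or len(properties) != 1:
        problems.append("properties must contain exactly one channel")
        return problems
    email = properties.get("email")
    if isinstance(email, dict):
        if "recipients" in email:
            problems.append("email channel uses 'recipients'; use 'emails'")
        if not email.get("emails"):
            problems.append("email channel needs a non-empty 'emails' list")
    return problems


# Let json.dumps produce the escaped template string instead of typing
# the backslashes by hand.
body = json.dumps(
    {"action": "{{.Action}}", "monitorID": "{{.MonitorID}}"},
    separators=(",", ":"),
)
webhook = {
    "name": "Oncall Custom Webhook",
    "properties": {
        "customWebhook": {"url": "https://api.example.com/alerts", "body": body}
    },
}
print(json.dumps(webhook, indent=2))
print(check_notifier(webhook))
```

The printed payload can be saved to a file and passed to scripts/notifier-create <deployment> <json-file>.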

Troubleshooting

401 Unauthorized:
  • Cause: invalid or expired token.
  • Fix:
    • Verify the token in .axiom.toml (project root or ~/.axiom.toml).
    • Re-run scripts/setup and retry:
      • scripts/notifier-list <deployment>
403 Forbidden:
  • Cause: token lacks required permissions.
  • Fix:
    • Create/assign token scopes for monitor/notifier management and dataset query access.
    • Retry:
      • scripts/monitor-list <deployment>
404 Not Found on get/update/delete:
  • Cause: wrong monitor/notifier ID or wrong deployment/org.
  • Fix:
    • Confirm deployment in .axiom.toml.
    • Re-list objects and use exact IDs:
      • scripts/monitor-list <deployment> --json
      • scripts/notifier-list <deployment> --json
400 Bad Request on notifier create/update:
  • Cause: invalid notifier payload shape.
  • Fix:
    • Use one notifier channel inside properties.
    • For email, use emails (not recipients).
    • Validate against a known-good example and retry:
      • scripts/notifier-create <deployment> <json-file>
400 Bad Request on monitor create/update:
  • Cause: invalid monitor schema, operator/type mismatch, or invalid query fields.
  • Fix:
    • Validate required fields: name, type, query field, schedule, and notifierIds.
    • Confirm operator matches monitor type and threshold logic.
    • Retry:
      • scripts/monitor-create <deployment> <json-file>
      • scripts/monitor-update <deployment> <id> <json-file>
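
The required-field checks for monitors listed above can be run locally before retrying. The sketch below is a rough pre-flight check, not the API's actual validation: it takes "schedule" to mean rangeMinutes and intervalMinutes, and it assumes Threshold monitors need operator and threshold because the minimal example includes them. check_monitor is a hypothetical helper.

```python
MONITOR_TYPES = {"Threshold", "MatchEvent", "AnomalyDetection"}
OPERATORS = {"Above", "Below", "AboveOrEqual", "BelowOrEqual", "AboveOrBelow"}


def check_monitor(payload):
    """Return a list of problems with a monitor payload (empty if none found)."""
    problems = []
    # Assumption: "schedule" means rangeMinutes plus intervalMinutes.
    for field in ("name", "type", "rangeMinutes", "intervalMinutes", "notifierIds"):
        if field not in payload:
            problems.append(f"missing {field}")
    if not (payload.get("aplQuery") or payload.get("mplQuery")):
        problems.append("missing aplQuery or mplQuery")
    monitor_type = payload.get("type")
    if monitor_type is not None and monitor_type not in MONITOR_TYPES:
        problems.append(f"unknown type {monitor_type!r}")
    operator = payload.get("operator")
    if operator is not None and operator not in OPERATORS:
        problems.append(f"unknown operator {operator!r}")
    if monitor_type == "Threshold":
        # Assumption drawn from the minimal Threshold example.
        for field in ("operator", "threshold"):
            if field not in payload:
                problems.append(f"Threshold monitor missing {field}")
    return problems


print(check_monitor({"name": "High Error Count", "type": "Threshold"}))
```

An empty list only means the payload passes these local checks; the API may still reject it for reasons this sketch does not model.
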
Monitor created but never alerts:
  • Cause: threshold too strict, wrong query window, or not enough positive runs.
  • Fix:
    • Inspect history over a known active period:
      • scripts/monitor-history <deployment> <id> <startTime> <endTime>
    • Reduce threshold or widen rangeMinutes.
    • Tune triggerAfterNPositiveResults / triggerFromNRuns.
Too many alerts (noisy monitor):
  • Cause: threshold too low or interval too short.
  • Fix:
    • Increase threshold.
    • Increase triggerAfterNPositiveResults and/or triggerFromNRuns.
    • Increase intervalMinutes or narrow match conditions.
Notifier exists but no delivery:
  • Cause: destination config invalid (URL/key/channel/email list), or destination-side rejection.
  • Fix:
    • Fetch notifier and verify destination fields:
      • scripts/notifier-get <deployment> <id>
    • Recreate/update notifier with corrected properties:
      • scripts/notifier-update <deployment> <id> <json-file>
    • Confirm monitor references correct notifier IDs.