observability-k8s-investigation
Kubernetes Investigation
Diagnose Kubernetes issues using OTel telemetry collected via EDOT (Elastic Distribution of OpenTelemetry) and the
kube-stack collector. Correlate cluster state, pod runtime metrics, K8s events, application logs, and APM to identify
root cause across the workload, node, and control-plane layers.
Scope
In scope: OTel-receiver-namespaced indices (`metrics-kubeletstatsreceiver.otel-*`, `metrics-k8sclusterreceiver.otel-*`, `logs-k8seventsreceiver.otel-*`, `logs-k8sobjectsreceiver.otel-*`) and OTel semantic conventions (`k8s.pod.name`, `k8s.namespace.name`, `k8s.container.restarts`).
Out of scope:
- The legacy Elastic Agent Kubernetes integration (`metrics-kubernetes.*`, `logs-kubernetes.*`, `kubernetes.*` fields). It is being deprecated; do not author queries against these paths.
- APM-layer analysis (service SLO breaches, transaction error rates, upstream dependency health). Different domain: once a K8s root cause is ruled in or out, APM investigation continues outside this skill.
- Cluster provisioning, capacity planning, cost optimization. Different domain.
Guidelines
These apply to every investigation. When in doubt, re-read them before writing the synthesis.
- Absence of evidence is not evidence. Do not confabulate from empty results. If log queries return 0 rows, logs are likely not collected or the pod has no recent lines; this does not mean "dependency unavailable" or any other specific failure mode. Report `no_logs_available` and weight remaining signals accordingly.
- Empty dependency data ≠ upstream healthy. Services without APM instrumentation (load generators, workers) emit no destination metrics. Report `insufficient_dependency_data`, not "upstreams OK."
- Co-symptoms are not causes. Two services degrading simultaneously usually share an upstream, not a causal link. Only attribute causation when (a) one service's degradation clearly precedes the other's, and (b) the delta is large (>5× error rate, >3× latency).
- OOMKilled ≠ memory leak by default. The limit may simply be undersized for the workload's working set. Compare against a 7-day baseline at the same hour-of-day before claiming a leak.
- Error-termination ≠ application bug by default. Check `k8s.pod.cpu_limit_utilization` first. CFS throttling driving liveness probe timeouts is the most common misdiagnosis in this space.
- Average CPU hides throttling. A pod can look healthy at 40–60% average `cpu_limit_utilization` while being throttled severely at p99. Linux enforces CPU limits in 100ms periods; bursty workloads hit quota mid-period and stall. Look at max and p95, not just average.
- Restart count is boolean, not a counter. `k8s.container.restarts` is pulled directly from the K8s API and may be pruned by the kubelet at any time, so the absolute value is unreliable. Treat it as `== 0` (no recent restarts) vs `> 0` (recently restarting); do not derive backoff timing or "linear vs exponential" patterns from it. Confirm the restart pattern via K8s `Killing` / `BackOff` events instead.
- Prefer to report uncertainty over manufacturing confidence. If the evidence is ambiguous, the synthesis should say so. Competing hypotheses are a valid output.
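The averaging effect described in the CPU guideline can be made concrete with a small sketch. This is a toy model, not kernel behavior: it assumes a fixed quota per 100ms period, uses made-up per-period demand values, and drops unmet demand instead of carrying it into the next period. `cfs_summary` and its inputs are hypothetical names introduced for illustration.

```python
def cfs_summary(demand_ms, quota_ms):
    """Summarize per-period CPU demand against a CFS-style quota.

    demand_ms: CPU time (ms) the workload wants in each 100ms period.
    quota_ms:  CPU time (ms) the limit allows per period.
    Assumption: unmet demand is dropped, not carried over.
    """
    # CPU time actually granted in each period is capped at the quota.
    used = [min(d, quota_ms) for d in demand_ms]
    # A period counts as throttled when demand exceeded the quota.
    throttled = sum(1 for d in demand_ms if d > quota_ms)
    return {
        "avg_limit_utilization": sum(used) / (quota_ms * len(demand_ms)),
        "throttled_fraction": throttled / len(demand_ms),
    }


# Hypothetical bursty workload: one heavy period, then three light ones.
demand = [90, 10, 10, 10] * 5
summary = cfs_summary(demand, quota_ms=50)
print(summary)
```

Under these assumed numbers the pod averages 40% of its limit while a quarter of its periods hit quota, which is the pattern an average-only view misses.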
Indices and fields
Where to look
| Signal | Index pattern | Use |
|---|---|---|
| Pod/container runtime | `metrics-kubeletstatsreceiver.otel-*` | CPU, memory, network, filesystem. Utilization ratios. |
| Cluster state | `metrics-k8sclusterreceiver.otel-*` | Restarts, phase, last-terminated reason, HPA, quota, node condition |
| K8s events | `logs-k8seventsreceiver.otel-*` | Killing, BackOff, FailedScheduling, Evicted, image pull events |
| K8s object snapshots | `logs-k8sobjectsreceiver.otel-*` | Deployment/service/configmap state over time |
| Application logs | | |
| APM | | Correlate via `service.name` |
| ML anomalies | | Memory-growth, restart-rate, throttle jobs (if configured) |
Key fields
Flat OTel paths work in ES|QL. Prefer the flat form for readability; the nested `resource.attributes.*` form is for raw log documents only.
| Field | Index | What it is |
|---|---|---|
| `k8s.pod.name` | all k8s | Pod name |
| `k8s.namespace.name` | all k8s | Namespace |
| | all k8s | Container within pod |
| `k8s.deployment.name` | k8sclusterreceiver + others | Parent deployment |
| `k8s.pod.phase` | k8sclusterreceiver | Pending=1/Running=2/Succeeded=3/Failed=4/Unknown=5 |
| `k8s.container.restarts` | k8sclusterreceiver | Total container restart count |
| `k8s.container.status.last_terminated_reason` | k8sclusterreceiver | |
| | k8sclusterreceiver | Pod-level reason |
| `k8s.pod.memory_limit_utilization` | kubeletstatsreceiver | 0.0–1.0+ (can exceed 1 transiently before OOM) |
| `k8s.pod.cpu_limit_utilization` | kubeletstatsreceiver | 0.0–N (frequently >1 under CFS throttling) |
| | kubeletstatsreceiver | Bytes |
| `k8s.node.condition_memory_pressure` | k8sclusterreceiver | 1 = pressure, 0 = ok |
| | k8sclusterreceiver | 0 = NotReady |
| | k8sclusterreceiver | HPA state |
| `attributes.k8s.event.reason` | k8seventsreceiver | Event reason (filter on this) |
| `body.text` | k8seventsreceiver / logs | Event message / log message |
| `k8s.object.name` | k8seventsreceiver | involvedObject name (log attribute, use flat form) |
Field availability
Several fields above are off by default in stock kube-stack collectors and require explicit configuration. Verify
presence before relying on them; if absent, fall back as noted and call out the substitution in the synthesis.
| Field | Why it might be missing | Fall-back |
|---|---|---|
| Optional metric in k8sclusterreceiver; gated behind | Infer from K8s |
| Same — optional metric on k8sclusterreceiver. | Infer from events: |
| Only emitted when the pod has the corresponding limit set, and the kubeletstatsreceiver metric is enabled. | Compute manually as |
| Gated behind k8sclusterreceiver | Compare |
If a fall-back is used, note it in the synthesis (e.g. `(via memory.usage; limit_utilization not collected)`) so the reader knows the signal is indirect.
ES|QL gotchas
Before writing queries, know these. Each of them silently produces wrong answers rather than failing loudly.
`VALUES()`, `| first`, `MV_FIRST(VALUES(...))`, `PERCENTILE`, `histogram`, `AVG`, `aggregate_metric_double`, `AVG(transaction.duration.summary)`, `COUNT(agg_metric_double)`, `value_count`, `SUM(field)`, `AVG(field)`, `SUM(transaction.duration.summary)`
K8s metrics use flat OTel field paths in ES|QL. Use `k8s.pod.name`, not `resource.attributes.k8s.pod.name`. The nested form is for raw log documents.
Failure-mode taxonomy
Vocabulary for classification, not a decision tree. Use the pivotal-signal column to recognize which mode you're looking
at; use "Investigate" to know what else should corroborate.
Workload layer
| Mode | Pivotal signal | Investigate |
|---|---|---|
| OOMKilled | | Monotonic rise (leak) vs. load-driven spike? Compare current trend to 7-day baseline. Check heap metrics (JVM, Go, Node) for GC pressure. |
| CPU throttling → Error exit | | Liveness/readiness probe timeouts from CFS throttling. Average CPU can look fine (40–60%) while p99 throttle is severe. Check probe timeouts vs observed startup/health latency. |
| Liveness probe misconfiguration | Restarts without resource pressure; | K8s events show |
| CrashLoopBackOff (generic) | | Branch on |
| ImagePullBackOff | K8s events | Registry rate limit? Missing tag? Wrong imagePullSecret? Check recency of |
| Stuck rollout | New pods | Check |
| Termination signal race | Brief 5xx bursts correlated with rolling deploys | Endpoint removal races termination. New requests can hit the pod after SIGTERM starts. NGINX gotcha: |
Node layer
| Mode | Pivotal signal | Investigate |
|---|---|---|
| Node NotReady cascade | | Memory pressure? Disk pressure? Network partition from API server? Inspect kubelet logs, |
| Resource eviction | | Node-level noisy neighbor. QoS order: BestEffort → Burstable → Guaranteed. Identify which pod drove node memory up. |
| Node affinity/selector conflict | Mass unschedulable pods after label change | K8s events show |
Control plane
| Mode | Pivotal signal | Investigate |
|---|---|---|
| etcd I/O cascade | API server latency spike + cluster-wide kubelet heartbeat failures | Disk IOPS, fsync latency (must be <10ms). Cloud-burst-credit exhaustion is common. |
| Admission webhook block | Mass | |
| Priority preemption storm | Production pods terminating with | New |
| PDB drain deadlock | Node drain stuck indefinitely; HTTP 429 from Eviction API | PDB |
Autoscaling & admission
| Mode | Pivotal signal | Investigate |
|---|---|---|
| HPA unready-pod dampening | Load rising, HPA not scaling; unready pods included in calculation | HPA averages CPU across all replicas including unready (0% contribution). Check |
| Resource quota silent 403 | Deployment stuck at n-1/n; | Namespace quota exhausted (often CronJob accumulation). Check |
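The unready-pod dampening row can be illustrated with a deliberately simplified model. The sketch assumes utilization is expressed as a percentage of the CPU request, that unready pods count as 0%, and that a 10% tolerance band applies; scale-down is not modeled. `scale_decision` and all numbers are hypothetical, and the real HPA controller has more rules than this.

```python
def scale_decision(ready_util, unready_count, target, tolerance=0.1):
    """Return "scale_up" or "hold" under a simplified HPA model.

    ready_util:    CPU utilization (%) of each ready pod.
    unready_count: pods that exist but are not ready (counted as 0%).
    target:        target average utilization (%).
    """
    # Pressure as seen from ready pods only.
    ratio_ready = (sum(ready_util) / len(ready_util)) / target
    if ratio_ready <= 1 + tolerance:
        return "hold"  # no scale-up pressure (scale-down not modeled)
    # Recompute with unready pods diluting the average at 0%.
    total = len(ready_util) + unready_count
    ratio_all = (sum(ready_util) / total) / target
    return "scale_up" if ratio_all > 1 + tolerance else "hold"


print(scale_decision([90, 90], unready_count=0, target=60))
print(scale_decision([90, 90], unready_count=2, target=60))
```

In the second call the two unready pods pull the average from 90% to 45%, below the target, so the model holds even though every ready pod is overloaded.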
Networking
| Mode | Pivotal signal | Investigate |
|---|---|---|
| StatefulSet split-brain | Duplicate pod identities across partitioned nodes | Network partition + eviction timeout race. Two instances of same ordinal running. No fencing by default. |
| CoreDNS OOMKill | CoreDNS restarts + cluster-wide DNS timeouts in app logs | Default CoreDNS memory (~170Mi) insufficient under query amplification (ndots:5, each external lookup → ~10 lookups). |
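The query amplification mentioned for CoreDNS can be sketched by enumerating the lookups a resolver may issue for one external name. This is an approximation under stated assumptions: the search list is hypothetical (three cluster domains plus one node-inherited domain), every search-list attempt is assumed to fail before the absolute name is tried, and both A and AAAA records are requested. `dns_queries` is an illustrative helper, not a resolver API.

```python
def dns_queries(name, search_domains, ndots=5, record_types=("A", "AAAA")):
    """List (query name, record type) pairs for one lookup of `name`."""
    if name.endswith("."):
        candidates = [name]            # fully qualified: search list skipped
    elif name.count(".") >= ndots:
        candidates = [name + "."]      # tried as-is first; assumed to resolve
    else:
        # Fewer dots than ndots: search domains are tried before the name itself.
        candidates = [f"{name}.{d}." for d in search_domains] + [name + "."]
    return [(c, t) for c in candidates for t in record_types]


# Hypothetical search list for a pod in namespace "prod".
search = ["prod.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"]
queries = dns_queries("api.example.com", search)
print(len(queries))
```

With this assumed search list a single external lookup fans out to 10 queries, while the same name written with a trailing dot produces 2.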
When classification is ambiguous
Real incidents often match two modes. Examples:
- OOMKilled pod with simultaneous CPU throttling — memory usually drives the kill, but verify by checking whether memory or CPU hit limit first.
- Stuck rollout with HPA dampening and resource quota near-exhaustion — both can freeze a deploy. Check which constraint is binding.
- Node NotReady with pods that were already crashing — the node issue may be incidental.
When two modes fit, name both in the synthesis and say which one you believe is causal and why. Do not force a single
hypothesis when the evidence supports two.
Signal interpretation
Memory
- Monotonic rise over 30–60 min → leak. Check GC metrics for the language: JVM `jvm.gc.duration`, Go `process.runtime.go.gc.pause_ns`, Node `v8js_gc_duration`. Rising GC frequency/pause with stable live-set is the canonical leak signature.
- Diurnal / load-correlated spikes → load-driven, not leak. Consider HPA tuning or limit increase.
- Hits 1.0, then restart → OOMKilled confirmed. Exit code 137 (SIGKILL) in app logs consistent.
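The leak-versus-load distinction above can be sketched as a check on a series of memory utilization samples. This is a minimal illustration, assuming evenly spaced samples covering the 30–60 minute window; the tolerance and minimum-growth values are arbitrary assumptions, and `classify_memory_trend` is a hypothetical helper, not part of any collector or Elastic API.

```python
def looks_monotonic(samples, tolerance=0.01):
    """True if no sample drops more than `tolerance` below its predecessor."""
    return all(b >= a - tolerance for a, b in zip(samples, samples[1:]))


def classify_memory_trend(samples, min_growth=0.10):
    """Label a utilization series as a possible leak or as load-driven/stable."""
    growth = samples[-1] - samples[0]
    if looks_monotonic(samples) and growth >= min_growth:
        return "possible_leak"
    return "load_driven_or_stable"


rising = [0.52, 0.58, 0.63, 0.71, 0.80, 0.88]   # hypothetical steady climb
spiky = [0.50, 0.80, 0.55, 0.82, 0.50]          # hypothetical load spikes
print(classify_memory_trend(rising), classify_memory_trend(spiky))
```

A "possible_leak" label here is only a prompt to run the 7-day baseline comparison, not a conclusion.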
CPU
- `cpu_limit_utilization > 1.0` sustained → CFS throttling. Node has spare CPU; the pod is quota-blocked.
- Symptoms of throttling (not the throttle metric itself): liveness probe timeouts, p99 latency 4–16× p50, queue backpressure upstream, Error-reason container terminations.
- Average can look healthy while p95 is throttled. Do not trust average alone.
Restart patterns
- `restarts > 0` recently → workload has been restarting. Don't read magnitude into the count (see Restart count is boolean); confirm the pattern from K8s `Killing` / `BackOff` event timestamps in `logs-k8seventsreceiver.otel-*`.
- Restarts correlated with memory pressure (`memory_limit_utilization → 1.0`) → OOMKilled path.
- Restarts without memory/CPU pressure → probe misconfig, app bug, or startup dependency failure. Pull events for `Unhealthy` and `Killing`.
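Since the restart pattern should come from event timestamps and not from the counter, a sketch of that derivation may help. It assumes the timestamps have already been pulled from the events index as ISO-8601 strings; the sample values are invented and `restart_gaps` is a hypothetical helper.

```python
from datetime import datetime


def restart_gaps(event_times):
    """Seconds between consecutive restart-related events, oldest first."""
    ts = sorted(datetime.fromisoformat(t) for t in event_times)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]


# Hypothetical BackOff event timestamps for one pod.
events = [
    "2024-05-01T10:00:00",
    "2024-05-01T10:00:20",
    "2024-05-01T10:01:00",
    "2024-05-01T10:02:20",
]
gaps = restart_gaps(events)
print(gaps)  # [20.0, 40.0, 80.0]
```

Gaps that roughly double suggest backoff is in effect; roughly constant gaps point more toward a periodic trigger such as a probe.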
Termination reasons
- `OOMKilled` → memory path.
- `Error` → non-zero exit. Check app logs; if empty/minimal, check CPU throttling before attributing to app logic.
- `Completed` → ran to completion. Normal for Jobs/CronJobs/init containers; anomalous otherwise.
- `ContainerCannotRun` → runtime/image/exec issue. Check image pull events.
Investigation flow
An investigation is not a checklist. The sections below describe a typical arc — compress, skip, or revisit them based on what you find. Terminate as soon as you have enough evidence to synthesize at a known confidence. Chasing signals past the point of diminishing returns is a failure mode, not thoroughness.
Orient
Resolve the target: `k8s.pod.name`, `k8s.namespace.name`, optionally `k8s.deployment.name` and `service.name`. If no time window is given, default to the last hour for pod-level investigations, last 2 hours for event correlation, last 6 hours for ongoing/unresolved incidents.
If the alert payload already tells you the failure mode (e.g., it fires specifically on `OOMKilled`), note that and skip classification; move to confirmation and baseline comparison.
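The default windows can be captured in a small lookup, sketched below. The dictionary keys and the `resolve_window` name are assumptions made for illustration; only the three durations come from the text.

```python
from datetime import timedelta

# Default look-back windows from the Orient step (key names are hypothetical).
DEFAULT_WINDOWS = {
    "pod": timedelta(hours=1),
    "event_correlation": timedelta(hours=2),
    "ongoing_incident": timedelta(hours=6),
}


def resolve_window(kind, requested=None):
    """Use the caller's window when given, otherwise the default for `kind`."""
    return requested if requested is not None else DEFAULT_WINDOWS[kind]


print(resolve_window("pod"))
print(resolve_window("pod", requested=timedelta(minutes=15)))
```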
Characterize
Get the shape of the workload's recent behavior: restart count, termination reasons, phase, utilization. One or two queries usually suffice.
```esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>"
    AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts),
    term_reasons = VALUES(k8s.container.status.last_terminated_reason),
    phase = MAX(k8s.pod.phase)
```
```esql
FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND @timestamp > NOW() - 15 minutes
| STATS mem_pct = ROUND(MAX(k8s.pod.memory_limit_utilization) * 100, 1),
    cpu_pct = ROUND(MAX(k8s.pod.cpu_limit_utilization) * 100, 1)
```
Classify
Use the taxonomy. The pivotal signal should match; the "Investigate" column tells you what corroboration to seek.
When two modes fit, note both and proceed with the one that has the stronger pivotal signal. You may revise during
corroboration.
Corroborate
Pull the evidence your classification predicts you'll find. Typical sources:
K8s events for the namespace and window:
```esql
FROM logs-k8seventsreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>"
    AND @timestamp > NOW() - 2 hours
    AND attributes.k8s.event.reason IN (
      "BackOff", "Killing", "Unhealthy", "Failed",
      "FailedScheduling", "Evicted", "SuccessfulRescale",
      "Pulling", "Pulled", "Started", "Created"
    )
| SORT @timestamp DESC
| KEEP @timestamp, attributes.k8s.event.reason, body.text, k8s.object.name
| LIMIT 30
```
Application logs if available: look at the 200 most recent lines before the termination timestamp. If absent, flag `no_logs_available`; do not invent a log pattern.
APM if the pod runs an instrumented service: resolve `service.name` from pod resource attributes for later correlation. SLO / latency / error-rate analysis itself is APM-layer work and out of scope for this skill.
Baseline comparison: for utilization-based findings, compare current values to 7-day-prior at the same hour-of-day. "High memory" is meaningful only relative to what's normal for this workload.
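The same-hour, 7-day-prior baseline can be sketched as a window shift plus a ratio. This is an illustration only: `baseline_window` and `ratio_to_baseline` are hypothetical helpers, the timestamps are made up, and how the baseline value is actually queried is left out.

```python
from datetime import datetime, timedelta


def baseline_window(start, end, days_back=7):
    """Shift the investigation window back so hour-of-day and weekday match."""
    shift = timedelta(days=days_back)
    return start - shift, end - shift


def ratio_to_baseline(current, baseline):
    """Current value relative to baseline; None when no usable baseline exists."""
    if not baseline:
        return None
    return current / baseline


start = datetime(2024, 5, 8, 14, 0)
end = datetime(2024, 5, 8, 15, 0)
b_start, b_end = baseline_window(start, end)
print(b_start, b_end, ratio_to_baseline(0.92, 0.46))
```

Returning `None` for a missing baseline keeps "no baseline available" distinct from "matches baseline", in line with the absence-of-evidence guideline.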
Check for upstream cause (conditional)
Only pursue if the symptom pattern suggests it. Threshold: upstream error rate >5× baseline or latency >3× baseline, AND degradation started before the symptom on the target service. Co-symptoms do not establish causation.
If `metrics-service_destination.1m.otel-default` has no rows for the service, report `insufficient_dependency_data`, not "upstreams healthy."
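The threshold rule above can be written as a small decision function. This is a sketch under assumptions: onsets are any comparable timestamps (epoch seconds in the example), ratios are current-over-baseline values computed elsewhere, and the function name and the "plausible_cause" / "co_symptom" labels are invented for illustration.

```python
def upstream_verdict(upstream_onset, target_onset,
                     error_ratio=None, latency_ratio=None):
    """Apply the >5x error / >3x latency and precedence test to one upstream."""
    # No dependency metrics at all: do not infer health from silence.
    if error_ratio is None and latency_ratio is None:
        return "insufficient_dependency_data"
    large_delta = (error_ratio or 0) > 5 or (latency_ratio or 0) > 3
    preceded = upstream_onset < target_onset
    return "plausible_cause" if (large_delta and preceded) else "co_symptom"


# Hypothetical onsets in epoch seconds.
print(upstream_verdict(1000, 1300, error_ratio=8.0))
print(upstream_verdict(1300, 1000, error_ratio=8.0))
print(upstream_verdict(1000, 1300))
```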
Check for recent change (conditional)
`SuccessfulCreate` / `Pulled` events in the last 2 hours usually correlate with a rollout. `logs-k8sobjectsreceiver.otel-*` shows configuration changes to configmaps, secrets, and deployments. A change within 15 minutes before symptom onset is a strong correlation, but still only a correlation: verify that the change plausibly explains the failure mode you classified.
Synthesize and stop
Synthesize as soon as you have enough evidence to support a hypothesis at known confidence. You do not need to complete
every section above — investigation terminates when either:
- You have a high-confidence hypothesis with corroboration, or
- You have a low/medium-confidence hypothesis and further queries are unlikely to change the picture (e.g., logs are unavailable, APM isn't instrumented, no recent changes found).
Synthesis
Default structure:
```text
HYPOTHESIS (confidence: high | medium | low)
<One paragraph: service, symptom, most likely cause. Name the failure mode from the taxonomy.>

EVIDENCE
- <Finding from characterization, with the concrete metric or value.>
- <Finding from events / logs / APM.>
- <Finding from baseline comparison, dependency check, or change correlation if pursued.>

CONFIDENCE NOTE
<Only if not 'high'. What specific evidence is missing or ambiguous.>

RECOMMENDED NEXT STEPS
1. <Most actionable, typically a config check or metric to observe.>
2. <Secondary.>

DOWNSTREAM IMPACT
<Services depending on this workload, or 'No downstream dependencies identified.'>
```
When two hypotheses are live: replace HYPOTHESIS with COMPETING HYPOTHESES; list both, say which you lean toward and why, and list the evidence that would disambiguate them.
When no incident is found (symptom resolved, or alert appears spurious): say so directly. `ALERT FIRED BUT SYSTEM APPEARS HEALTHY` is a valid output. List what you checked and what you didn't find.
Confidence calibration
Start at high and downgrade based on what's missing:
- Downgrade to medium if: primary signal is clear but corroboration is missing (no logs, no APM, no baseline comparison possible). Or: two modes fit and you can't disambiguate.
- Downgrade to low if: only a single signal supports the hypothesis, signals conflict, or the mode requires evidence you couldn't fetch.
Never return high when application log data was absent and the hypothesis depends on application behavior. Absence
of evidence does not corroborate a hypothesis.
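The downgrade rules can be sketched as a function. This encodes one reading of the rules above, and the parameter names are assumptions: "only a single signal" is modeled as `supporting_signals <= 1`, and "corroboration missing" as a boolean the caller sets when no logs, APM, or baseline comparison were available.

```python
def calibrate_confidence(supporting_signals, corroborated,
                         ambiguous_modes=False, signals_conflict=False,
                         missing_required_evidence=False,
                         depends_on_app_behavior=False, app_logs_present=True):
    """Start at high and downgrade according to what is missing."""
    if supporting_signals <= 1 or signals_conflict or missing_required_evidence:
        return "low"
    if not corroborated or ambiguous_modes:
        return "medium"
    # Never high when the hypothesis rests on app behavior and logs are absent.
    if depends_on_app_behavior and not app_logs_present:
        return "medium"
    return "high"


print(calibrate_confidence(3, corroborated=True))
print(calibrate_confidence(3, corroborated=False))
print(calibrate_confidence(1, corroborated=True))
```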
Query recipes
Most-restarting pods in a namespace
```esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>" AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts) BY k8s.pod.name, k8s.container.status.last_terminated_reason
| WHERE restarts > 0
| SORT restarts DESC
| LIMIT 20
```
CPU throttling check for a pod
```esql
FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND @timestamp > NOW() - 30 minutes
| STATS max_cpu_ratio = ROUND(MAX(k8s.pod.cpu_limit_utilization), 2),
    avg_cpu_ratio = ROUND(AVG(k8s.pod.cpu_limit_utilization), 2),
    max_cpu_cores = ROUND(MAX(k8s.pod.cpu.usage), 3)
```
Sustained ratio >1.0 = throttling. Transient >1.0 with avg <0.5 is usually benign burst.
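Reading the two ratios from this recipe can be sketched as follows. The mapping is an assumption layered on the rule of thumb above: an average above 1.0 is treated as a proxy for "sustained", and anything between the two stated cases is reported as inconclusive. `classify_cpu_ratio` and its labels are hypothetical.

```python
def classify_cpu_ratio(max_ratio, avg_ratio):
    """Interpret max/avg cpu_limit_utilization from a 30-minute window."""
    if max_ratio <= 1.0:
        return "no_throttling_signal"
    if avg_ratio > 1.0:
        return "sustained_throttling"   # assumption: avg > 1.0 stands in for "sustained"
    if avg_ratio < 0.5:
        return "likely_benign_burst"
    return "inconclusive"               # check p95 and probe timeouts before deciding


print(classify_cpu_ratio(1.4, 1.1))
print(classify_cpu_ratio(1.2, 0.3))
print(classify_cpu_ratio(1.2, 0.7))
```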
Nodes under memory pressure (right now)
```esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE @timestamp > NOW() - 15 minutes AND k8s.node.condition_memory_pressure == 1
| STATS ts = MAX(@timestamp) BY k8s.node.name
| SORT ts DESC
```
Admission denials (webhook or quota) last hour
```esql
FROM logs-k8seventsreceiver.otel-*
| WHERE @timestamp > NOW() - 1 hour
    AND (attributes.k8s.event.reason == "FailedCreate"
      OR body.text LIKE "*admission webhook*"
      OR body.text LIKE "*exceeded quota*")
| SORT @timestamp DESC
| KEEP @timestamp, k8s.namespace.name, attributes.k8s.event.reason, body.text
| LIMIT 30
```
Firing K8s alerts
```text
GET /api/alerting/rules/_find?search=k8s&search_fields=tags&filter=alert.attributes.executionStatus.status:active
```
Examples
"Why is my pod CrashLoopBackOff-ing?"
Characterize first: get restart count, termination reason, memory and CPU utilization.
- If `last_terminated_reason == "OOMKilled"` and memory utilization hit 1.0 → memory path. Corroborate with 7-day baseline: monotonic rise over days = leak; spiky = load-driven. Check GC metrics if language is known.
- If `last_terminated_reason == "Error"` and `cpu_limit_utilization > 1.0` → CPU throttling path. Corroborate with liveness probe config (initialDelaySeconds, timeoutSeconds) and K8s events for `Unhealthy`.
- If `last_terminated_reason == "Error"` and CPU is fine → application-logic path. Pull recent logs before termination.
- If `last_terminated_reason == "ContainerCannotRun"` → image/exec path. Check K8s events for `Failed` pull events.
Synthesize with appropriate confidence. If logs were unavailable on the Error path, downgrade to medium and say so.
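The branching above can be sketched as a single routing function. The function name, the path labels, and the choice to return "unclassified" when a needed utilization value is missing are assumptions made for illustration; the conditions themselves follow the list.

```python
def crashloop_path(reason, mem_util=None, cpu_util=None):
    """Map last_terminated_reason plus utilization ratios to an investigation path."""
    if reason == "OOMKilled" and mem_util is not None and mem_util >= 1.0:
        return "memory"
    if reason == "Error":
        if cpu_util is None:
            return "unclassified"       # check CPU before blaming app logic
        return "cpu_throttling" if cpu_util > 1.0 else "application_logic"
    if reason == "ContainerCannotRun":
        return "image_or_exec"
    return "unclassified"


print(crashloop_path("OOMKilled", mem_util=1.0))
print(crashloop_path("Error", cpu_util=1.3))
print(crashloop_path("Error", cpu_util=0.4))
```

An "unclassified" result is a cue to gather the missing signal or to report competing hypotheses, not to pick a path by default.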
"Is my rollout stuck?"
Authoritative signal: `k8s.deployment.available < k8s.deployment.desired` for > 10 minutes.
Diagnose the constraint:
- K8s events on the new ReplicaSet: `FailedCreate` → admission rejection (quota, webhook, PSP). `FailedScheduling` → no node fits.
- New-pod utilization: all at 0% memory → never started (image pull failure); high CPU with low memory → slow startup hitting readiness probe.
- HPA state: `current_replicas < desired_replicas` stable under load → unready-pod dampening.
"Alert fired but everything looks healthy"
Possible and worth naming explicitly. Check:
- Has the symptom resolved? Compare current utilization/restart rate to the alert trigger point.
- Was the alert a transient spike that's already decayed?
- Is the alert tuned appropriately (e.g., too-short evaluation window)?
Output: `ALERT FIRED BUT SYSTEM APPEARS HEALTHY` with what you checked. Recommend alert tuning if the pattern is recurrent.
Related
- Workflow: `K8s CrashLoopBackOff Investigation`. Alert-triggered automated version of the pod-level path above. Runs deterministic ES|QL + branches; this skill provides the interpretation layer the workflow lacks.
- Forge genome library: 16 K8s failure scenarios (OOMKill cascade, CPU throttling, probe misconfig, node NotReady, admission webhook block, etc.) validating this skill's coverage.