observability-k8s-investigation

Kubernetes Investigation

Diagnose Kubernetes issues using OTel telemetry collected via EDOT (Elastic Distribution of OpenTelemetry) and the kube-stack collector. Correlate cluster state, pod runtime metrics, K8s events, application logs, and APM to identify root cause across the workload, node, and control-plane layers.

Scope

In scope: OTel-receiver-namespaced indices (`metrics-kubeletstatsreceiver.otel-*`, `metrics-k8sclusterreceiver.otel-*`, `logs-k8seventsreceiver.otel-*`, `logs-k8sobjectsreceiver.otel-*`) and OTel semantic conventions (`k8s.pod.name`, `k8s.namespace.name`, `k8s.container.restarts`).

Out of scope:

The legacy Elastic Agent Kubernetes integration (`metrics-kubernetes.*`, `logs-kubernetes.*`, `kubernetes.*` fields). It is being deprecated; do not author queries against these paths.
APM-layer analysis (service SLO breaches, transaction error rates, upstream dependency health). Different domain — once a K8s root cause is ruled in or out, APM investigation continues outside this skill.
Cluster provisioning, capacity planning, cost optimization. Different domain.

Guidelines

These apply to every investigation. When in doubt, re-read them before writing the synthesis.

Absence of evidence is not evidence. Do not confabulate from empty results. If log queries return 0 rows, logs are likely not collected or the pod has no recent lines; this does not mean "dependency unavailable" or any other specific failure mode. Report `no_logs_available` and weight remaining signals accordingly.

Empty dependency data ≠ upstream healthy. Services without APM instrumentation (load generators, workers) emit no destination metrics. Report `insufficient_dependency_data`, not "upstreams OK."

Co-symptoms are not causes. Two services degrading simultaneously usually share an upstream, not a causal link. Only attribute causation when (a) one service's degradation clearly precedes the other's, and (b) the delta is large (>5× error rate, >3× latency).

OOMKilled ≠ memory leak by default. The limit may simply be undersized for the workload's working set. Compare against a 7-day baseline at the same hour-of-day before claiming a leak.

Error-termination ≠ application bug by default. Check `k8s.pod.cpu_limit_utilization` first. CFS throttling driving liveness probe timeouts is the most common misdiagnosis in this space.

Average CPU hides throttling. A pod can look healthy at 40–60% average `cpu_limit_utilization` while being throttled severely at p99. Linux enforces CPU limits in 100ms periods; bursty workloads hit quota mid-period and stall. Look at max and p95, not just average.

Restart count is boolean, not a counter. `k8s.container.restarts` is pulled directly from the K8s API and can be reset at any time when the kubelet prunes dead containers, so the absolute value is unreliable. Treat it as `== 0` (no recent restarts) vs `> 0` (recently restarting); do not derive backoff timing or "linear vs exponential" patterns from it. Confirm the restart pattern via K8s `Killing` / `BackOff` events instead.

Prefer to report uncertainty over manufacturing confidence. If the evidence is ambiguous, the synthesis should say so. Competing hypotheses are a valid output.

Indices and fields

Where to look

Signal	Index pattern	Use
Pod/container runtime	`metrics-kubeletstatsreceiver.otel-*`	CPU, memory, network, filesystem. Utilization ratios.
Cluster state	`metrics-k8sclusterreceiver.otel-*`	Restarts, phase, last-terminated reason, HPA, quota, node condition
K8s events	`logs-k8seventsreceiver.otel-*`	Killing, BackOff, FailedScheduling, Evicted, image pull events
K8s object snapshots	`logs-k8sobjectsreceiver.otel-*`	Deployment/service/configmap state over time
Application logs	`logs-*.otel-*`	`body.text`, `severity_text`, filtered by `k8s.pod.name`
APM	`traces-*.otel-*`, `metrics-service_*.otel-default`	Correlate via `service.name` + K8s resource attrs
ML anomalies	`.ml-anomalies-*`	Memory-growth, restart-rate, throttle jobs (if configured)

Key fields

Flat OTel paths work in ES|QL. Prefer the flat form for readability; the nested `resource.attributes.*` form is for raw log documents only.

Field	Index	What it is
`k8s.pod.name`	all k8s	Pod name
`k8s.namespace.name`	all k8s	Namespace
`k8s.container.name`	all k8s	Container within pod
`k8s.deployment.name`	k8sclusterreceiver + others	Parent deployment
`k8s.pod.phase`	k8sclusterreceiver	Pending=1/Running=2/Succeeded=3/Failed=4/Unknown=5
`k8s.container.restarts`	k8sclusterreceiver	Container restart count as reported by the K8s API; treat as 0 vs >0 (see Guidelines)
`k8s.container.status.last_terminated_reason`	k8sclusterreceiver	`OOMKilled` , `Error` , `Completed` , `ContainerCannotRun`
`k8s.pod.status_reason`	k8sclusterreceiver	Pod-level reason ( `Evicted` , `NodeLost` )
`k8s.pod.memory_limit_utilization`	kubeletstatsreceiver	0.0–1.0+ (can exceed 1 transiently before OOM)
`k8s.pod.cpu_limit_utilization`	kubeletstatsreceiver	0.0–N (frequently >1 under CFS throttling)
`k8s.pod.memory.usage` / `.working_set`	kubeletstatsreceiver	Bytes
`k8s.node.condition_memory_pressure`	k8sclusterreceiver	1 = pressure, 0 = ok
`k8s.node.condition_ready`	k8sclusterreceiver	0 = NotReady
`k8s.hpa.current_replicas` / `.desired_replicas`	k8sclusterreceiver	HPA state
`attributes.k8s.event.reason`	k8seventsreceiver	Event reason (filter on this)
`body.text`	k8seventsreceiver / logs	Event message / log message
`k8s.object.name`	k8seventsreceiver	involvedObject name (log attribute, use flat form)

Field availability

Several fields above are off by default in stock kube-stack collectors and require explicit configuration. Verify presence before relying on them; if absent, fall back as noted and call out the substitution in the synthesis.

Field	Why it might be missing	Fall-back
`k8s.container.status.last_terminated_reason`	Optional metric in k8sclusterreceiver; gated behind `metrics_collected.metadata` config.	Infer from K8s `Killing` / `OOMKilling` events in `logs-k8seventsreceiver.otel-*` and exit codes in app logs.
`k8s.pod.status_reason`	Same — optional metric on k8sclusterreceiver.	Infer from events: `Evicted` , `NodeLost` , `Preempted` .
`k8s.pod.cpu_limit_utilization` / `memory_limit_utilization`	Only emitted when the pod has the corresponding limit set, and the kubeletstatsreceiver metric is enabled.	Compute manually as `k8s.pod.cpu.usage / <limit>` from k8sclusterreceiver, or use absolute usage trending against a baseline.
`k8s.node.condition_memory_pressure`	Gated behind k8sclusterreceiver `node_conditions_to_report` (default omits this).	Compare `k8s.node.memory.usage` against `k8s.node.allocatable_memory` , or look for `Evicted` events on the node.

If a fall-back is used, note it in the synthesis (e.g. `(via memory.usage; limit_utilization not collected)`) so the reader knows the signal is indirect.
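The manual fall-back for a missing `*_limit_utilization` field can be sketched as follows. This is a minimal illustration in Python, not part of the collector or of any Elastic tooling; the function name `limit_utilization` and the exact note strings are assumptions made for the example.

```python
from typing import Optional, Tuple


def limit_utilization(
    reported: Optional[float],
    usage: Optional[float],
    limit: Optional[float],
) -> Tuple[Optional[float], str]:
    """Return (ratio, provenance note) for a limit-utilization signal.

    Prefers the collected ratio; otherwise derives usage / limit and says so.
    """
    if reported is not None:
        return reported, "direct"
    if usage is not None and limit:  # a limit of None or 0 means no limit is set
        return usage / limit, "(via usage / limit; limit_utilization not collected)"
    return None, "(limit_utilization not collected; no limit available)"
```

Carrying the note alongside the number makes it hard to drop the "indirect signal" caveat when the value reaches the synthesis.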

ES|QL gotchas

Before writing queries, know these. Each of them silently produces wrong answers rather than failing loudly.

`VALUES()` returns scalar for single distinct value, array for multiple. Templating that assumes array shape (e.g. `| first`) extracts the first character of the string when scalar. Use `MV_FIRST(VALUES(...))` or handle both.

`PERCENTILE` does not work on OTel `histogram` type (as of 8.15). For APM duration percentiles, use `AVG` on the `aggregate_metric_double` summary field (`AVG(transaction.duration.summary)` divides sum by value_count). For true percentiles, fall back to Kibana Query DSL.

`COUNT(agg_metric_double)` returns `value_count` (events), not doc count. `SUM(field)` gives the sum component; `AVG(field)` gives sum/value_count. Do not use `SUM(transaction.duration.summary)` as an event-count proxy; it returns total duration.

K8s metrics use flat OTel field paths in ES|QL. `k8s.pod.name`, not `resource.attributes.k8s.pod.name`. The nested form is for raw log documents.
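The "handle both" option for `VALUES()` can also live on the client side. The sketch below assumes query results arrive in Python as decoded JSON values; the helper name `as_list` is hypothetical.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Normalize a VALUES() result that may be null, a scalar, or an array."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Indexing the raw scalar would yield "O"; normalizing first yields the reason.
first_reason = as_list("OOMKilled")[0]
```

Normalizing once at the boundary keeps every later step free to assume list shape.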

Failure-mode taxonomy

Vocabulary for classification, not a decision tree. Use the pivotal-signal column to recognize which mode you're looking at; use "Investigate" to know what else should corroborate.

Workload layer

Mode	Pivotal signal	Investigate
OOMKilled	`last_terminated_reason == "OOMKilled"` + `memory_limit_utilization → 1.0`	Monotonic rise (leak) vs. load-driven spike? Compare current trend to 7-day baseline. Check heap metrics (JVM, Go, Node) for GC pressure.
CPU throttling → Error exit	`cpu_limit_utilization > 1.0` + `last_terminated_reason == "Error"`	Liveness/readiness probe timeouts from CFS throttling. Average CPU can look fine (40–60%) while p99 throttle is severe. Check probe timeouts vs observed startup/health latency.
Liveness probe misconfiguration	Restarts without resource pressure; `initialDelaySeconds` < startup time	K8s events show `Unhealthy` / `Killing` . `kubectl logs --previous` typically shows healthy startup before kill.
CrashLoopBackOff (generic)	`BackOff` events + rising `k8s.container.restarts`	Branch on `last_terminated_reason` — this is a meta-mode. OOMKilled → memory path; Error → logs + throttling; ContainerCannotRun → image/exec.
ImagePullBackOff	K8s events `Failed` with image name + `429` or `not found`	Registry rate limit? Missing tag? Wrong imagePullSecret? Check recency of `Pulling` / `Pulled` events.
Stuck rollout	New pods `Pending` /not-Ready > `progressDeadlineSeconds` ; old pods still serving	Check `k8s.deployment.available` vs `.desired` . Admission rejection? Readiness probe failing on new pods? HPA not scaling?
Termination signal race	Brief 5xx bursts correlated with rolling deploys	Endpoint removal races termination. New requests can hit the pod after SIGTERM starts. NGINX gotcha: `STOPSIGNAL SIGTERM` triggers fast shutdown, not graceful — use `STOPSIGNAL SIGQUIT` for graceful drain. Check ingress 502 rate vs rollout timing.

Node layer

Mode	Pivotal signal	Investigate
Node NotReady cascade	`k8s.node.condition_ready == 0` + mass `Evicted` events	Memory pressure? Disk pressure? Network partition from API server? Inspect kubelet logs, `k8s.node.condition_*` history.
Resource eviction	`status_reason == "Evicted"` + `condition_memory_pressure == 1` on node	Node-level noisy neighbor. QoS order: BestEffort → Burstable → Guaranteed. Identify which pod drove node memory up.
Node affinity/selector conflict	Mass unschedulable pods after label change	K8s events show `FailedScheduling` . Often triggered by cluster upgrades (e.g. `node-role.kubernetes.io/master` → `control-plane` ).

Control plane

Mode	Pivotal signal	Investigate
etcd I/O cascade	API server latency spike + cluster-wide kubelet heartbeat failures	Disk IOPS, fsync latency (must be <10ms). Cloud-burst-credit exhaustion is common.
Admission webhook block	Mass `FailedCreate` across namespaces; deployments frozen	`failurePolicy:Fail` webhook pod crashed. Check webhook pod health + API server TCP connection cache (caches dead connections ~15 min).
Priority preemption storm	Production pods terminating with `preempted-by` annotation	New `PriorityClass` with `globalDefault:true` caused cascade. Check `kube-scheduler` events.
PDB drain deadlock	Node drain stuck indefinitely; HTTP 429 from Eviction API	PDB `minAvailable` / `maxUnavailable` too strict. No default drain timeout. Manual PDB deletion unblocks.

Autoscaling & admission

Mode	Pivotal signal	Investigate
HPA unready-pod dampening	Load rising, HPA not scaling; unready pods included in calculation	HPA averages CPU across all replicas including unready (0% contribution). Check `k8s.hpa.current_replicas` vs `.desired_replicas` + pod readiness.
Resource quota silent 403	Deployment stuck at n-1/n; `FailedCreate` on ReplicaSet	Namespace quota exhausted (often CronJob accumulation). Check `k8s.resource_quota.used` vs `.hard_limit` .

Networking

Mode	Pivotal signal	Investigate
StatefulSet split-brain	Duplicate pod identities across partitioned nodes	Network partition + eviction timeout race. Two instances of same ordinal running. No fencing by default.
CoreDNS OOMKill	CoreDNS restarts + cluster-wide DNS timeouts in app logs	Default CoreDNS memory (~170Mi) insufficient under query amplification (ndots:5, each external lookup → ~10 lookups).

When classification is ambiguous

Real incidents often match two modes. Examples:

OOMKilled pod with simultaneous CPU throttling — memory usually drives the kill, but verify by checking whether memory or CPU hit limit first.
Stuck rollout with HPA dampening and resource quota near-exhaustion — both can freeze a deploy. Check which constraint is binding.
Node NotReady with pods that were already crashing — the node issue may be incidental.

When two modes fit, name both in the synthesis and say which one you believe is causal and why. Do not force a single hypothesis when the evidence supports two.

Signal interpretation

Memory

Monotonic rise over 30–60 min → leak. Check GC metrics for the language: JVM `jvm.gc.duration`, Go `process.runtime.go.gc.pause_ns`, Node `v8js_gc_duration`. Rising GC frequency/pause with stable live-set is the canonical leak signature.
Diurnal / load-correlated spikes → load-driven, not leak. Consider HPA tuning or limit increase.
Hits 1.0, then restart → OOMKilled confirmed. Exit code 137 (SIGKILL) in app logs is consistent with this.
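One way to make "monotonic rise" concrete is to count how many consecutive samples fail to decrease. The sketch below is a heuristic under stated assumptions: `samples` is a time-ordered list of `memory_limit_utilization` values covering the 30–60 minute window, and the 0.9 step fraction and 0.05 minimum rise are illustrative thresholds, not values from this guide.

```python
from typing import Sequence


def looks_monotonic(samples: Sequence[float],
                    min_step_fraction: float = 0.9,
                    min_total_rise: float = 0.05) -> bool:
    """True when nearly every step is non-decreasing and the net rise is material."""
    if len(samples) < 3:
        return False  # too few points to call a trend
    steps = list(zip(samples, samples[1:]))
    non_decreasing = sum(1 for prev, cur in steps if cur >= prev)
    rose_enough = samples[-1] - samples[0] >= min_total_rise
    return non_decreasing / len(steps) >= min_step_fraction and rose_enough
```

A `True` result only says the shape is leak-like; the 7-day baseline comparison is still needed before calling it a leak.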

CPU

`cpu_limit_utilization > 1.0` sustained → CFS throttling. Node has spare CPU; the pod is quota-blocked.
Symptoms of throttling (not the throttle metric itself): liveness probe timeouts, p99 latency 4–16× p50, queue backpressure upstream, Error-reason container terminations.
Average can look healthy while p95 is throttled. Do not trust average alone.
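The gap between average and tail can be shown with a small worked example. This is a sketch only: `cpu_summary` is a hypothetical helper, the sample values are invented for illustration, and the p95 uses the nearest-rank method rather than whatever interpolation a query engine applies.

```python
from statistics import mean
from typing import Dict, Sequence


def cpu_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize cpu_limit_utilization samples with avg, nearest-rank p95, and max."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n) in integer math
    return {"avg": mean(ordered), "p95": ordered[rank - 1], "max": ordered[-1]}


# 18 quiet samples and 2 bursts: the average looks fine, the tail does not.
summary = cpu_summary([0.4] * 18 + [1.3, 1.4])
```

Here the average lands near 0.5 while p95 and max are both above 1.0, which is the pattern the guidance above warns about.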

Restart patterns

`restarts > 0` recently → workload has been restarting. Don't read magnitude into the count (see "Restart count is boolean"); confirm the pattern from K8s `Killing` / `BackOff` event timestamps in `logs-k8seventsreceiver.otel-*`.
Restarts correlated with memory pressure (`memory_limit_utilization → 1.0`) → OOMKilled path.
Restarts without memory/CPU pressure → probe misconfig, app bug, or startup dependency failure. Pull events for `Unhealthy` and `Killing`.
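Deriving the restart pattern from event timestamps rather than from the counter might look like the sketch below. It assumes the `Killing` / `BackOff` event timestamps have already been fetched and parsed into `datetime` objects; both function names are hypothetical.

```python
from datetime import datetime
from typing import List, Sequence


def restart_intervals(event_times: Sequence[datetime]) -> List[float]:
    """Seconds between consecutive Killing/BackOff events, oldest first."""
    ordered = sorted(event_times)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def is_restarting(restarts: int) -> bool:
    """Collapse k8s.container.restarts to the only distinction worth trusting."""
    return restarts > 0
```

Growing intervals between events are what a backoff looks like in this view; the counter itself never enters the calculation.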

Termination reasons

`OOMKilled` → memory path.
`Error` → non-zero exit. Check app logs; if empty/minimal, check CPU throttling before attributing to app logic.
`Completed` → ran to completion. Normal for Jobs/CronJobs/init containers; anomalous otherwise.
`ContainerCannotRun` → runtime/image/exec issue. Check image pull events.
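The reason-to-path mapping above is small enough to encode as a lookup table. The sketch below is illustrative only; `NEXT_STEP` and `next_step` are hypothetical names, and the handling of unknown reasons is an assumption rather than something this guide prescribes.

```python
NEXT_STEP = {
    "OOMKilled": "memory path",
    "Error": "app logs, then CPU throttling if logs are empty or minimal",
    "Completed": "expected for Jobs/CronJobs/init containers; anomalous otherwise",
    "ContainerCannotRun": "image pull events (runtime/image/exec issue)",
}


def next_step(reason: str) -> str:
    """Look up where to go next; unknown reasons are reported, not guessed."""
    return NEXT_STEP.get(reason, f"unrecognized reason {reason!r}; inspect K8s events")
```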

Investigation flow

An investigation is not a checklist. The sections below describe a typical arc — compress, skip, or revisit them based on what you find. Terminate as soon as you have enough evidence to synthesize at a known confidence. Chasing signals past the point of diminishing returns is a failure mode, not thoroughness.

Orient

Resolve the target: `k8s.pod.name`, `k8s.namespace.name`, optionally `k8s.deployment.name` and `service.name`. If no time window is given, default to the last hour for pod-level investigations, last 2 hours for event correlation, last 6 hours for ongoing/unresolved incidents.

If the alert payload already tells you the failure mode (e.g., it fires specifically on `OOMKilled`), note that and skip classification; move to confirmation and baseline comparison.
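The default time windows can be captured in a small table so that every query in an investigation uses the same values. This is a sketch; the keys `pod`, `events`, and `ongoing_incident` and the helper `resolve_window` are names invented for the example.

```python
from datetime import timedelta
from typing import Optional

DEFAULT_WINDOWS = {
    "pod": timedelta(hours=1),               # pod-level investigations
    "events": timedelta(hours=2),            # event correlation
    "ongoing_incident": timedelta(hours=6),  # ongoing/unresolved incidents
}


def resolve_window(kind: str, requested: Optional[timedelta] = None) -> timedelta:
    """Use the caller's window when given, else the default for this kind."""
    if requested is not None:
        return requested
    return DEFAULT_WINDOWS[kind]
```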

Characterize

Get the shape of the workload's recent behavior: restart count, termination reasons, phase, utilization. One or two queries usually suffice.

esql

FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts),
        term_reasons = VALUES(k8s.container.status.last_terminated_reason),
        phase = MAX(k8s.pod.phase)

esql

FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>" AND @timestamp > NOW() - 15 minutes
| STATS mem_pct = ROUND(MAX(k8s.pod.memory_limit_utilization) * 100, 1),
        cpu_pct = ROUND(MAX(k8s.pod.cpu_limit_utilization) * 100, 1)

Classify

Use the taxonomy. The pivotal signal should match; the "Investigate" column tells you what corroboration to seek.

When two modes fit, note both and proceed with the one that has the stronger pivotal signal. You may revise during corroboration.

Corroborate

Pull the evidence your classification predicts you'll find. Typical sources:

K8s events for the namespace and window:

esql

FROM logs-k8seventsreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 2 hours
  AND attributes.k8s.event.reason IN (
    "BackOff", "Killing", "Unhealthy", "Failed",
    "FailedScheduling", "Evicted", "SuccessfulRescale",
    "Pulling", "Pulled", "Started", "Created"
  )
| SORT @timestamp DESC
| KEEP @timestamp, attributes.k8s.event.reason, body.text, k8s.object.name
| LIMIT 30

Application logs if available: look at the 200 most recent lines before the termination timestamp. If absent, flag `no_logs_available`; do not invent a log pattern.

APM if the pod runs an instrumented service: resolve `service.name` from pod resource attributes for later correlation. SLO / latency / error-rate analysis itself is APM-layer work and out of scope for this skill.

Baseline comparison: for utilization-based findings, compare current values to 7-day-prior at the same hour-of-day. "High memory" is meaningful only relative to what's normal for this workload.
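The baseline window is simply the investigation window shifted back by seven days, which preserves the hour-of-day. A minimal sketch, assuming timestamps are timezone-aware UTC `datetime` values; both helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def baseline_window(start: datetime, end: datetime,
                    days_back: int = 7) -> Tuple[datetime, datetime]:
    """Shift an investigation window back by whole days, keeping hour-of-day."""
    shift = timedelta(days=days_back)
    return start - shift, end - shift


def ratio_to_baseline(current: float, baseline: float) -> float:
    """How many times the baseline the current value is (inf if baseline is 0)."""
    return current / baseline if baseline else float("inf")
```

Running the same aggregation over both windows and reporting the ratio gives "high memory" the relative meaning the text asks for.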

Check for upstream cause (conditional)

Only pursue if the symptom pattern suggests it. Threshold: upstream error rate >5× baseline or latency >3× baseline, AND degradation started before the symptom on the target service. Co-symptoms do not establish causation.

If `metrics-service_destination.1m.otel-default` has no rows for the service, report `insufficient_dependency_data`, not "upstreams healthy."
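The upstream threshold combines a magnitude test with a precedence test, and both must hold. The sketch below encodes that rule; the function name and parameters are hypothetical, and the ratios are assumed to be precomputed as current value divided by baseline.

```python
from datetime import datetime


def upstream_is_plausible_cause(
    upstream_error_ratio: float,    # upstream error rate / its baseline
    upstream_latency_ratio: float,  # upstream latency / its baseline
    upstream_onset: datetime,
    target_onset: datetime,
) -> bool:
    """Apply the >5x error or >3x latency threshold AND the precedence rule."""
    large_delta = upstream_error_ratio > 5 or upstream_latency_ratio > 3
    return large_delta and upstream_onset < target_onset
```

A large delta that starts at the same time as, or after, the target's symptom stays a co-symptom under this rule.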

Check for recent change (conditional)

`SuccessfulCreate` / `Pulled` events in the last 2 hours often correlate with deploys. `logs-k8sobjectsreceiver.otel-*` shows configmap/secret/deployment spec changes. A change within 15 minutes of the symptom onset is a strong correlation, but still a correlation; verify it plausibly explains the mode you've classified.
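The 15-minute proximity check can be sketched as below. One assumption is made loudly here: only changes that land at or before symptom onset are counted, since a change after onset cannot explain it. The function name `change_near_onset` is hypothetical.

```python
from datetime import datetime, timedelta


def change_near_onset(change_time: datetime, onset: datetime,
                      window: timedelta = timedelta(minutes=15)) -> bool:
    """True when the change landed at or before onset, within the window."""
    lead = onset - change_time
    return timedelta(0) <= lead <= window
```

A `True` result flags a candidate for verification; it does not establish that the change caused the classified mode.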

Synthesize and stop

Synthesize as soon as you have enough evidence to support a hypothesis at known confidence. You do not need to complete every section above — investigation terminates when either:

You have a high-confidence hypothesis with corroboration, or
You have a low/medium-confidence hypothesis and further queries are unlikely to change the picture (e.g., logs are unavailable, APM isn't instrumented, no recent changes found).

Synthesis

Default structure:

text

HYPOTHESIS (confidence: high | medium | low)
<One paragraph: service, symptom, most likely cause. Name the failure mode from the taxonomy.>

EVIDENCE
- <Finding from characterization, with the concrete metric or value.>
- <Finding from events / logs / APM.>
- <Finding from baseline comparison, dependency check, or change correlation if pursued.>

CONFIDENCE NOTE
<Only if not 'high'. What specific evidence is missing or ambiguous.>

RECOMMENDED NEXT STEPS
1. <Most actionable — typically a config check or metric to observe.>
2. <Secondary.>

DOWNSTREAM IMPACT
<Services depending on this workload, or 'No downstream dependencies identified.'>

When two hypotheses are live: replace HYPOTHESIS with COMPETING HYPOTHESES; list both, say which you lean toward and why, and list the evidence that would disambiguate them.

When no incident is found (symptom resolved, or alert appears spurious): say so directly. `ALERT FIRED BUT SYSTEM APPEARS HEALTHY` is a valid output. List what you checked and what you didn't find.

Confidence calibration

Start at high and downgrade based on what's missing:

Downgrade to medium if: primary signal is clear but corroboration is missing (no logs, no APM, no baseline comparison possible). Or: two modes fit and you can't disambiguate.
Downgrade to low if: only a single signal supports the hypothesis, signals conflict, or the mode requires evidence you couldn't fetch.

Never return high when application log data was absent and the hypothesis depends on application behavior. Absence of evidence does not corroborate a hypothesis.
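The downgrade rules can be read as an ordered set of checks, strictest first. The sketch below is one possible encoding, not a prescribed algorithm; the function name and every parameter are assumptions introduced for illustration.

```python
def calibrate_confidence(
    supporting_signals: int,
    corroborated: bool,
    signals_conflict: bool = False,
    required_evidence_missing: bool = False,
    ambiguous_modes: bool = False,
    needs_app_logs: bool = False,
    app_logs_available: bool = True,
) -> str:
    """Start at 'high' and downgrade based on what is missing."""
    if supporting_signals <= 1 or signals_conflict or required_evidence_missing:
        return "low"
    if not corroborated or ambiguous_modes:
        return "medium"
    if needs_app_logs and not app_logs_available:
        return "medium"  # never 'high' without logs for an app-behavior hypothesis
    return "high"
```

Checking the "low" conditions first means a single-signal hypothesis cannot be lifted to "medium" by an unrelated corroborating detail.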

Query recipes

Most-restarting pods in a namespace

esql

FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>" AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts) BY k8s.pod.name, k8s.container.status.last_terminated_reason
| WHERE restarts > 0
| SORT restarts DESC
| LIMIT 20

CPU throttling check for a pod

esql

FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>" AND @timestamp > NOW() - 30 minutes
| STATS max_cpu_ratio = ROUND(MAX(k8s.pod.cpu_limit_utilization), 2),
        avg_cpu_ratio = ROUND(AVG(k8s.pod.cpu_limit_utilization), 2),
        max_cpu_cores = ROUND(MAX(k8s.pod.cpu.usage), 3)

Sustained ratio >1.0 = throttling. Transient >1.0 with avg <0.5 is usually benign burst.
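As a rough sketch, the interpretation rule can be applied to the two ratios the query returns. The function name and the return labels are hypothetical, and treating an average above 1.0 as "sustained" is an assumption: the query only yields window-level aggregates, so anything in between is reported as inconclusive rather than guessed.

```python
def classify_cpu(max_ratio: float, avg_ratio: float) -> str:
    """Interpret max/avg cpu_limit_utilization over the query window."""
    if max_ratio <= 1.0:
        # The pod never exceeded its CPU limit in the window.
        return "no throttling signal"
    # Assumption: an average above 1.0 means the ratio stayed high throughout.
    if avg_ratio > 1.0:
        return "sustained throttling"
    # Peak above the limit but a low average: usually a benign burst.
    if avg_ratio < 0.5:
        return "benign burst (likely)"
    # Peak above the limit with a middling average: two numbers cannot settle it.
    return "inconclusive: inspect the time series"
```

For the inconclusive case, bucketing the same metric by time shows whether the ratio stayed above 1.0 or only spiked.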

Nodes under memory pressure (right now)

```esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE @timestamp > NOW() - 15 minutes AND k8s.node.condition_memory_pressure == 1
| STATS ts = MAX(@timestamp) BY k8s.node.name
| SORT ts DESC
```

Admission denials (webhook or quota) last hour

```esql
FROM logs-k8seventsreceiver.otel-*
| WHERE @timestamp > NOW() - 1 hour
  AND (attributes.k8s.event.reason == "FailedCreate"
       OR body.text LIKE "*admission webhook*"
       OR body.text LIKE "*exceeded quota*")
| SORT @timestamp DESC
| KEEP @timestamp, k8s.namespace.name, attributes.k8s.event.reason, body.text
| LIMIT 30
```

Firing K8s alerts

```text
GET /api/alerting/rules/_find?search=k8s&search_fields=tags&filter=alert.attributes.executionStatus.status:active
```

Examples

"Why is my pod CrashLoopBackOff-ing?"

Characterize first: get restart count, termination reason, memory and CPU utilization.

- If `last_terminated_reason == "OOMKilled"` and memory utilization hit 1.0 → memory path. Corroborate with 7-day baseline: monotonic rise over days = leak; spiky = load-driven. Check GC metrics if language is known.
- If `last_terminated_reason == "Error"` and `cpu_limit_utilization > 1.0` → CPU throttling path. Corroborate with liveness probe config (initialDelaySeconds, timeoutSeconds) and K8s events for `Unhealthy`.
- If `last_terminated_reason == "Error"` and CPU is fine → application-logic path. Pull recent logs before termination.
- If `last_terminated_reason == "ContainerCannotRun"` → image/exec path. Check K8s events for `Failed` pull events.

Synthesize with appropriate confidence. If logs were unavailable on the Error path, downgrade to medium and say so.
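The branching for this example can be sketched as a lookup on the characterization results. The function names, path labels, and the `unclassified` fallback are hypothetical. The thresholds mirror the rules listed for this example, and the confidence helper covers only the log-availability downgrade, not the full calibration.

```python
def triage(reason: str, cpu_limit_utilization: float, memory_limit_utilization: float) -> str:
    """Map termination reason plus utilization to an investigation path."""
    if reason == "OOMKilled" and memory_limit_utilization >= 1.0:
        return "memory"
    if reason == "Error":
        # Error with CPU over the limit points at throttling (e.g. probe timeouts);
        # Error with normal CPU points at the application itself.
        return "cpu-throttling" if cpu_limit_utilization > 1.0 else "application-logic"
    if reason == "ContainerCannotRun":
        return "image/exec"
    # Anything else is outside the four paths in this example.
    return "unclassified"


def starting_confidence(reason: str, logs_available: bool) -> str:
    """Apply only the 'no logs on the Error path' downgrade."""
    if reason == "Error" and not logs_available:
        return "medium"
    return "high"
```

For instance, `triage("Error", 1.3, 0.4)` selects the CPU throttling path, and `starting_confidence("Error", False)` caps the synthesis at medium before any other downgrade is considered.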

"Is my rollout stuck?"

Authoritative signal: `k8s.deployment.available < k8s.deployment.desired` for > 10 minutes.

Diagnose the constraint:

- K8s events on the new ReplicaSet: `FailedCreate` → admission rejection (quota, webhook, PSP). `FailedScheduling` → no node fits.
- New-pod utilization: all at 0% memory → never started (image pull failure); high CPU with low memory → slow startup hitting readiness probe.
- HPA state: stable `current_replicas < desired_replicas` under load → unready-pod dampening.
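A minimal sketch of the authoritative-signal check, assuming the deployment metrics have already been fetched as `(timestamp, available, desired)` samples. The sample shape, the function name, and the `CONSTRAINT_BY_EVENT_REASON` table are hypothetical helpers; the table only restates the event-to-constraint mapping from this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Sample = Tuple[datetime, int, int]  # (timestamp, available, desired)

# Event reason on the new ReplicaSet -> constraint.
CONSTRAINT_BY_EVENT_REASON = {
    "FailedCreate": "admission rejection (quota, webhook, PSP)",
    "FailedScheduling": "no node fits",
}


def rollout_stuck(samples: Iterable[Sample],
                  threshold: timedelta = timedelta(minutes=10)) -> bool:
    """True if available < desired held continuously for longer than
    `threshold`, up to and including the most recent sample."""
    deficit_start: Optional[datetime] = None
    latest: Optional[datetime] = None
    for ts, available, desired in sorted(samples):
        latest = ts
        if available < desired:
            # Remember when the current uninterrupted deficit began.
            if deficit_start is None:
                deficit_start = ts
        else:
            # Deployment caught up; any earlier deficit no longer counts.
            deficit_start = None
    if deficit_start is None or latest is None:
        return False
    return latest - deficit_start > threshold
```

Resetting on recovery is deliberate: a rollout that briefly reached its desired count and then dipped again is treated as a new deficit rather than a continuation of the old one.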

"Alert fired but everything looks healthy"

Possible and worth naming explicitly. Check:

- Has the symptom resolved? Compare current utilization/restart rate to the alert trigger point.
- Was the alert a transient spike that's already decayed?
- Is the alert tuned appropriately (e.g., too-short evaluation window)?

Output: `ALERT FIRED BUT SYSTEM APPEARS HEALTHY` with what you checked. Recommend alert tuning if the pattern is recurrent.
