observability-k8s-investigation

Kubernetes Investigation

Diagnose Kubernetes issues using OTel telemetry collected via EDOT (Elastic Distribution of OpenTelemetry) and the kube-stack collector. Correlate cluster state, pod runtime metrics, K8s events, application logs, and APM to identify root cause across the workload, node, and control-plane layers.

Scope

In scope: OTel-receiver-namespaced indices (`metrics-kubeletstatsreceiver.otel-*`, `metrics-k8sclusterreceiver.otel-*`, `logs-k8seventsreceiver.otel-*`, `logs-k8sobjectsreceiver.otel-*`) and OTel semantic conventions (`k8s.pod.name`, `k8s.namespace.name`, `k8s.container.restarts`).
Out of scope:
  • The legacy Elastic Agent Kubernetes integration (`metrics-kubernetes.*`, `logs-kubernetes.*`, `kubernetes.*` fields). This integration is being deprecated; do not author queries against these paths.
  • APM-layer analysis (service SLO breaches, transaction error rates, upstream dependency health). Different domain: once a K8s root cause is ruled in or out, APM investigation continues outside this skill.
  • Cluster provisioning, capacity planning, cost optimization. Different domain.

Guidelines

These apply to every investigation. When in doubt, re-read them before writing the synthesis.
Absence of evidence is not evidence. Do not confabulate from empty results. If log queries return 0 rows, logs are likely not collected or the pod has no recent lines; this does not mean "dependency unavailable" or any other specific failure mode. Report `no_logs_available` and weight remaining signals accordingly.
Empty dependency data ≠ upstream healthy. Services without APM instrumentation (load generators, workers) emit no destination metrics. Report `insufficient_dependency_data`, not "upstreams OK."
Co-symptoms are not causes. Two services degrading simultaneously usually share an upstream, not a causal link. Only attribute causation when (a) one service's degradation clearly precedes the other's, and (b) the delta is large (>5× error rate, >3× latency).
OOMKilled ≠ memory leak by default. The limit may simply be undersized for the workload's working set. Compare against a 7-day baseline at the same hour-of-day before claiming a leak.
Error-termination ≠ application bug by default. Check `k8s.pod.cpu_limit_utilization` first. CFS throttling driving liveness probe timeouts is the most common misdiagnosis in this space.
Average CPU hides throttling. A pod can look healthy at 40–60% average `cpu_limit_utilization` while being throttled severely at p99. Linux enforces CPU limits in 100ms periods; bursty workloads hit quota mid-period and stall. Look at max and p95, not just average.
Restart count is boolean, not a counter. `k8s.container.restarts` is pulled directly from the K8s API and may be pruned by the kubelet at any time, so the absolute value is unreliable. Treat it as `== 0` (no recent restarts) vs `> 0` (recently restarting); do not derive backoff timing or "linear vs exponential" patterns from it. Confirm the restart pattern via K8s `Killing` / `BackOff` events instead.
Prefer to report uncertainty over manufacturing confidence. If the evidence is ambiguous, the synthesis should say so. Competing hypotheses are a valid output.
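The arithmetic behind "Average CPU hides throttling" can be illustrated with a toy model. This is a deliberately simplified sketch, not the kernel's CFS bandwidth controller: it assumes a fixed quota per 100ms period and no carry-over of unmet demand, and the function name `simulate_cfs` and the demand pattern are invented for illustration.

```python
def simulate_cfs(demand_ms, quota_ms):
    """Toy model of per-period CPU quota enforcement.

    demand_ms: CPU time (ms) the workload wants in each 100ms period.
    quota_ms:  CPU time (ms) the limit allows per period.
    Assumption: unmet demand is dropped rather than carried over.
    """
    used = [min(d, quota_ms) for d in demand_ms]
    throttled = [d > quota_ms for d in demand_ms]
    avg_utilization = sum(used) / (quota_ms * len(demand_ms))
    throttled_share = sum(throttled) / len(demand_ms)
    return avg_utilization, throttled_share


# Hypothetical bursty workload: mostly idle, one burst every fourth period.
demand = [10, 10, 10, 120] * 25  # 100 periods = 10 seconds
avg, share = simulate_cfs(demand, quota_ms=50)  # quota for a 0.5 CPU limit
print(f"average utilization: {avg:.0%}, throttled periods: {share:.0%}")
# prints: average utilization: 40%, throttled periods: 25%
```

Under these assumptions the pod averages 40% of its limit while a quarter of its periods hit the quota, which is why the max and p95 are the more telling statistics.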

Indices and fields

Where to look

| Signal | Index pattern | Use |
|---|---|---|
| Pod/container runtime | `metrics-kubeletstatsreceiver.otel-*` | CPU, memory, network, filesystem. Utilization ratios. |
| Cluster state | `metrics-k8sclusterreceiver.otel-*` | Restarts, phase, last-terminated reason, HPA, quota, node condition |
| K8s events | `logs-k8seventsreceiver.otel-*` | Killing, BackOff, FailedScheduling, Evicted, image pull events |
| K8s object snapshots | `logs-k8sobjectsreceiver.otel-*` | Deployment/service/configmap state over time |
| Application logs | `logs-*.otel-*` | `body.text`, `severity_text`, filtered by `k8s.pod.name` |
| APM | `traces-*.otel-*`, `metrics-service_*.otel-default` | Correlate via `service.name` + K8s resource attrs |
| ML anomalies | `.ml-anomalies-*` | Memory-growth, restart-rate, throttle jobs (if configured) |

Key fields

Flat OTel paths work in ES|QL. Prefer the flat form for readability; the nested `resource.attributes.*` form is for raw log documents only.
| Field | Index | What it is |
|---|---|---|
| `k8s.pod.name` | all k8s | Pod name |
| `k8s.namespace.name` | all k8s | Namespace |
| `k8s.container.name` | all k8s | Container within pod |
| `k8s.deployment.name` | k8sclusterreceiver + others | Parent deployment |
| `k8s.pod.phase` | k8sclusterreceiver | Pending=1/Running=2/Succeeded=3/Failed=4/Unknown=5 |
| `k8s.container.restarts` | k8sclusterreceiver | Total container restart count |
| `k8s.container.status.last_terminated_reason` | k8sclusterreceiver | `OOMKilled`, `Error`, `Completed`, `ContainerCannotRun` |
| `k8s.pod.status_reason` | k8sclusterreceiver | Pod-level reason (`Evicted`, `NodeLost`) |
| `k8s.pod.memory_limit_utilization` | kubeletstatsreceiver | 0.0–1.0+ (can exceed 1 transiently before OOM) |
| `k8s.pod.cpu_limit_utilization` | kubeletstatsreceiver | 0.0–N (frequently >1 under CFS throttling) |
| `k8s.pod.memory.usage` / `.working_set` | kubeletstatsreceiver | Bytes |
| `k8s.node.condition_memory_pressure` | k8sclusterreceiver | 1 = pressure, 0 = ok |
| `k8s.node.condition_ready` | k8sclusterreceiver | 0 = NotReady |
| `k8s.hpa.current_replicas` / `.desired_replicas` | k8sclusterreceiver | HPA state |
| `attributes.k8s.event.reason` | k8seventsreceiver | Event reason (filter on this) |
| `body.text` | k8seventsreceiver / logs | Event message / log message |
| `k8s.object.name` | k8seventsreceiver | involvedObject name (log attribute, use flat form) |

Field availability

Several fields above are off by default in stock kube-stack collectors and require explicit configuration. Verify presence before relying on them; if absent, fall back as noted and call out the substitution in the synthesis.
| Field | Why it might be missing | Fall-back |
|---|---|---|
| `k8s.container.status.last_terminated_reason` | Optional metric in k8sclusterreceiver; gated behind `metrics_collected.metadata` config. | Infer from K8s `Killing` / `OOMKilling` events in `logs-k8seventsreceiver.otel-*` and exit codes in app logs. |
| `k8s.pod.status_reason` | Same: optional metric on k8sclusterreceiver. | Infer from events: `Evicted`, `NodeLost`, `Preempted`. |
| `k8s.pod.cpu_limit_utilization` / `memory_limit_utilization` | Only emitted when the pod has the corresponding limit set and the kubeletstatsreceiver metric is enabled. | Compute manually as `k8s.pod.cpu.usage / <limit>`, taking the limit from k8sclusterreceiver, or use absolute usage trending against a baseline. |
| `k8s.node.condition_memory_pressure` | Gated behind k8sclusterreceiver `node_conditions_to_report` (default omits this). | Compare `k8s.node.memory.usage` against `k8s.node.allocatable_memory`, or look for `Evicted` events on the node. |
If a fall-back is used, note it in the synthesis (e.g. `(via memory.usage; limit_utilization not collected)`) so the reader knows the signal is indirect.
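The manual usage/limit fall-back, together with the note that should accompany it, might look like the sketch below. The helper name `limit_utilization` and the wording of the notes are assumptions made for illustration; the inputs are expected to come from whatever usage and limit values the queries return.

```python
def limit_utilization(usage, limit):
    """Return (ratio, note) for a manually computed utilization.

    usage and limit must share a unit (cores for CPU, bytes for memory).
    A missing or non-positive limit yields (None, note): no ratio exists.
    """
    if limit is None or limit <= 0:
        return None, "(no limit set; use absolute usage against a baseline)"
    return usage / limit, "(computed as usage / limit; limit_utilization not collected)"


ratio, note = limit_utilization(usage=0.45, limit=0.5)
print(round(ratio, 2), note)
```

Returning the note alongside the ratio makes it harder to drop the "indirect signal" caveat when the value is later copied into the synthesis.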

ES|QL gotchas

Before writing queries, know these. Each of them silently produces wrong answers rather than failing loudly.
`VALUES()` returns a scalar for a single distinct value and an array for multiple. Templating that assumes array shape (e.g. `| first`) extracts the first character of the string when the result is a scalar. Use `MV_FIRST(VALUES(...))` or handle both shapes.
`PERCENTILE` does not work on the OTel `histogram` type (as of 8.15). For APM durations, use `AVG` on the `aggregate_metric_double` summary field instead (`AVG(transaction.duration.summary)` divides sum by value_count); note that this yields a mean, not a percentile. For true percentiles, fall back to Elasticsearch Query DSL.
`COUNT(agg_metric_double)` returns `value_count` (events), not doc count. `SUM(field)` gives the sum component; `AVG(field)` gives sum/value_count. Do not use `SUM(transaction.duration.summary)` as an event-count proxy; it returns total duration.
K8s metrics use flat OTel field paths in ES|QL. Write `k8s.pod.name`, not `resource.attributes.k8s.pod.name`. The nested form is for raw log documents.
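When the scalar-or-array shape of `VALUES()` has to be handled on the client side rather than with `MV_FIRST`, a small normalizer is one option. The helper below is a hypothetical sketch (the name `first_value` is not part of any Elastic client); it assumes the query result has already been decoded into Python values.

```python
def first_value(values_result):
    """Return the first element of an ES|QL VALUES() result.

    Handles the three shapes a client can see: None (no values),
    a scalar (one distinct value), or a list (several distinct values).
    """
    if values_result is None:
        return None
    if isinstance(values_result, list):
        return values_result[0] if values_result else None
    return values_result


print(first_value("OOMKilled"))             # scalar case -> OOMKilled
print(first_value(["Error", "OOMKilled"]))  # array case  -> Error
print("OOMKilled"[0])                       # the bug: naive indexing -> O
```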

Failure-mode taxonomy

Vocabulary for classification, not a decision tree. Use the pivotal-signal column to recognize which mode you're looking at; use "Investigate" to know what else should corroborate.

Workload layer

| Mode | Pivotal signal | Investigate |
|---|---|---|
| OOMKilled | `last_terminated_reason == "OOMKilled"` + `memory_limit_utilization → 1.0` | Monotonic rise (leak) vs. load-driven spike? Compare current trend to 7-day baseline. Check heap metrics (JVM, Go, Node) for GC pressure. |
| CPU throttling → Error exit | `cpu_limit_utilization > 1.0` + `last_terminated_reason == "Error"` | Liveness/readiness probe timeouts from CFS throttling. Average CPU can look fine (40–60%) while p99 throttle is severe. Check probe timeouts vs observed startup/health latency. |
| Liveness probe misconfiguration | Restarts without resource pressure; `initialDelaySeconds` < startup time | K8s events show `Unhealthy` / `Killing`. `kubectl logs --previous` typically shows healthy startup before kill. |
| CrashLoopBackOff (generic) | `BackOff` events + rising `k8s.container.restarts` | Branch on `last_terminated_reason`; this is a meta-mode. OOMKilled → memory path; Error → logs + throttling; ContainerCannotRun → image/exec. |
| ImagePullBackOff | K8s events `Failed` with image name + `429` or `not found` | Registry rate limit? Missing tag? Wrong imagePullSecret? Check recency of `Pulling` / `Pulled` events. |
| Stuck rollout | New pods `Pending`/not-Ready > `progressDeadlineSeconds`; old pods still serving | Check `k8s.deployment.available` vs `.desired`. Admission rejection? Readiness probe failing on new pods? HPA not scaling? |
| Termination signal race | Brief 5xx bursts correlated with rolling deploys | Endpoint removal races termination. New requests can hit the pod after SIGTERM starts. NGINX gotcha: `STOPSIGNAL SIGTERM` triggers fast shutdown, not graceful; use `STOPSIGNAL SIGQUIT` for graceful drain. Check ingress 502 rate vs rollout timing. |

Node layer

| Mode | Pivotal signal | Investigate |
|---|---|---|
| Node NotReady cascade | `k8s.node.condition_ready == 0` + mass `Evicted` events | Memory pressure? Disk pressure? Network partition from API server? Inspect kubelet logs, `k8s.node.condition_*` history. |
| Resource eviction | `status_reason == "Evicted"` + `condition_memory_pressure == 1` on node | Node-level noisy neighbor. QoS eviction order: BestEffort → Burstable → Guaranteed. Identify which pod drove node memory up. |
| Node affinity/selector conflict | Mass unschedulable pods after label change | K8s events show `FailedScheduling`. Often triggered by cluster upgrades (e.g. `node-role.kubernetes.io/master` → `control-plane`). |

Control plane

| Mode | Pivotal signal | Investigate |
|---|---|---|
| etcd I/O cascade | API server latency spike + cluster-wide kubelet heartbeat failures | Disk IOPS, fsync latency (must be <10ms). Cloud-burst-credit exhaustion is common. |
| Admission webhook block | Mass `FailedCreate` across namespaces; deployments frozen | `failurePolicy:Fail` webhook pod crashed. Check webhook pod health + API server TCP connection cache (caches dead connections ~15 min). |
| Priority preemption storm | Production pods terminating with `preempted-by` annotation | New `PriorityClass` with `globalDefault:true` caused cascade. Check `kube-scheduler` events. |
| PDB drain deadlock | Node drain stuck indefinitely; HTTP 429 from Eviction API | PDB `minAvailable` / `maxUnavailable` too strict. No default drain timeout. Manual PDB deletion unblocks. |

Autoscaling & admission

| Mode | Pivotal signal | Investigate |
|---|---|---|
| HPA unready-pod dampening | Load rising, HPA not scaling; unready pods included in calculation | HPA averages CPU across all replicas including unready (0% contribution). Check `k8s.hpa.current_replicas` vs `.desired_replicas` + pod readiness. |
| Resource quota silent 403 | Deployment stuck at n-1/n; `FailedCreate` on ReplicaSet | Namespace quota exhausted (often CronJob accumulation). Check `k8s.resource_quota.used` vs `.hard_limit`. |

Networking

| Mode | Pivotal signal | Investigate |
|---|---|---|
| StatefulSet split-brain | Duplicate pod identities across partitioned nodes | Network partition + eviction timeout race. Two instances of same ordinal running. No fencing by default. |
| CoreDNS OOMKill | CoreDNS restarts + cluster-wide DNS timeouts in app logs | Default CoreDNS memory (~170Mi) insufficient under query amplification (ndots:5, each external lookup → ~10 lookups). |

When classification is ambiguous

Real incidents often match two modes. Examples:
  • OOMKilled pod with simultaneous CPU throttling — memory usually drives the kill, but verify by checking whether memory or CPU hit limit first.
  • Stuck rollout with HPA dampening and resource quota near-exhaustion — both can freeze a deploy. Check which constraint is binding.
  • Node NotReady with pods that were already crashing — the node issue may be incidental.
When two modes fit, name both in the synthesis and say which one you believe is causal and why. Do not force a single hypothesis when the evidence supports two.

Signal interpretation

Memory

  • Monotonic rise over 30–60 min → leak. Check GC metrics for the language: JVM `jvm.gc.duration`, Go `process.runtime.go.gc.pause_ns`, Node `v8js_gc_duration`. Rising GC frequency/pause while the post-GC live set keeps growing is the canonical leak signature; rising GC with a stable live set points to allocation churn instead.
  • Diurnal / load-correlated spikes → load-driven, not leak. Consider HPA tuning or limit increase.
  • Hits 1.0, then restart → OOMKilled confirmed. Exit code 137 (SIGKILL) in app logs consistent.
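The distinction between a steady rise and load-correlated spikes can be approximated with a simple heuristic over utilization samples. This is a rough sketch: the function name `looks_monotonic` and both thresholds are illustrative assumptions, not tuned values, and the sample series are made up.

```python
def looks_monotonic(samples, min_rising_share=0.9, min_net_rise=0.05):
    """Heuristic: do utilization samples rise steadily across the window?

    samples: memory_limit_utilization values in time order (0.0-1.0+).
    Both thresholds are illustrative assumptions.
    """
    if len(samples) < 3:
        return False  # too few points to call a trend
    steps = list(zip(samples, samples[1:]))
    rising_share = sum(b >= a for a, b in steps) / len(steps)
    net_rise = samples[-1] - samples[0]
    return rising_share >= min_rising_share and net_rise >= min_net_rise


leak_like = [0.52, 0.55, 0.59, 0.64, 0.70, 0.77, 0.85, 0.93]
load_like = [0.52, 0.81, 0.55, 0.84, 0.50, 0.79, 0.53, 0.80]
print(looks_monotonic(leak_like), looks_monotonic(load_like))  # True False
```

A `True` result only says the window looks like a steady rise; the 7-day baseline comparison is still needed before calling it a leak.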

CPU

  • `cpu_limit_utilization > 1.0` sustained → CFS throttling. Node has spare CPU; the pod is quota-blocked.
  • Symptoms of throttling (not the throttle metric itself): liveness probe timeouts, p99 latency 4–16× p50, queue backpressure upstream, Error-reason container terminations.
  • Average can look healthy while p95 is throttled. Do not trust average alone.

Restart patterns

  • `restarts > 0` recently → workload has been restarting. Don't read magnitude into the count (see Restart count is boolean); confirm the pattern from K8s `Killing` / `BackOff` event timestamps in `logs-k8seventsreceiver.otel-*`.
  • Restarts correlated with memory pressure (`memory_limit_utilization → 1.0`) → OOMKilled path.
  • Restarts without memory/CPU pressure → probe misconfig, app bug, or startup dependency failure. Pull events for `Unhealthy` and `Killing`.
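Confirming a restart pattern from event timestamps, rather than from the restart count, can be as simple as measuring the gaps between events. The sketch below assumes the timestamps have already been fetched as ISO-8601 strings; the helper name `restart_gaps` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def restart_gaps(event_timestamps):
    """Seconds between consecutive events, given ISO-8601 timestamps in any order."""
    times = sorted(datetime.fromisoformat(ts) for ts in event_timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Hypothetical Killing/BackOff event timestamps for one pod.
events = [
    "2024-05-01T10:00:00+00:00",
    "2024-05-01T10:00:20+00:00",
    "2024-05-01T10:01:00+00:00",
    "2024-05-01T10:02:20+00:00",
]
print(restart_gaps(events))  # [20.0, 40.0, 80.0]
```

Gaps that roughly double are consistent with a back-off loop. Repeated events may be aggregated before they are stored, so treat the gaps as approximate.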

Termination reasons

  • `OOMKilled` → memory path.
  • `Error` → non-zero exit. Check app logs; if empty/minimal, check CPU throttling before attributing to app logic.
  • `Completed` → ran to completion. Normal for Jobs/CronJobs/init containers; anomalous otherwise.
  • `ContainerCannotRun` → runtime/image/exec issue. Check image pull events.
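The branching above can be written as a small lookup. This is a minimal sketch with hypothetical names (`NEXT_STEP`, `route`); the handling of a missing or unrecognized reason is an assumption added for completeness, based on the fall-back described under field availability.

```python
NEXT_STEP = {
    "OOMKilled": "memory path",
    "Error": "app logs, then CPU throttling if logs are empty or minimal",
    "Completed": "normal for Jobs/CronJobs/init containers; anomalous otherwise",
    "ContainerCannotRun": "image pull events (runtime/image/exec issue)",
}


def route(reason):
    """Map a last_terminated_reason value to the next investigation step."""
    if reason is None:
        # Assumption: the field is not collected, so fall back to events.
        return "field not collected; infer from Killing/OOMKilling events"
    return NEXT_STEP.get(reason, "unrecognized reason; report it verbatim")


print(route("OOMKilled"))  # memory path
```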

Investigation flow

An investigation is not a checklist. The sections below describe a typical arc — compress, skip, or revisit them based on what you find. Terminate as soon as you have enough evidence to synthesize at a known confidence. Chasing signals past the point of diminishing returns is a failure mode, not thoroughness.

Orient

Resolve the target: `k8s.pod.name`, `k8s.namespace.name`, optionally `k8s.deployment.name` and `service.name`. If no time window is given, default to the last hour for pod-level investigations, last 2 hours for event correlation, last 6 hours for ongoing/unresolved incidents.
If the alert payload already tells you the failure mode (e.g., it fires specifically on `OOMKilled`), note that and skip classification; move to confirmation and baseline comparison.

Characterize

Get the shape of the workload's recent behavior: restart count, termination reasons, phase, utilization. One or two queries usually suffice.
esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts),
        term_reasons = VALUES(k8s.container.status.last_terminated_reason),
        phase = MAX(k8s.pod.phase)
esql
FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 15 minutes
| STATS mem_pct = ROUND(MAX(k8s.pod.memory_limit_utilization) * 100, 1),
        cpu_pct = ROUND(MAX(k8s.pod.cpu_limit_utilization) * 100, 1)

Classify

Use the taxonomy. The pivotal signal should match; the "Investigate" column tells you what corroboration to seek.
When two modes fit, note both and proceed with the one that has the stronger pivotal signal. You may revise during corroboration.

Corroborate

Pull the evidence your classification predicts you'll find. Typical sources:
K8s events for the namespace and window:
esql
FROM logs-k8seventsreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 2 hours
  AND attributes.k8s.event.reason IN (
    "BackOff", "Killing", "Unhealthy", "Failed",
    "FailedScheduling", "Evicted", "SuccessfulRescale",
    "Pulling", "Pulled", "Started", "Created"
  )
| SORT @timestamp DESC
| KEEP @timestamp, attributes.k8s.event.reason, body.text, k8s.object.name
| LIMIT 30
Application logs if available: look at the 200 most recent lines before the termination timestamp. If absent, flag `no_logs_available`; do not invent a log pattern.
APM if the pod runs an instrumented service: resolve `service.name` from pod resource attributes for later correlation. SLO / latency / error-rate analysis itself is APM-layer work and out of scope for this skill.
Baseline comparison — for utilization-based findings, compare current values to 7-day-prior at the same hour-of-day. "High memory" is meaningful only relative to what's normal for this workload.
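The "same hour-of-day, 7 days prior" window is plain date arithmetic. The sketch below uses a hypothetical helper, `baseline_window`, and an arbitrary example date; the resulting bounds would then be used as the time range of the baseline query.

```python
from datetime import datetime, timedelta, timezone


def baseline_window(start, end, days_back=7):
    """Shift an investigation window back by whole days, keeping hour-of-day."""
    shift = timedelta(days=days_back)
    return start - shift, end - shift


end = datetime(2024, 5, 8, 14, 0, tzinfo=timezone.utc)
start = end - timedelta(hours=1)
b_start, b_end = baseline_window(start, end)
print(b_start.isoformat(), b_end.isoformat())
# 2024-05-01T13:00:00+00:00 2024-05-01T14:00:00+00:00
```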

Check for upstream cause (conditional)

Only pursue if the symptom pattern suggests it. Threshold: upstream error rate >5× baseline or latency >3× baseline, AND degradation started before the symptom on the target service. Co-symptoms do not establish causation.
If `metrics-service_destination.1m.otel-default` has no rows for the service, report `insufficient_dependency_data`, not "upstreams healthy."
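The threshold and the empty-data rule can be combined into one decision function. This is a sketch under stated assumptions: the function name `upstream_verdict` and its return labels other than `insufficient_dependency_data` are invented here, and the ratios are assumed to be precomputed as current value divided by baseline.

```python
from datetime import datetime


def upstream_verdict(error_ratio, latency_ratio, upstream_onset, target_onset):
    """Apply the upstream-cause threshold described above.

    error_ratio / latency_ratio: upstream value divided by its baseline,
    or None when no destination metrics exist for the service.
    """
    if error_ratio is None and latency_ratio is None:
        return "insufficient_dependency_data"
    large_delta = (error_ratio or 0) > 5 or (latency_ratio or 0) > 3
    preceded = (
        upstream_onset is not None
        and target_onset is not None
        and upstream_onset < target_onset
    )
    return "plausible_upstream_cause" if large_delta and preceded else "co_symptom_only"


t0 = datetime(2024, 5, 1, 10, 0)
t1 = datetime(2024, 5, 1, 10, 5)
print(upstream_verdict(8.0, 1.2, t0, t1))    # upstream degraded first, large delta
print(upstream_verdict(8.0, 1.2, t1, t0))    # upstream degraded after the target
print(upstream_verdict(None, None, None, None))
```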

Check for recent change (conditional)

`SuccessfulCreate` / `Pulled` events in the last 2 hours often correlate with deploys. `logs-k8sobjectsreceiver.otel-*` shows configmap/secret/deployment spec changes. A change within 15 minutes of the symptom onset is a strong correlation, but still a correlation; verify it plausibly explains the mode you've classified.

Synthesize and stop

Synthesize as soon as you have enough evidence to support a hypothesis at known confidence. You do not need to complete every section above — investigation terminates when either:
  • You have a high-confidence hypothesis with corroboration, or
  • You have a low/medium-confidence hypothesis and further queries are unlikely to change the picture (e.g., logs are unavailable, APM isn't instrumented, no recent changes found).

Synthesis

Default structure:
text
HYPOTHESIS (confidence: high | medium | low)
<One paragraph: service, symptom, most likely cause. Name the failure mode from the taxonomy.>

EVIDENCE
- <Finding from characterization, with the concrete metric or value.>
- <Finding from events / logs / APM.>
- <Finding from baseline comparison, dependency check, or change correlation if pursued.>

CONFIDENCE NOTE
<Only if not 'high'. What specific evidence is missing or ambiguous.>

RECOMMENDED NEXT STEPS
1. <Most actionable — typically a config check or metric to observe.>
2. <Secondary.>

DOWNSTREAM IMPACT
<Services depending on this workload, or 'No downstream dependencies identified.'>
When two hypotheses are live: replace HYPOTHESIS with COMPETING HYPOTHESES; list both, say which you lean toward and why, and list the evidence that would disambiguate them.
When no incident is found (symptom resolved, or alert appears spurious): say so directly. `ALERT FIRED BUT SYSTEM APPEARS HEALTHY` is a valid output. List what you checked and what you didn't find.
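For tooling that assembles the synthesis programmatically, the default structure can be rendered from a handful of fields. The sketch below is one possible shape; the function name `render_synthesis`, its parameters, and the sample values are hypothetical, and it covers only the single-hypothesis case.

```python
def render_synthesis(confidence, hypothesis, evidence, next_steps,
                     confidence_note=None, downstream=None):
    """Render the default synthesis structure as plain text."""
    lines = [f"HYPOTHESIS (confidence: {confidence})", hypothesis, "", "EVIDENCE"]
    lines += [f"- {item}" for item in evidence]
    # The confidence note only appears when confidence is not 'high'.
    if confidence != "high" and confidence_note:
        lines += ["", "CONFIDENCE NOTE", confidence_note]
    lines += ["", "RECOMMENDED NEXT STEPS"]
    lines += [f"{i}. {step}" for i, step in enumerate(next_steps, start=1)]
    lines += ["", "DOWNSTREAM IMPACT",
              downstream or "No downstream dependencies identified."]
    return "\n".join(lines)


report = render_synthesis(
    confidence="medium",
    hypothesis="checkout pod restarts are most likely OOMKilled (undersized limit).",
    evidence=["memory_limit_utilization peaked at 1.0 before each restart"],
    next_steps=["Compare the limit against the 7-day working set."],
    confidence_note="no_logs_available: application logs were not collected.",
)
print(report)
```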

Confidence calibration

Start at high and downgrade based on what's missing:
  • Downgrade to medium if: primary signal is clear but corroboration is missing (no logs, no APM, no baseline comparison possible). Or: two modes fit and you can't disambiguate.
  • Downgrade to low if: only a single signal supports the hypothesis, signals conflict, or the mode requires evidence you couldn't fetch.
Never return high when application log data was absent and the hypothesis depends on application behavior. Absence of evidence does not corroborate a hypothesis.
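The downgrade rules can be expressed as a short function. This is a sketch, not a prescribed implementation: the name `calibrate` and the boolean inputs are assumptions about how an investigator might encode what they found, and the rule order (low overrides medium) is one reasonable reading of the text.

```python
def calibrate(*, corroborated, supporting_signals, signals_conflict=False,
              modes_ambiguous=False, evidence_unfetchable=False,
              depends_on_app_behavior=False, app_logs_present=True):
    """Start at 'high' and downgrade, following the rules in the text."""
    level = "high"
    if not corroborated or modes_ambiguous:
        level = "medium"
    if supporting_signals <= 1 or signals_conflict or evidence_unfetchable:
        level = "low"
    # Never 'high' when the hypothesis rests on app behavior without app logs.
    if level == "high" and depends_on_app_behavior and not app_logs_present:
        level = "medium"
    return level


print(calibrate(corroborated=True, supporting_signals=3))   # high
print(calibrate(corroborated=False, supporting_signals=2))  # medium
print(calibrate(corroborated=False, supporting_signals=1))  # low
```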

Query recipes

Most-restarting pods in a namespace

esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>" AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts) BY k8s.pod.name, k8s.container.status.last_terminated_reason
| WHERE restarts > 0
| SORT restarts DESC
| LIMIT 20

CPU throttling check for a pod

esql
FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 30 minutes
| STATS max_cpu_ratio = ROUND(MAX(k8s.pod.cpu_limit_utilization), 2),
        p95_cpu_ratio = ROUND(PERCENTILE(k8s.pod.cpu_limit_utilization, 95), 2),
        avg_cpu_ratio = ROUND(AVG(k8s.pod.cpu_limit_utilization), 2),
        max_cpu_cores = ROUND(MAX(k8s.pod.cpu.usage), 3)
Sustained ratio >1.0 = throttling. Transient >1.0 with avg <0.5 is usually benign burst.

Nodes under memory pressure (right now)

esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE @timestamp > NOW() - 15 minutes AND k8s.node.condition_memory_pressure == 1
| STATS ts = MAX(@timestamp) BY k8s.node.name
| SORT ts DESC

Admission denials (webhook or quota) last hour

esql
FROM logs-k8seventsreceiver.otel-*
| WHERE @timestamp > NOW() - 1 hour
  AND (attributes.k8s.event.reason == "FailedCreate"
       OR body.text LIKE "*admission webhook*"
       OR body.text LIKE "*exceeded quota*")
| SORT @timestamp DESC
| KEEP @timestamp, k8s.namespace.name, attributes.k8s.event.reason, body.text
| LIMIT 30

Firing K8s alerts

text
GET /api/alerting/rules/_find?search=k8s&search_fields=tags&filter=alert.attributes.executionStatus.status:active

Examples

"Why is my pod CrashLoopBackOff-ing?"

Characterize first: get restart count, termination reason, memory and CPU utilization.
  • If last_terminated_reason == "OOMKilled" and memory utilization hit 1.0 → memory path. Corroborate with 7-day baseline: monotonic rise over days = leak; spiky = load-driven. Check GC metrics if language is known.
  • If last_terminated_reason == "Error" and cpu_limit_utilization > 1.0 → CPU throttling path. Corroborate with liveness probe config (initialDelaySeconds, timeoutSeconds) and K8s events for Unhealthy.
  • If last_terminated_reason == "Error" and CPU is fine → application-logic path. Pull recent logs before termination.
  • If last_terminated_reason == "ContainerCannotRun" → image/exec path. Check K8s events for Failed pull events.
Synthesize with appropriate confidence. If logs were unavailable on the Error path, downgrade to medium and say so.
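The branching above can be sketched as a small routing function. The name `crashloop_path` and the path labels are hypothetical; the inputs are assumed to come from the characterization step (termination reason plus the peak memory and CPU limit-utilization ratios).

```python
def crashloop_path(reason: str,
                   memory_limit_utilization: float,
                   cpu_limit_utilization: float) -> str:
    """Pick the investigation path from the last termination reason and ratios."""
    if reason == "OOMKilled" and memory_limit_utilization >= 1.0:
        return "memory"
    if reason == "Error":
        # Over the CPU limit: check throttling and liveness probes first.
        if cpu_limit_utilization > 1.0:
            return "cpu-throttling"
        return "application-logic"
    if reason == "ContainerCannotRun":
        return "image/exec"
    # Anything else is not covered by the branches above.
    return "unclassified"
```

A combination that falls through, such as OOMKilled with memory well below the limit, is returned as unclassified instead of being forced into a path.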

"Is my rollout stuck?"

Authoritative signal: k8s.deployment.available < k8s.deployment.desired for > 10 minutes.
Diagnose the constraint:
  • K8s events on the new ReplicaSet: FailedCreate → admission rejection (quota, webhook, PSP). FailedScheduling → no node fits.
  • New-pod utilization: all at 0% memory → never started (image pull failure); high CPU with low memory → slow startup hitting readiness probe.
  • HPA state: stable current_replicas < desired_replicas under load → unready-pod dampening.
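The "available below desired for more than 10 minutes" check can be sketched as follows, assuming the deployment metrics have already been fetched as timestamped samples. The function name `rollout_stuck` and the sample tuple shape are hypothetical.

```python
from datetime import datetime, timedelta


def rollout_stuck(samples, now, threshold=timedelta(minutes=10)):
    """samples: iterable of (timestamp, available, desired) in any order."""
    deficit_start = None
    # Walk from the newest sample backwards while the deficit holds.
    for ts, available, desired in sorted(samples, key=lambda s: s[0], reverse=True):
        if available < desired:
            deficit_start = ts
        else:
            break  # deficit was interrupted; older samples do not count
    # Stuck only if the uninterrupted deficit is older than the threshold.
    return deficit_start is not None and now - deficit_start > threshold
```

Walking backwards from the newest sample means an earlier, already-recovered deficit does not count toward the 10 minutes.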

"Alert fired but everything looks healthy"

Possible and worth naming explicitly. Check:
  • Has the symptom resolved? Compare current utilization/restart rate to the alert trigger point.
  • Was the alert a transient spike that's already decayed?
  • Is the alert tuned appropriately (e.g., too-short evaluation window)?
Output: ALERT FIRED BUT SYSTEM APPEARS HEALTHY with what you checked. Recommend alert tuning if the pattern is recurrent.

Related

  • Workflow: K8s CrashLoopBackOff Investigation, the alert-triggered automated version of the pod-level path above. It runs deterministic ES|QL queries and branches; this skill provides the interpretation layer the workflow lacks.
  • Forge genome library: 16 K8s failure scenarios (OOMKill cascade, CPU throttling, probe misconfig, node NotReady, admission webhook block, etc.) validating this skill's coverage.