
# Debug Buttercup

## When to Use

- Pods in the `crs` namespace are in CrashLoopBackOff, OOMKilled, or restarting
- Multiple services restart simultaneously (cascade failure)
- Redis is unresponsive or showing AOF warnings
- Queues are growing but tasks are not progressing
- Nodes show DiskPressure, MemoryPressure, or PID pressure
- Build-bot cannot reach the Docker daemon (DinD failures)
- Scheduler is stuck and not advancing task state
- Health check probes are failing unexpectedly
- Deployed Helm values don't match actual pod configuration

## When NOT to Use

- Deploying or upgrading Buttercup (use Helm and the deployment guides)
- Debugging issues outside the `crs` Kubernetes namespace
- Performance tuning that doesn't involve a failure symptom

## Namespace and Services

All pods run in the `crs` namespace. Key services:

| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |

## Triage Workflow

Always start with triage. Run these three commands first:

```bash
# 1. Pod status - look for restarts, CrashLoopBackOff, OOMKilled
kubectl get pods -n crs -o wide

# 2. Events - the timeline of what went wrong
kubectl get events -n crs --sort-by='.lastTimestamp'

# 3. Warnings only - filter the noise
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'
```

Then narrow down:

```bash
# Why did a specific pod restart? Check Last State Reason (OOMKilled, Error, Completed)
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'

# Check actual resource limits vs intended
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Crashed container's logs (--previous = the container that died)
kubectl logs -n crs <pod-name> --previous --tail=200

# Current logs
kubectl logs -n crs <pod-name> --tail=200
```

## Historical vs Ongoing Issues

High restart counts don't necessarily mean an issue is ongoing -- restarts accumulate over a pod's lifetime. Always distinguish:

- `--tail` shows the end of the log buffer, which may contain old messages. Use `--since=300s` to confirm issues are actively happening now.
- `--timestamps` on log output helps correlate events across services.
- Check `Last State` timestamps in `describe pod` to see when the most recent crash actually occurred.
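The recency check can be sketched offline: given one line of `--timestamps` log output, extract the leading timestamp, compute its age, and classify it against the same 300-second window that `--since=300s` uses. The sample line is illustrative, and GNU `date` is assumed for ISO-8601 parsing:

```shell
# Classify a single --timestamps log line as ongoing or historical.
# The sample line and the 300s window are illustrative; GNU date assumed.
line='2024-05-01T12:00:00Z redis.exceptions.ConnectionError'
ts=${line%% *}                 # timestamp is the first field
t0=$(date -u -d "$ts" +%s)     # log line's time as epoch seconds
now=$(date -u +%s)
if [ $(( now - t0 )) -le 300 ]; then
  echo ongoing
else
  echo historical
fi
```

Real `kubectl logs --timestamps` output carries nanosecond fractions (RFC3339Nano); GNU `date` accepts those too.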

## Cascade Detection

When many pods restart around the same time, check for a shared-dependency failure before investigating individual pods. The most common cascade: Redis goes down -> every service gets `ConnectionError`/`ConnectionRefusedError` -> mass restarts. Look for the same error across multiple `--previous` logs -- if they all say `redis.exceptions.ConnectionError`, debug Redis, not the individual services.
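One quick way to apply this check is to save each pod's `--previous` log to a file and count how many share the Redis signature. The loop below is a sketch over saved files -- the three sample logs are simulated stand-ins; against a live cluster you would populate the directory with `kubectl logs -n crs <pod> --previous` output per pod:

```shell
# Sketch: count how many crashed pods share the Redis error signature.
# The sample log files are simulated stand-ins for real --previous output.
dir=$(mktemp -d)
printf 'redis.exceptions.ConnectionError: Connection refused\n' > "$dir/fuzzer-bot-0.log"
printf 'Traceback: redis.exceptions.ConnectionError\n'          > "$dir/build-bot-0.log"
printf 'task 42 completed\n'                                    > "$dir/ui-0.log"

# Pods whose previous logs mention the shared signature
matches=$(grep -l 'redis.exceptions.ConnectionError' "$dir"/*.log | wc -l | tr -d ' ')
total=$(ls "$dir"/*.log | wc -l | tr -d ' ')
echo "$matches of $total crashed pods hit the Redis signature"
```

If most crashed pods match, debug Redis first; scattered, unrelated errors point back at the individual services.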

## Log Analysis

```bash
# All replicas of a service at once
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix

# Stream live
kubectl logs -n crs -l app.kubernetes.io/name=redis -f

# Collect all logs to disk (existing script)
bash deployment/collect-logs.sh
```

## Resource Pressure

```bash
# Per-pod CPU/memory
kubectl top pods -n crs

# Node-level
kubectl top nodes

# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node> | grep -A5 Conditions

# Disk usage inside a pod
kubectl exec -n crs <pod> -- df -h

# What's eating disk
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'
```

## Redis Debugging

Redis is the backbone. When it goes down, everything cascades.

```bash
# Redis pod status
kubectl get pods -n crs -l app.kubernetes.io/name=redis

# Redis logs (AOF warnings, OOM, connection issues)
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200

# Connect to Redis CLI
kubectl exec -n crs <redis-pod> -- redis-cli

# Inside redis-cli: key diagnostics
INFO memory        # used_memory_human, maxmemory
INFO persistence   # aof_enabled, aof_last_bgrewrite_status, aof_delayed_fsync
INFO clients       # connected_clients, blocked_clients
INFO stats         # total_connections_received, rejected_connections
CLIENT LIST        # see who's connected
DBSIZE             # total keys

# AOF configuration
CONFIG GET appendonly    # is AOF enabled?
CONFIG GET appendfsync   # fsync policy: everysec, always, or no

# What is /data mounted on? (disk vs tmpfs matters for AOF performance)
kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/
```

## Queue Inspection

Buttercup uses Redis streams with consumer groups. Queue names:

| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |

```bash
# Check stream length (pending messages)
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue

# Check consumer group lag
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue

# Check pending messages per consumer
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10

# Task registry size
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry

# Task state counts
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasks
```

Consumer groups: `build_bot_consumers`, `orchestrator_group`, `patcher_group`, `index_group`, `tracer_bot_group`.
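To spot a stuck consumer group at a glance, you can pull the `pending` field out of `XINFO GROUPS` output. When redis-cli runs non-interactively (as under `kubectl exec` with a pipe) it prints each reply element on its own line; the block below simulates that raw output, and the 100-message threshold is an arbitrary example, not a Buttercup default:

```shell
# Sketch: alert when a consumer group's pending count crosses a threshold.
# `out` simulates raw (non-tty) redis-cli XINFO GROUPS output; replace with:
#   kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue
out='name
build_bot_consumers
consumers
3
pending
120
last-delivered-id
1715000000000-0'

# The value follows the "pending" label on the next line
pending=$(printf '%s\n' "$out" | grep -A1 '^pending$' | tail -n 1)
if [ "$pending" -gt 100 ]; then
  echo "backlog: $pending pending messages"
else
  echo "ok: $pending pending"
fi
```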

## Health Checks

Pods write timestamps to `/tmp/health_check_alive`. The liveness probe checks file freshness.

```bash
# Check health file freshness
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_alive
```

If a pod is restart-looping, the health check file is likely going stale because the main process is blocked (e.g. waiting on Redis, stuck on I/O).
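The probe's logic amounts to an age check on the file's mtime. A minimal sketch, using a local temp file in place of `/tmp/health_check_alive`, a 60-second threshold (the real threshold lives in the chart's probe settings), and GNU `stat`:

```shell
# Sketch of the freshness test a liveness probe performs: fail (exit 1)
# when the health file's mtime is older than the threshold. The temp file
# and the 60s threshold are illustrative; GNU stat (-c %Y) assumed.
f=$(mktemp)   # stands in for /tmp/health_check_alive
touch "$f"
age=$(( $(date +%s) - $(stat -c %Y "$f") ))
if [ "$age" -le 60 ]; then
  echo "fresh (age ${age}s)"
else
  echo "stale (age ${age}s)"
  exit 1
fi
```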

## Telemetry (OpenTelemetry / Signoz)

All services export traces and metrics via OpenTelemetry. If Signoz is deployed (`global.signoz.deployed: true`), use its UI for distributed tracing across services.

```bash
# Check if OTEL is configured
kubectl exec -n crs <pod> -- env | grep OTEL

# Verify Signoz pods are running (if deployed)
kubectl get pods -n platform -l app.kubernetes.io/name=signoz
```

Traces are especially useful for diagnosing slow task processing, identifying which service in a pipeline is the bottleneck, and correlating events across the scheduler -> build-bot -> fuzzer-bot chain.

## Volume and Storage

```bash
# PVC status
kubectl get pvc -n crs

# Check if corpus tmpfs is mounted, its size, and backing type
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null

# Check if CORPUS_TMPFS_PATH is set
kubectl exec -n crs <pod> -- env | grep CORPUS

# Full disk layout - what's on real disk vs tmpfs
kubectl exec -n crs <pod> -- df -h
```

`CORPUS_TMPFS_PATH` is set when `global.volumes.corpusTmpfs.enabled: true`. This affects fuzzer-bot, coverage-bot, seed-gen, and merger-bot.

## Deployment Config Verification

When behavior doesn't match expectations, verify Helm values actually took effect:

```bash
# Check a pod's actual resource limits
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Check a pod's actual volume definitions
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'
```

Helm values template typos (e.g. wrong key names) silently fall back to chart defaults. If deployed resources don't match the values template, check for key name mismatches.

## Service-Specific Debugging

For detailed per-service symptoms, root causes, and fixes, see references/failure-patterns.md. Quick reference:

- DinD: `kubectl logs -n crs -l app=dind --tail=100` -- look for Docker daemon crashes, storage driver errors
- Build-bot: check build queue depth, DinD connectivity, OOM during compilation
- Fuzzer-bot: corpus disk usage, CPU throttling, crash queue backlog
- Patcher: LiteLLM connectivity, LLM timeout, patch queue depth
- Scheduler: the central brain -- `kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"`

## Diagnostic Script

Run the automated triage snapshot:

```bash
bash {baseDir}/scripts/diagnose.sh
```

Pass `--full` to also dump recent logs from all pods:

```bash
bash {baseDir}/scripts/diagnose.sh --full
```

This collects pod status, events, resource usage, Redis health, and queue depths in one pass.