
# Debug Buttercup

## When to Use

- Pods in the `crs` namespace are in CrashLoopBackOff, OOMKilled, or restarting
- Multiple services restart simultaneously (cascade failure)
- Redis is unresponsive or showing AOF warnings
- Queues are growing but tasks are not progressing
- Nodes show DiskPressure, MemoryPressure, or PID pressure
- Build-bot cannot reach the Docker daemon (DinD failures)
- Scheduler is stuck and not advancing task state
- Health check probes are failing unexpectedly
- Deployed Helm values don't match actual pod configuration

## When NOT to Use

- Deploying or upgrading Buttercup (use Helm and the deployment guides)
- Debugging issues outside the `crs` Kubernetes namespace
- Performance tuning that doesn't involve a failure symptom

## Namespace and Services

All pods run in the `crs` namespace. Key services:

| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |

## Triage Workflow

Always start with triage. Run these three commands first:

```bash
# 1. Pod status - look for restarts, CrashLoopBackOff, OOMKilled
kubectl get pods -n crs -o wide

# 2. Events - the timeline of what went wrong
kubectl get events -n crs --sort-by='.lastTimestamp'

# 3. Warnings only - filter the noise
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'
```

Then narrow down:

```bash
# Why did a specific pod restart? Check Last State Reason (OOMKilled, Error, Completed)
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'

# Check actual resource limits vs intended
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Crashed container's logs (--previous = the container that died)
kubectl logs -n crs <pod-name> --previous --tail=200

# Current logs
kubectl logs -n crs <pod-name> --tail=200
```

## Historical vs Ongoing Issues

High restart counts don't necessarily mean an issue is ongoing -- restarts accumulate over a pod's lifetime. Always distinguish:

- `--tail` shows the end of the log buffer, which may contain old messages. Use `--since=300s` to confirm issues are actively happening now.
- `--timestamps` on log output helps correlate events across services.
- Check `Last State` timestamps in `describe pod` to see when the most recent crash actually occurred.
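The recency check can be sketched offline: given one line of `--timestamps` log output, extract the leading timestamp, compute its age, and classify it against the same 300-second window that `--since=300s` uses. The sample line is illustrative, and GNU `date` is assumed for ISO-8601 parsing:

```shell
# Classify a single --timestamps log line as ongoing or historical.
# The sample line and the 300s window are illustrative; GNU date assumed.
line='2024-05-01T12:00:00Z redis.exceptions.ConnectionError'
ts=${line%% *}                 # timestamp is the first field
t0=$(date -u -d "$ts" +%s)     # log line's time as epoch seconds
now=$(date -u +%s)
if [ $(( now - t0 )) -le 300 ]; then
  echo ongoing
else
  echo historical
fi
```

Real `kubectl logs --timestamps` output carries nanosecond fractions (RFC3339Nano); GNU `date` accepts those too.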

## Cascade Detection

When many pods restart around the same time, check for a shared-dependency failure before investigating individual pods. The most common cascade: Redis goes down -> every service gets `ConnectionError`/`ConnectionRefusedError` -> mass restarts. Look for the same error across multiple `--previous` logs -- if they all say `redis.exceptions.ConnectionError`, debug Redis, not the individual services.
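One quick way to apply this check is to save each pod's `--previous` log to a file and count how many share the Redis signature. The loop below is a sketch over saved files -- the three sample logs are simulated stand-ins; against a live cluster you would populate the directory with `kubectl logs -n crs <pod> --previous` output per pod:

```shell
# Sketch: count how many crashed pods share the Redis error signature.
# The sample log files are simulated stand-ins for real --previous output.
dir=$(mktemp -d)
printf 'redis.exceptions.ConnectionError: Connection refused\n' > "$dir/fuzzer-bot-0.log"
printf 'Traceback: redis.exceptions.ConnectionError\n'          > "$dir/build-bot-0.log"
printf 'task 42 completed\n'                                    > "$dir/ui-0.log"

# Pods whose previous logs mention the shared signature
matches=$(grep -l 'redis.exceptions.ConnectionError' "$dir"/*.log | wc -l | tr -d ' ')
total=$(ls "$dir"/*.log | wc -l | tr -d ' ')
echo "$matches of $total crashed pods hit the Redis signature"
```

If most crashed pods match, debug Redis first; scattered, unrelated errors point back at the individual services.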

## Log Analysis

```bash
# All replicas of a service at once
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix

# Stream live
kubectl logs -n crs -l app.kubernetes.io/name=redis -f

# Collect all logs to disk (existing script)
bash deployment/collect-logs.sh
```

## Resource Pressure

```bash
# Per-pod CPU/memory
kubectl top pods -n crs

# Node-level
kubectl top nodes

# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node> | grep -A5 Conditions

# Disk usage inside a pod
kubectl exec -n crs <pod> -- df -h

# What's eating disk
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'
```

## Redis Debugging

Redis is the backbone. When it goes down, everything cascades.

```bash
# Redis pod status
kubectl get pods -n crs -l app.kubernetes.io/name=redis

# Redis logs (AOF warnings, OOM, connection issues)
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200

# Connect to Redis CLI
kubectl exec -n crs <redis-pod> -- redis-cli

# Inside redis-cli: key diagnostics
INFO memory        # used_memory_human, maxmemory
INFO persistence   # aof_enabled, aof_last_bgrewrite_status, aof_delayed_fsync
INFO clients       # connected_clients, blocked_clients
INFO stats         # total_connections_received, rejected_connections
CLIENT LIST        # see who's connected
DBSIZE             # total keys

# AOF configuration
CONFIG GET appendonly    # is AOF enabled?
CONFIG GET appendfsync   # fsync policy: everysec, always, or no

# What is /data mounted on? (disk vs tmpfs matters for AOF performance)
kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/
```

## Queue Inspection

Buttercup uses Redis streams with consumer groups. Queue names:

| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |

```bash
# Check stream length (pending messages)
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue

# Check consumer group lag
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue

# Check pending messages per consumer
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10

# Task registry size
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry

# Task state counts
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasks
```

Consumer groups: `build_bot_consumers`, `orchestrator_group`, `patcher_group`, `index_group`, `tracer_bot_group`.
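To spot a stuck consumer group at a glance, you can pull the `pending` field out of `XINFO GROUPS` output. When redis-cli runs non-interactively (as under `kubectl exec` with a pipe) it prints each reply element on its own line; the block below simulates that raw output, and the 100-message threshold is an arbitrary example, not a Buttercup default:

```shell
# Sketch: alert when a consumer group's pending count crosses a threshold.
# `out` simulates raw (non-tty) redis-cli XINFO GROUPS output; replace with:
#   kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue
out='name
build_bot_consumers
consumers
3
pending
120
last-delivered-id
1715000000000-0'

# The value follows the "pending" label on the next line
pending=$(printf '%s\n' "$out" | grep -A1 '^pending$' | tail -n 1)
if [ "$pending" -gt 100 ]; then
  echo "backlog: $pending pending messages"
else
  echo "ok: $pending pending"
fi
```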

## Health Checks

Pods write timestamps to `/tmp/health_check_alive`. The liveness probe checks file freshness.

```bash
# Check health file freshness
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_alive
```

If a pod is restart-looping, the health check file is likely going stale because the main process is blocked (e.g. waiting on Redis, stuck on I/O).
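The probe's logic amounts to an age check on the file's mtime. A minimal sketch, using a local temp file in place of `/tmp/health_check_alive`, a 60-second threshold (the real threshold lives in the chart's probe settings), and GNU `stat`:

```shell
# Sketch of the freshness test a liveness probe performs: fail (exit 1)
# when the health file's mtime is older than the threshold. The temp file
# and the 60s threshold are illustrative; GNU stat (-c %Y) assumed.
f=$(mktemp)   # stands in for /tmp/health_check_alive
touch "$f"
age=$(( $(date +%s) - $(stat -c %Y "$f") ))
if [ "$age" -le 60 ]; then
  echo "fresh (age ${age}s)"
else
  echo "stale (age ${age}s)"
  exit 1
fi
```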

## Telemetry (OpenTelemetry / Signoz)

All services export traces and metrics via OpenTelemetry. If Signoz is deployed (`global.signoz.deployed: true`), use its UI for distributed tracing across services.

```bash
# Check if OTEL is configured
kubectl exec -n crs <pod> -- env | grep OTEL

# Verify Signoz pods are running (if deployed)
kubectl get pods -n platform -l app.kubernetes.io/name=signoz
```

Traces are especially useful for diagnosing slow task processing, identifying which service in a pipeline is the bottleneck, and correlating events across the scheduler -> build-bot -> fuzzer-bot chain.

## Volume and Storage

```bash
# PVC status
kubectl get pvc -n crs

# Check if corpus tmpfs is mounted, its size, and backing type
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null

# Check if CORPUS_TMPFS_PATH is set
kubectl exec -n crs <pod> -- env | grep CORPUS

# Full disk layout - what's on real disk vs tmpfs
kubectl exec -n crs <pod> -- df -h
```

`CORPUS_TMPFS_PATH` is set when `global.volumes.corpusTmpfs.enabled: true`. This affects fuzzer-bot, coverage-bot, seed-gen, and merger-bot.

## Deployment Config Verification

When behavior doesn't match expectations, verify Helm values actually took effect:

```bash
# Check a pod's actual resource limits
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Check a pod's actual volume definitions
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'
```

Helm values template typos (e.g. wrong key names) silently fall back to chart defaults. If deployed resources don't match the values template, check for key name mismatches.

## Service-Specific Debugging

For detailed per-service symptoms, root causes, and fixes, see references/failure-patterns.md. Quick reference:

- DinD: `kubectl logs -n crs -l app=dind --tail=100` -- look for Docker daemon crashes, storage driver errors
- Build-bot: check build queue depth, DinD connectivity, OOM during compilation
- Fuzzer-bot: corpus disk usage, CPU throttling, crash queue backlog
- Patcher: LiteLLM connectivity, LLM timeout, patch queue depth
- Scheduler: the central brain -- `kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"`

## Diagnostic Script

Run the automated triage snapshot:

```bash
bash {baseDir}/scripts/diagnose.sh
```

Pass `--full` to also dump recent logs from all pods:

```bash
bash {baseDir}/scripts/diagnose.sh --full
```

This collects pod status, events, resource usage, Redis health, and queue depths in one pass.