# Debug Buttercup
## When to Use
- Pods in the namespace are in CrashLoopBackOff, OOMKilled, or restarting
- Multiple services restart simultaneously (cascade failure)
- Redis is unresponsive or showing AOF warnings
- Queues are growing but tasks are not progressing
- Nodes show DiskPressure, MemoryPressure, or PID pressure
- Build-bot cannot reach the Docker daemon (DinD failures)
- Scheduler is stuck and not advancing task state
- Health check probes are failing unexpectedly
- Deployed Helm values don't match actual pod configuration
## When NOT to Use
- Deploying or upgrading Buttercup (use Helm and deployment guides)
- Debugging issues outside the Kubernetes namespace
- Performance tuning that doesn't involve a failure symptom
## Namespace and Services
All pods run in the `crs` namespace. Key services:

| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |
## Triage Workflow
Always start with triage. Run these three commands first:

```bash
# 1. Pod status - look for restarts, CrashLoopBackOff, OOMKilled
kubectl get pods -n crs -o wide

# 2. Events - the timeline of what went wrong
kubectl get events -n crs --sort-by='.lastTimestamp'

# 3. Warnings only - filter the noise
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'
```
Then narrow down:
```bash
# Why did a specific pod restart? Check Last State Reason (OOMKilled, Error, Completed)
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'

# Check actual resource limits vs intended
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Crashed container's logs (--previous = the container that died)
kubectl logs -n crs <pod-name> --previous --tail=200

# Current logs
kubectl logs -n crs <pod-name> --tail=200
```
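When triage shows lots of restarts, it can help to rank pods by restart count before diving in. A minimal sketch — the `top_restarts` helper is ours, not part of Buttercup, and the `jsonpath` expression assumes single-container pods:

```bash
# Rank "name restarts" pairs read from stdin, highest restart count first.
top_restarts() {
  sort -k2,2nr
}

# Usage against the cluster (requires kubectl access; assumes one container per pod):
# kubectl get pods -n crs \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#   | top_restarts | head
```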
## Historical vs Ongoing Issues
High restart counts don't necessarily mean an issue is ongoing -- restarts accumulate over a pod's lifetime. Always distinguish:
- `--tail` shows the end of the log buffer, which may contain old messages. Use `--since=300s` to confirm issues are actively happening now.
- `--timestamps` on log output helps correlate events across services.
- Check timestamps in `Last State` from `describe pod` to see when the most recent crash actually occurred.
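To decide mechanically whether the last crash counts as recent, compare its `finishedAt` timestamp against a window. A sketch assuming GNU `date` — the `is_recent` helper and the 600-second window are illustrative, not part of Buttercup:

```bash
# Classify a crash as "ongoing" if it finished within the last WINDOW seconds.
# is_recent is a hypothetical helper; requires GNU date for -d parsing.
is_recent() {
  local ts="$1" window="$2" t_then t_now
  t_then=$(date -d "$ts" +%s) || return 2
  t_now=$(date +%s)
  [ $(( t_now - t_then )) -le "$window" ]
}

# Usage (requires a cluster):
# finished=$(kubectl get pod -n crs <pod-name> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.finishedAt}')
# is_recent "$finished" 600 && echo "recent crash" || echo "historical"
```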
## Cascade Detection
When many pods restart around the same time, check for a shared-dependency failure before investigating individual pods. The most common cascade: Redis goes down -> every service gets `ConnectionError`/`ConnectionRefusedError` -> mass restarts. Look for the same error across multiple `--previous` logs -- if they all say `redis.exceptions.ConnectionError`, debug Redis, not the individual services.
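The cascade check can be sketched as a loop over crashed-pod logs. Everything below is illustrative — `count_affected` and the driver functions are ours, not part of Buttercup; the driver is injected so the counting logic can run against any log source:

```bash
# Count how many pods' previous (crashed) container logs contain the same
# Redis connection error. A high count suggests a shared-dependency cascade.
PATTERN='redis.exceptions.ConnectionError'

count_affected() {
  local driver="$1" hits=0 pod
  for pod in $("$driver" list); do
    if "$driver" logs "$pod" | grep -q "$PATTERN"; then
      hits=$((hits + 1))
    fi
  done
  echo "$hits"
}

# Real driver (requires a cluster):
# kubectl_driver() {
#   case "$1" in
#     list) kubectl get pods -n crs -o jsonpath='{.items[*].metadata.name}' ;;
#     logs) kubectl logs -n crs "$2" --previous --tail=200 2>/dev/null ;;
#   esac
# }
# count_affected kubectl_driver   # high count -> debug Redis, not the pods
```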
## Log Analysis
```bash
# All replicas of a service at once
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix

# Stream live
kubectl logs -n crs -l app.kubernetes.io/name=redis -f

# Collect all logs to disk (existing script)
bash deployment/collect-logs.sh
```
## Resource Pressure
```bash
# Per-pod CPU/memory
kubectl top pods -n crs

# Node-level
kubectl top nodes

# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node> | grep -A5 Conditions

# Disk usage inside a pod
kubectl exec -n crs <pod> -- df -h

# What's eating disk
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'
```
## Redis Debugging
Redis is the backbone. When it goes down, everything cascades.
```bash
# Redis pod status
kubectl get pods -n crs -l app.kubernetes.io/name=redis

# Redis logs (AOF warnings, OOM, connection issues)
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200

# Connect to Redis CLI
kubectl exec -n crs <redis-pod> -- redis-cli

# Inside redis-cli: key diagnostics
INFO memory       # used_memory_human, maxmemory
INFO persistence  # aof_enabled, aof_last_bgrewrite_status, aof_delayed_fsync
INFO clients      # connected_clients, blocked_clients
INFO stats        # total_connections_received, rejected_connections
CLIENT LIST       # see who's connected
DBSIZE            # total keys

# AOF configuration
CONFIG GET appendonly   # is AOF enabled?
CONFIG GET appendfsync  # fsync policy: everysec, always, or no
```
```bash
# What is /data mounted on? (disk vs tmpfs matters for AOF performance)
kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/
```
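One useful derived number from `INFO memory` is how close Redis is to `maxmemory`. A rough sketch — the `memory_pct` helper is ours, not part of Buttercup; `maxmemory:0` means no limit is set:

```bash
# Compute used_memory as a percentage of maxmemory from `INFO memory` text
# read on stdin. Tolerates redis-cli's CRLF line endings; prints "unlimited"
# when maxmemory is 0.
memory_pct() {
  awk -F: '
    /^used_memory:/ { used = $2 + 0 }
    /^maxmemory:/   { max  = $2 + 0 }
    END {
      if (max == 0) print "unlimited"
      else printf "%.0f%%\n", 100 * used / max
    }'
}

# Usage (requires a cluster):
# kubectl exec -n crs <redis-pod> -- redis-cli INFO memory | memory_pct
```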
## Queue Inspection
Buttercup uses Redis streams with consumer groups. Queue names:
| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |
```bash
# Check stream length (pending messages)
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue

# Check consumer group lag
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue

# Check pending messages per consumer
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10

# Task registry size
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry

# Task state counts
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasks
```
Consumer groups: `build_bot_consumers`, `orchestrator_group`, `patcher_group`, `index_group`, `tracer_bot_group`.
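To watch every queue at once, the stream keys from the table above can be swept in one loop. A sketch with an injectable length lookup — `queue_depths` and the driver functions are ours, not part of Buttercup:

```bash
# Stream keys mirroring the queue table above.
QUEUES='fuzzer_build_queue fuzzer_build_output_queue fuzzer_crash_queue
confirmed_vulnerabilities_queue orchestrator_download_tasks_queue tasks_ready_queue
patches_queue index_queue index_output_queue traced_vulnerabilities_queue
pov_reproducer_requests_queue pov_reproducer_responses_queue orchestrator_delete_task_queue'

queue_depths() {
  # $1 maps a stream key to its length (injected so the loop is testable).
  local xlen_cmd="$1" q
  for q in $QUEUES; do
    printf '%-40s %s\n' "$q" "$("$xlen_cmd" "$q")"
  done
}

# Real usage (requires a cluster):
# redis_xlen() { kubectl exec -n crs <redis-pod> -- redis-cli XLEN "$1"; }
# queue_depths redis_xlen
```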
## Health Checks
Pods write timestamps to `/tmp/health_check_alive`. The liveness probe checks file freshness.

```bash
# Check health file freshness
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_alive
```

If a pod is restart-looping, the health check file is likely going stale because the main process is blocked (e.g. waiting on Redis, stuck on I/O).
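A quick way to quantify staleness is to compare the file's mtime to now. A sketch — the `age_seconds` helper is ours; the real command assumes GNU `stat` inside the container:

```bash
# Seconds elapsed since the given epoch mtime.
age_seconds() {
  local mtime_epoch="$1"
  echo $(( $(date +%s) - mtime_epoch ))
}

# Usage (requires a cluster; GNU stat uses -c %Y, BSD stat uses -f %m):
# mtime=$(kubectl exec -n crs <pod> -- stat -c %Y /tmp/health_check_alive)
# echo "health file is $(age_seconds "$mtime")s old"
# An age larger than the probe period means the file is going stale.
```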
## Telemetry (OpenTelemetry / Signoz)
All services export traces and metrics via OpenTelemetry. If Signoz is deployed (`global.signoz.deployed: true`), use its UI for distributed tracing across services.

```bash
# Check if OTEL is configured
kubectl exec -n crs <pod> -- env | grep OTEL

# Verify Signoz pods are running (if deployed)
kubectl get pods -n platform -l app.kubernetes.io/name=signoz
```
Traces are especially useful for diagnosing slow task processing, identifying which service in a pipeline is the bottleneck, and correlating events across the scheduler -> build-bot -> fuzzer-bot chain.
## Volume and Storage
```bash
# PVC status
kubectl get pvc -n crs

# Check if corpus tmpfs is mounted, its size, and backing type
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null

# Check if CORPUS_TMPFS_PATH is set
kubectl exec -n crs <pod> -- env | grep CORPUS

# Full disk layout - what's on real disk vs tmpfs
kubectl exec -n crs <pod> -- df -h
```
`CORPUS_TMPFS_PATH` is set when `global.volumes.corpusTmpfs.enabled: true`. This affects fuzzer-bot, coverage-bot, seed-gen, and merger-bot.
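Whether the corpus path is actually tmpfs-backed can be read directly off the `mount` output. A sketch — the `backing_of` helper is ours, not part of Buttercup:

```bash
# Classify a mount point as tmpfs, disk, or not-mounted from `mount` output
# read on stdin (format: "device on /path type fstype (options)").
backing_of() {
  awk -v mp="$1" '
    $3 == mp { kind = ($5 == "tmpfs") ? "tmpfs" : "disk"; print kind; found = 1 }
    END      { if (!found) print "not-mounted" }'
}

# Usage (requires a cluster):
# kubectl exec -n crs <pod> -- mount | backing_of /corpus_tmpfs
```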
## Deployment Config Verification
When behavior doesn't match expectations, verify Helm values actually took effect:
```bash
# Check a pod's actual resource limits
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Check a pod's actual volume definitions
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'
```

Helm values template typos (e.g. wrong key names) silently fall back to chart defaults. If deployed resources don't match the values template, check for key name mismatches.
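One way to catch silent fallbacks is to pull the deployed memory limit and compare it to the value you expected from your Helm values. This is a rough grep, not a full JSON parse — the `mem_limit_of` helper is ours, and taking the first match assumes the `limits` entry precedes `requests` in kubectl's output:

```bash
# Extract the first memory value from container resources JSON on stdin,
# e.g. {"limits":{"cpu":"2","memory":"4Gi"},"requests":{...}} -> 4Gi
mem_limit_of() {
  grep -o '"memory":"[^"]*"' | head -n 1 | cut -d'"' -f4
}

# Usage (requires a cluster):
# deployed=$(kubectl get pod -n crs <pod-name> \
#   -o jsonpath='{.spec.containers[0].resources}' | mem_limit_of)
# [ "$deployed" = "4Gi" ] || echo "mismatch: deployed=$deployed (check values key names)"
```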
## Service-Specific Debugging
For detailed per-service symptoms, root causes, and fixes, see `references/failure-patterns.md`.
Quick reference:
- DinD: `kubectl logs -n crs -l app=dind --tail=100` -- look for docker daemon crashes, storage driver errors
- Build-bot: check build queue depth, DinD connectivity, OOM during compilation
- Fuzzer-bot: corpus disk usage, CPU throttling, crash queue backlog
- Patcher: LiteLLM connectivity, LLM timeout, patch queue depth
- Scheduler: the central brain -- `kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"`
## Diagnostic Script
Run the automated triage snapshot:
```bash
bash {baseDir}/scripts/diagnose.sh
```

Pass `--full` to also dump recent logs from all pods:

```bash
bash {baseDir}/scripts/diagnose.sh --full
```

This collects pod status, events, resource usage, Redis health, and queue depths in one pass.