Loading...
Loading...
Debugs the Buttercup CRS (Cyber Reasoning System) running on Kubernetes. Use when diagnosing pod crashes, restart loops, Redis failures, resource pressure, disk saturation, DinD issues, or any service misbehavior in the crs namespace. Covers triage, log analysis, queue inspection, and common failure patterns for: redis, fuzzer-bot, coverage-bot, seed-gen, patcher, build-bot, scheduler, task-server, task-downloader, program-model, litellm, dind, tracer-bot, merger-bot, competition-api, pov-reproducer, scratch-cleaner, registry-cache, image-preloader, ui.
npx skill4agent add lv416e/dotfiles debug-buttercupcrscrscrs| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |
# 1. Pod status - look for restarts, CrashLoopBackOff, OOMKilled
kubectl get pods -n crs -o wide
# 2. Events - the timeline of what went wrong
kubectl get events -n crs --sort-by='.lastTimestamp'
# 3. Warnings only - filter the noise
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'# Why did a specific pod restart? Check Last State Reason (OOMKilled, Error, Completed)
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'
# Check actual resource limits vs intended
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
# Crashed container's logs (--previous = the container that died)
kubectl logs -n crs <pod-name> --previous --tail=200
# Current logs
kubectl logs -n crs <pod-name> --tail=200--tail--since=300s--timestampsLast Statedescribe podConnectionErrorConnectionRefusedError--previousredis.exceptions.ConnectionError# All replicas of a service at once
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix
# Stream live
kubectl logs -n crs -l app.kubernetes.io/name=redis -f
# Collect all logs to disk (existing script)
bash deployment/collect-logs.sh# Per-pod CPU/memory
kubectl top pods -n crs
# Node-level
kubectl top nodes
# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node> | grep -A5 Conditions
# Disk usage inside a pod
kubectl exec -n crs <pod> -- df -h
# What's eating disk
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'# Redis pod status
kubectl get pods -n crs -l app.kubernetes.io/name=redis
# Redis logs (AOF warnings, OOM, connection issues)
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200
# Connect to Redis CLI
kubectl exec -n crs <redis-pod> -- redis-cli
# Inside redis-cli: key diagnostics
INFO memory # used_memory_human, maxmemory
INFO persistence # aof_enabled, aof_last_bgrewrite_status, aof_delayed_fsync
INFO clients # connected_clients, blocked_clients
INFO stats # total_connections_received, rejected_connections
CLIENT LIST # see who's connected
DBSIZE # total keys
# AOF configuration
CONFIG GET appendonly # is AOF enabled?
CONFIG GET appendfsync # fsync policy: everysec, always, or no
# What is /data mounted on? (disk vs tmpfs matters for AOF performance)kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |
# Check stream length (pending messages)
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue
# Check consumer group lag
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue
# Check pending messages per consumer
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10
# Task registry size
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry
# Task state counts
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasksbuild_bot_consumersorchestrator_grouppatcher_groupindex_grouptracer_bot_group/tmp/health_check_alive# Check health file freshness
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_aliveglobal.signoz.deployed: true# Check if OTEL is configured
kubectl exec -n crs <pod> -- env | grep OTEL
# Verify Signoz pods are running (if deployed)
kubectl get pods -n platform -l app.kubernetes.io/name=signoz# PVC status
kubectl get pvc -n crs
# Check if corpus tmpfs is mounted, its size, and backing type
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null
# Check if CORPUS_TMPFS_PATH is set
kubectl exec -n crs <pod> -- env | grep CORPUS
# Full disk layout - what's on real disk vs tmpfs
kubectl exec -n crs <pod> -- df -hCORPUS_TMPFS_PATHglobal.volumes.corpusTmpfs.enabled: true# Check a pod's actual resource limits
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
# Check a pod's actual volume definitions
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'kubectl logs -n crs -l app=dind --tail=100kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"bash {baseDir}/scripts/diagnose.sh--fullbash {baseDir}/scripts/diagnose.sh --full