Periodically check WandB metrics during training to catch problems early (NaN losses, divergence, idle GPUs), so broken runs don't burn GPU hours. Use when training is running and you want automated health checks.
Install:

```
npx skill4agent add wanshuiyin/auto-claude-code-research-in-sleep training-check
```

The skill takes a WandB run path (`<entity>/<project>/<run_id>`) as input.

**Step 1: Gather metrics.** Pull the run history from WandB:

```python
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```

Optionally tail the training log on the server:

```shell
ssh server "tail -100 /path/to/training.log"
```

**Step 2: Judge the signals.**

| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | → Step 3 (Codex judgment) |
| Metrics noisy, can't tell trend | Unsure | → Step 3 (Codex judgment) |
| Slightly worse than baseline but still early | Unsure | → Step 3 (Codex judgment) |
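
The clearly-bad rows of the table can be checked mechanically before anything is escalated to Codex. A minimal sketch, where the function name and the "increasing for >N steps" threshold are illustrative assumptions rather than part of the skill:

```python
import math

def judge_signals(losses, diverge_steps=5):
    """Classify recent training losses per the signal table.

    Returns "STOP" for clearly bad, "CONTINUE" for clearly fine,
    and "ESCALATE" for ambiguous cases that go to Codex judgment.
    Thresholds are illustrative assumptions.
    """
    # Clearly bad: any NaN/Inf in the loss history.
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "STOP"
    # Clearly bad: loss strictly increasing for more than N consecutive steps.
    tail = losses[-(diverge_steps + 1):]
    if len(tail) > diverge_steps and all(b > a for a, b in zip(tail, tail[1:])):
        return "STOP"
    # Clearly fine: loss trending down overall.
    if losses[-1] < losses[0]:
        return "CONTINUE"
    # Flat or noisy: hand off to Codex for judgment.
    return "ESCALATE"
```

Only the ambiguous `"ESCALATE"` outcome needs the more expensive Codex call, which keeps routine healthy checks cheap.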
**Step 3: Codex judgment (ambiguous cases).** Ask Codex via MCP:

```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    TRAINING HEALTH CHECK — need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]
    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```

**Step 4: Act on the decision.**

| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Will be invoked again at next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (don't increase). |
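
Codex's reply arrives as free text, and a Stop decision should leave a record behind. Both steps can be sketched as below; the reply-parsing heuristic, the note-file path, and the JSON record schema are assumptions for illustration:

```python
import json
from datetime import datetime, timezone

def parse_verdict(reply):
    """Pick the verdict out of Codex's free-text reply, checking STOP first.
    Falls back to WAIT (check again soon) when no verdict is found."""
    text = reply.upper()
    for verdict in ("STOP", "CONTINUE", "WAIT"):
        if verdict in text:
            return verdict
    return "WAIT"

def log_stop(run_url, metrics, reason, path="project_notes.jsonl"):
    """Append the WandB run URL, key metrics, and stop reason to project notes."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "run_url": run_url,
        "metrics": metrics,
        "reason": reason,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Defaulting an unparseable reply to WAIT is the conservative choice: it keeps the short check interval rather than silently continuing or killing a run on garbage output.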
This skill is one layer of a two-layer monitoring setup:

| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog.py | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
After training is confirmed stable, schedule recurring checks:

```
CronCreate (recurring, every 10 minutes initially):
  "Run /training-check for wandb run <entity>/<project>/<run_id>"
```
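
The "increase interval if consistently healthy" rule can be sketched as simple backoff logic. The doubling factor and the 10/60-minute bounds follow the frequencies above; the exact policy is an assumption:

```python
def next_interval(current_min, decision, base_min=10, cap_min=60):
    """Choose the next check interval in minutes.

    CONTINUE widens the interval (doubling, capped at an hour),
    WAIT keeps the current interval without increasing it,
    STOP ends checking entirely.
    """
    if decision == "STOP":
        return None                          # training killed; no further checks
    if decision == "WAIT":
        return current_min                   # keep the short interval, don't increase
    return min(cap_min, current_min * 2)     # CONTINUE: back off toward the cap
```

Starting at the 10-minute base and doubling on each healthy check reaches the hourly cap after three CONTINUE verdicts, matching the 10-60 minute range in the monitoring-layers table.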