SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
Install:

```bash
npx skill4agent add wanshuiyin/auto-claude-code-research-in-sleep experiment-queue
```

Example manifest (YAML):

```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```

Job lifecycle:

pending → running → completed
↘ failed_oom → pending (after delay) [retry up to N]
↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending

The example above covers the grid N=[64,128,256] × n=[50K,150K,500K,652K]. The queue writes its manifest to `<project>/experiment_queue/<timestamp>/manifest.json`; jobs run in `cwd`, at most `max_parallel` at a time. The scheduler is `tools/queue_manager.py`, launched on the server under `nohup` so it survives SSH disconnects:

```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```

No `screen` session is needed for the manager itself. All job state is persisted to `queue_state.json`, so a quick status check is:

```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```

`/monitor-experiment` can also read the `manifest.json`. Once every job is `completed` or `stuck`, the manager writes `<project>/experiment_queue/<timestamp>/summary.md` and, if `analyze_on_complete: true` is set, hands the outputs to `/analyze-results`.

Instead of listing jobs one by one, a grid can be expanded declaratively:

```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
```

Multi-phase runs chain dependent experiments, e.g. teachers before students:

```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```

`depends_on` keeps every `distill_students` job `pending` until all `train_teachers` jobs are `completed`. If a job's log matches `torch\.OutOfMemoryError: CUDA out of memory`, the job is marked `failed_oom` and, after `oom_retry.delay` seconds, returned to `pending`; once it fails `oom_retry.max_attempts` times it becomes `stuck`. Stale `screen` sessions are detected by cross-checking `screen -ls` against `ps -p`: a session whose process has exited is cleaned up and the job is marked `completed` or `failed_other` from its log. Because `queue_state.json` is persisted, a crashed manager can simply be restarted: jobs recorded as `running` with no live process are reset to `pending`.

Example `summary.md`:

# Experiment Queue Summary
**Project**: dllm_distill
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck
## Phases
| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |
## Results Files
- 42 JSON files in `figures/pcdistill_sw_*.json`
## Next Steps
- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)

## When to use `/run-experiment` vs `experiment-queue`

| Feature | `/run-experiment` | `experiment-queue` |
|---|---|---|
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |
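The declarative grid expansion can be sketched roughly like this. This is a minimal illustration, not the actual `tools/build_manifest.py`; the function name `expand_grid` and the plain-dict job shape are assumptions:

```python
from itertools import product
from string import Template

def expand_grid(grid: dict, template: dict) -> list[dict]:
    """Cross-product every grid axis and fill the job template.

    `grid` maps axis name -> list of values; `template` holds strings
    with ${axis} placeholders, as in the YAML grid example above.
    """
    jobs = []
    axes = list(grid)
    for combo in product(*(grid[a] for a in axes)):
        env = dict(zip(axes, combo))
        jobs.append({
            "id": Template(template["id"]).substitute(env),
            "args": {k: Template(str(v)).substitute(env)
                     for k, v in template["args"].items()},
        })
    return jobs

jobs = expand_grid(
    {"N": [64, 128], "seed": [42, 200]},
    {"id": "s${seed}_N${N}", "args": {"seed": "${seed}", "n_hidden": "${N}"}},
)
# 2 axes × 2 values each → 4 jobs; the first id is "s42_N64"
```

`Template.substitute` raises `KeyError` on an unfilled `${...}` placeholder, which is a reasonable way to catch a template referencing an axis the grid does not define.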
A GPU counts as free when `nvidia-smi` reports `memory.used < 500 MiB` (tune with `gpu_free_threshold_mib`). All progress lives in `queue_state.json`, and `stuck` jobs are left for manual inspection rather than retried.

Related commands and tools: `/experiment-queue`, `/run-experiment`, `/monitor-experiment`, `/analyze-results`, `tools/queue_manager.py`, `tools/build_manifest.py`, `artifact-sync`, `paper-fix-auto-apply`.
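The OOM-aware retry portion of the lifecycle can be sketched as follows. This is a simplified illustration of the state transitions, not the real `tools/queue_manager.py`; the `on_job_exit` helper and the dict-based job record are assumptions:

```python
import re
import time

# Log signature the queue looks for, per the docs above.
OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")

def on_job_exit(job: dict, log_text: str, *,
                max_attempts: int = 3, delay: int = 120) -> None:
    """Decide the next state for a job whose process just exited.

    Mirrors the lifecycle diagram: OOM failures go back to `pending`
    after a cooldown; other failures become `stuck` for manual review.
    """
    if job.get("exit_code", 1) == 0:
        job["status"] = "completed"
    elif OOM_PATTERN.search(log_text):
        job["oom_attempts"] = job.get("oom_attempts", 0) + 1
        if job["oom_attempts"] >= max_attempts:
            job["status"] = "stuck"        # retries exhausted
        else:
            job["status"] = "failed_oom"   # requeue after cooldown
            job["retry_at"] = time.time() + delay
    else:
        job["status"] = "stuck"            # failed_other → manual inspection

job = {"id": "s200_N128_n50K", "exit_code": 1}
on_job_exit(job, "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate ...")
# job["status"] is now "failed_oom" with oom_attempts == 1
```

A scheduler loop would then move `failed_oom` jobs back to `pending` once `time.time()` passes `retry_at`, which is what makes the delay a cooldown rather than a blocking sleep.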