SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
Install:

```bash
npx skill4agent add wanshuiyin/auto-claude-code-research-in-sleep experiment-queue
```

Example manifest (YAML):

```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```

Job lifecycle:

pending → running → completed
↘ failed_oom → pending (after delay) [retry up to N]
↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending

The example above covers the grid N=[64,128,256] × n=[50K,150K,500K,652K]. The queue writes its manifest to `<project>/experiment_queue/<timestamp>/manifest.json`; jobs run in `cwd`, at most `max_parallel` at a time. The scheduler is `tools/queue_manager.py`, launched on the server under `nohup` so it survives SSH disconnects:

```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```

No `screen` session is needed for the manager itself. All job state is persisted to `queue_state.json`, so a quick status check is:

```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```

`/monitor-experiment` can also read the `manifest.json`. Once every job is `completed` or `stuck`, the manager writes `<project>/experiment_queue/<timestamp>/summary.md` and, if `analyze_on_complete: true` is set, hands the outputs to `/analyze-results`.

Instead of listing jobs one by one, a grid can be expanded declaratively:

```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
```

Multi-phase runs chain dependent experiments, e.g. teachers before students:

```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```

`depends_on` keeps every `distill_students` job `pending` until all `train_teachers` jobs are `completed`. If a job's log matches `torch\.OutOfMemoryError: CUDA out of memory`, the job is marked `failed_oom` and, after `oom_retry.delay` seconds, returned to `pending`; once it fails `oom_retry.max_attempts` times it becomes `stuck`. Stale `screen` sessions are detected by cross-checking `screen -ls` against `ps -p`: a session whose process has exited is cleaned up and the job is marked `completed` or `failed_other` from its log. Because `queue_state.json` is persisted, a crashed manager can simply be restarted: jobs recorded as `running` with no live process are reset to `pending`.

Example `summary.md`:

# Experiment Queue Summary
**Project**: dllm_distill
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck
## Phases
| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |
## Results Files
- 42 JSON files in `figures/pcdistill_sw_*.json`
## Next Steps
- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)

## When to use `/run-experiment` vs `experiment-queue`

| Feature | `/run-experiment` | `experiment-queue` |
|---|---|---|
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |
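The declarative grid expansion can be sketched roughly like this. This is a minimal illustration, not the actual `tools/build_manifest.py`; the function name `expand_grid` and the plain-dict job shape are assumptions:

```python
from itertools import product
from string import Template

def expand_grid(grid: dict, template: dict) -> list[dict]:
    """Cross-product every grid axis and fill the job template.

    `grid` maps axis name -> list of values; `template` holds strings
    with ${axis} placeholders, as in the YAML grid example above.
    """
    jobs = []
    axes = list(grid)
    for combo in product(*(grid[a] for a in axes)):
        env = dict(zip(axes, combo))
        jobs.append({
            "id": Template(template["id"]).substitute(env),
            "args": {k: Template(str(v)).substitute(env)
                     for k, v in template["args"].items()},
        })
    return jobs

jobs = expand_grid(
    {"N": [64, 128], "seed": [42, 200]},
    {"id": "s${seed}_N${N}", "args": {"seed": "${seed}", "n_hidden": "${N}"}},
)
# 2 axes × 2 values each → 4 jobs; the first id is "s42_N64"
```

`Template.substitute` raises `KeyError` on an unfilled `${...}` placeholder, which is a reasonable way to catch a template referencing an axis the grid does not define.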
A GPU counts as free when `nvidia-smi` reports `memory.used < 500 MiB` (tune with `gpu_free_threshold_mib`). All progress lives in `queue_state.json`, and `stuck` jobs are left for manual inspection rather than retried.

Related commands and tools: `/experiment-queue`, `/run-experiment`, `/monitor-experiment`, `/analyze-results`, `tools/queue_manager.py`, `tools/build_manifest.py`, `artifact-sync`, `paper-fix-auto-apply`.
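The OOM-aware retry portion of the lifecycle can be sketched as follows. This is a simplified illustration of the state transitions, not the real `tools/queue_manager.py`; the `on_job_exit` helper and the dict-based job record are assumptions:

```python
import re
import time

# Log signature the queue looks for, per the docs above.
OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")

def on_job_exit(job: dict, log_text: str, *,
                max_attempts: int = 3, delay: int = 120) -> None:
    """Decide the next state for a job whose process just exited.

    Mirrors the lifecycle diagram: OOM failures go back to `pending`
    after a cooldown; other failures become `stuck` for manual review.
    """
    if job.get("exit_code", 1) == 0:
        job["status"] = "completed"
    elif OOM_PATTERN.search(log_text):
        job["oom_attempts"] = job.get("oom_attempts", 0) + 1
        if job["oom_attempts"] >= max_attempts:
            job["status"] = "stuck"        # retries exhausted
        else:
            job["status"] = "failed_oom"   # requeue after cooldown
            job["retry_at"] = time.time() + delay
    else:
        job["status"] = "stuck"            # failed_other → manual inspection

job = {"id": "s200_N128_n50K", "exit_code": 1}
on_job_exit(job, "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate ...")
# job["status"] is now "failed_oom" with oom_attempts == 1
```

A scheduler loop would then move `failed_oom` jobs back to `pending` once `time.time()` passes `retry_at`, which is what makes the delay a cooldown rather than a blocking sleep.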