monitor

Job Monitor

Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, model deployment, or raw SLURM jobs.

When to use

- Auto-monitor: another skill (PTQ, evaluation, deployment) has just submitted a job. Register the job and set up monitoring immediately.
- User-initiated: the user asks about a job's status, possibly in a new conversation. Check the registry, identify the job, and report.

Job Registry

All active jobs are tracked in `.claude/active_jobs.json`. This file is the single source of truth for what's being monitored.

```json
[
  {
    "type": "nel",
    "id": "<invocation_id or slurm_job_id>",
    "host": "<cluster_hostname>",
    "user": "<ssh_user>",
    "submitted": "YYYY-MM-DD HH:MM",
    "description": "<what this job does>",
    "last_status": "<last known status>"
  }
]
```

`type` is one of: `nel`, `slurm`, `launcher`.
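
A registry entry can be sanity-checked before it is written. The sketch below is one possible check, not part of the skill itself: the field names and the three allowed `type` values come from the example above, while the function name `validate_entry` and the assumption that every field is required are illustrative.

```python
from typing import Any

# Field names and allowed types are taken from the registry example above.
# Assumption: all seven fields are required in every entry.
REQUIRED_FIELDS = ("type", "id", "host", "user", "submitted", "description", "last_status")
ALLOWED_TYPES = {"nel", "slurm", "launcher"}


def validate_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one registry entry (empty if valid)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    job_type = entry.get("type")
    if job_type is not None and job_type not in ALLOWED_TYPES:
        problems.append(f"unknown type: {job_type!r}")
    return problems
```

An empty list means the entry matches the documented shape; anything else can be reported back before the registry is touched.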

On Job Submission

Every time a job is submitted (by any skill or manually):

1. Add an entry to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. Set up a durable recurring cron (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

Always do both steps. Don't try to predict job duration.
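
The registration step could look roughly like the following. This is a minimal sketch under stated assumptions: the helper name `register_job` is hypothetical, and treating `(type, id)` as the unique key for a job is an assumption rather than something the skill specifies. Setting up the cron is not shown because its mechanism depends on the host environment.

```python
import json
from pathlib import Path

REGISTRY_PATH = Path(".claude/active_jobs.json")  # path named in the text above


def register_job(entry: dict, registry_path: Path = REGISTRY_PATH) -> list[dict]:
    """Append a job entry to the registry, creating the file if it is missing."""
    registry_path.parent.mkdir(parents=True, exist_ok=True)
    text = registry_path.read_text().strip() if registry_path.exists() else ""
    jobs = json.loads(text) if text else []
    # Assumption: a job is identified by (type, id), so re-registering is a no-op.
    if not any(j["type"] == entry["type"] and j["id"] == entry["id"] for j in jobs):
        jobs.append(entry)
    registry_path.write_text(json.dumps(jobs, indent=2) + "\n")
    return jobs
```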

On Cron Fire / Status Check

Whether triggered by the cron or by the user asking "check status":

1. Read the registry from `.claude/active_jobs.json`.
2. Check each job using the appropriate method (see below).
3. Report only state changes: compare against `last_status` in the registry.
4. Update `last_status` in the registry.
5. Remove completed jobs: any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED).
6. If the registry is empty, delete the recurring cron.
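
One pass over the registry can be sketched as a pure function. The terminal states are the ones listed above; the function name `run_status_check` and the injected `check` callable (which stands in for the per-type checks described later) are assumptions made for illustration.

```python
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "KILLED"}


def run_status_check(
    jobs: list[dict], check: Callable[[dict], str]
) -> tuple[list[dict], list[str]]:
    """Return (jobs that stay in the registry, messages for state changes only)."""
    remaining, messages = [], []
    for job in jobs:
        status = check(job)
        previous = job.get("last_status")
        if status != previous:
            # Only a change relative to last_status produces a message.
            messages.append(f"{job['type']} job {job['id']}: {previous} -> {status}")
        if status not in TERMINAL_STATES:
            remaining.append({**job, "last_status": status})
    return remaining, messages
```

The caller would write `remaining` back to the registry and, if it is empty, delete the recurring cron.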

How to Check Each Job Type

NEL jobs (`type: nel`)

- Check: `nel status <id>`
- On completion: `nel info <id>` to fetch results
- On failure: `nel info <id> --logs`, then inspect server/client/SLURM logs via SSH
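
The choice among the three commands above can be written as a small lookup. This is only a sketch: the helper name `nel_command` is hypothetical, and it assumes the job's state has already been normalized to the `COMPLETED` and `FAILED` names used elsewhere in this document. It builds the argument list and does not run anything.

```python
from typing import Optional


def nel_command(job_id: str, state: Optional[str] = None) -> list[str]:
    """Pick the NEL command from the list above for the job's last known state."""
    if state == "COMPLETED":
        return ["nel", "info", job_id]  # fetch results
    if state == "FAILED":
        return ["nel", "info", job_id, "--logs"]  # fetch logs for diagnosis
    return ["nel", "status", job_id]  # default: poll the status
```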

Launcher jobs (`type: launcher`)

- Check: Tail the launcher's background output file for key events
- Key events: experiment ID, SLURM job ID, container import, calibration progress, export path, final status
- On failure: Look for `Traceback`, `Error`, or `FAILED` in the output
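
The failure scan can be approximated with a simple pattern match over the launcher output. The three markers are the ones named above; the function name `find_failure_lines` is an assumption, and a plain substring match on `Error` will also hit names such as `RuntimeError`, which is usually what is wanted here.

```python
import re

# The three markers named in the text above.
FAILURE_PATTERN = re.compile(r"Traceback|Error|FAILED")


def find_failure_lines(output: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) for every line containing a failure marker."""
    return [
        (number, line)
        for number, line in enumerate(output.splitlines(), start=1)
        if FAILURE_PATTERN.search(line)
    ]
```

The line numbers make it easier to pull surrounding context from the output file when diagnosing the root cause.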

Raw SLURM jobs (`type: slurm`)

- Check: `ssh <host> "squeue -j <id> -h -o '%T %M %R'"` (if the output is empty, the job has left the queue)
- On completion: `ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n"`
- On failure: Check the job's output log file
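
The output of the two commands above can be parsed along these lines. This is a sketch with hypothetical helper names (`parse_squeue`, `parse_sacct`); it assumes only the first output row matters, which may not hold for job arrays or when `sacct` prints separate rows for job steps. With the format string above, `%T` is the job state, `%M` the time used, and `%R` the pending reason or node list.

```python
from typing import Optional


def parse_squeue(output: str) -> Optional[dict]:
    """Parse `squeue -h -o '%T %M %R'` output; None means the job left the queue."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    # Assumption: only the first row is inspected.
    parts = lines[0].split(maxsplit=2)
    parts += [""] * (3 - len(parts))  # tolerate a missing reason/nodelist column
    state, elapsed, reason = parts
    return {"state": state, "elapsed": elapsed, "reason": reason}


def parse_sacct(output: str) -> Optional[dict]:
    """Parse the first row of `sacct --format=State,ExitCode,Elapsed -n` output."""
    rows = [line.split() for line in output.splitlines() if line.strip()]
    if not rows or len(rows[0]) < 3:
        return None
    tokens = rows[0]
    # sacct marks truncated values with a trailing "+", e.g. "CANCELLED+".
    return {"state": tokens[0].rstrip("+"), "exit_code": tokens[-2], "elapsed": tokens[-1]}
```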

Identifying Jobs (user-initiated, no ID given)

When the user asks about a job without specifying an ID, check in order:

1. `.claude/active_jobs.json`: most reliable, has context
2. `nel ls runs --since 1d`: recent NEL runs
3. `ssh <host> "squeue -u <user>"`: active SLURM jobs
4. `ls -lt tools/launcher/experiments/cicd/ | head -10`: recent launcher experiments
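
The ordered lookup can be expressed as a fallback chain. In this sketch each source is a name paired with a callable that returns candidate job identifiers; the name `identify_jobs`, the `Source` alias, and the choice to stop at the first source that yields anything are all assumptions, not behavior the skill prescribes.

```python
from typing import Callable, Optional

# A source is (label, lookup); lookup returns candidate job identifiers.
Source = tuple[str, Callable[[], list[str]]]


def identify_jobs(sources: list[Source]) -> Optional[tuple[str, list[str]]]:
    """Try each source in order and return the first one that yields candidates."""
    for name, lookup in sources:
        candidates = lookup()
        if candidates:
            return name, candidates
    return None  # nothing found anywhere; ask the user for more detail
```

Because later sources are only queried when earlier ones come up empty, the SSH and filesystem checks are skipped whenever the registry already has an answer.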

Reporting Guidelines

- Report state changes proactively: PENDING → RUNNING, or a job completing
- Aggregate multiple jobs: "2 of 4 completed (MMLU-Pro: 42.3%, GSM8K: 67.1%), 1 running, 1 pending"
- Summarize, don't echo: interpret events ("Calibration complete, exporting checkpoint") rather than repeating raw logs
- On failure, diagnose immediately: check logs and report the root cause without waiting for the user to ask
- Minimize noise: don't report "still running" unless the user is actively asking
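
The aggregated one-line report can be produced with a short helper. This is a sketch only: the function name `summarize_jobs` and the `result` key are hypothetical (the registry format above has no results field), and using `description` as the label for each result is an assumption.

```python
from collections import Counter


def summarize_jobs(jobs: list[dict]) -> str:
    """Build a one-line summary such as '2 of 4 completed (...), 1 running, 1 pending'."""
    completed = [job for job in jobs if job["last_status"] == "COMPLETED"]
    head = f"{len(completed)} of {len(jobs)} completed"
    # Assumption: a finished job may carry a hypothetical "result" string.
    results = [f"{job['description']}: {job['result']}" for job in completed if job.get("result")]
    if results:
        head += f" ({', '.join(results)})"
    others = Counter(job["last_status"] for job in jobs if job["last_status"] != "COMPLETED")
    return ", ".join([head] + [f"{count} {state.lower()}" for state, count in others.items()])
```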
