msverl-daily-regression-triage

MSVerl Daily Regression Triage

Use this skill when a fixed daily verl + MindSpeed training job has run and Codex needs to decide whether the result is healthy, whether there is a training failure or an accuracy regression, and which recent commit is the most likely cause.

Defaults

  • Baseline comparison log: /home/st_daily_verl/msverl.log
  • Training log pattern: /home/st_daily_verl/logs/msverl_YYYYMMDD.log
  • verl repo: https://github.com/verl-project/verl.git on main
  • MindSpeed repo: https://gitcode.com/Ascend/MindSpeed.git on master
  • Cache root for temporary clones: /tmp/msverl-skill-cache
  • Time window: from local previous day 00:00:00 to the task execution time
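
As a minimal sketch, the defaults above could be captured in a small Python structure. The names Defaults, training_log_for, and default_time_window are hypothetical and not part of the skill's scripts, and it is an assumption that YYYYMMDD in the log pattern is the local date of the run.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta


@dataclass(frozen=True)
class Defaults:
    """Hypothetical container for the defaults listed above."""
    baseline_log: str = "/home/st_daily_verl/msverl.log"
    training_log_pattern: str = "/home/st_daily_verl/logs/msverl_{day}.log"
    verl_repo: str = "https://github.com/verl-project/verl.git"
    verl_branch: str = "main"
    mindspeed_repo: str = "https://gitcode.com/Ascend/MindSpeed.git"
    mindspeed_branch: str = "master"
    cache_root: str = "/tmp/msverl-skill-cache"


def training_log_for(day: date, defaults: Defaults = Defaults()) -> str:
    # Assumption: YYYYMMDD in the pattern is the local date of the run.
    return defaults.training_log_pattern.format(day=day.strftime("%Y%m%d"))


def default_time_window(now: datetime) -> tuple[datetime, datetime]:
    # From local previous day 00:00:00 up to the task execution time.
    start = datetime.combine(
        now.date() - timedelta(days=1), time.min, tzinfo=now.tzinfo
    )
    return start, now
```

Calling default_time_window(datetime.now()) at task start gives the window that the commit listing step would use unless the user supplies a different one.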

Hard Stop Rules

  • Read the comparison log first.
  • If it contains mean abs diff: and the parsed value is exactly 0, stop and report success.
  • If it contains mean abs diff: and the value is non-zero, classify as accuracy_regression.
  • If it contains error, please check log, classify as train_error.
  • If the comparison log is ambiguous, report unknown and explain what evidence is missing before doing expensive work.
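
The rules above can be sketched as a small classifier. This is an illustration only, not the contents of parse_result_log.py: the function name, the number format accepted after mean abs diff:, the choice to use the last reported value, and the choice to treat a log containing both markers as ambiguous are all assumptions.

```python
import re

# Matches the number after "mean abs diff:". The exact number format in the
# real log is an assumption.
_DIFF_RE = re.compile(
    r"mean abs diff:\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)"
)
_ERROR_MARKER = "error, please check log"


def classify_comparison_log(text: str) -> tuple[str, str]:
    """Return (status, reason) following the hard stop rules."""
    has_error = _ERROR_MARKER in text
    matches = _DIFF_RE.findall(text)
    if has_error and matches:
        # Assumption: both markers together count as ambiguous.
        return "unknown", "both a diff value and the error marker are present"
    if has_error:
        return "train_error", "found 'error, please check log'"
    if matches:
        value = float(matches[-1])  # assumption: the last reported value wins
        if value == 0:
            return "pass", "mean abs diff is exactly 0"
        return "accuracy_regression", f"mean abs diff is {value}"
    return "unknown", "neither 'mean abs diff:' nor the error marker was found"
```

Returning a reason alongside the status keeps the unknown case explainable, since the rule asks for a statement of what evidence is missing.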

Workflow

  1. Run parse_result_log.py on the comparison log.
  2. Stop immediately on pass.
  3. For train_error, run extract_failure_tail.py against the daily training log and keep only the final high-signal error block.
  4. For accuracy_regression, use the parsed reward lists and mean abs diff as the primary evidence.
  5. Sync lightweight local clones with sync_repos.py.
  6. Collect recent commits with list_recent_commits.py for both repositories inside the default time window unless the user gives a different one.
  7. Rank suspects with rank_candidate_commits.py.
  8. Inspect diffs only for the top few commits when titles and touched files are not enough to explain a plausible fix direction.
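
The branching in the steps above can be sketched as a function that maps the parsed status to the scripts that run next. The function name plan_steps is hypothetical, and command-line arguments are left out because they are not specified here. Returning no steps for unknown follows the hard stop rule of reporting the missing evidence before doing expensive work.

```python
def plan_steps(status: str) -> list[str]:
    """Return the scripts to run after parse_result_log.py, in order."""
    if status in ("pass", "unknown"):
        # pass: stop immediately. unknown: report the missing evidence
        # before doing any expensive work.
        return []
    steps = []
    if status == "train_error":
        steps.append("extract_failure_tail.py")
    elif status != "accuracy_regression":
        raise ValueError(f"unexpected status: {status!r}")
    # accuracy_regression reuses the reward lists and mean abs diff that
    # parse_result_log.py already produced, so it needs no extra extraction.
    steps += ["sync_repos.py", "list_recent_commits.py", "rank_candidate_commits.py"]
    return steps
```

Diff inspection (step 8) is deliberately absent from the plan: it is conditional on the ranking output rather than on the status.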

Cost Controls

  • Never load the whole training log unless the tail-based extractor fails twice.
  • Start with the log tail only; prefer the last traceback or last ERROR block.
  • Rank commits using title and touched files before reading diffs.
  • Limit deep diff reading to the top 3 candidates per repository unless the evidence is still weak.
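
A tail-first read might look like the sketch below. It is not the implementation of extract_failure_tail.py: the function names, the 64 KiB default, and the assumption that tracebacks start with Python's standard "Traceback (most recent call last):" header are all illustrative choices.

```python
import os
from typing import Optional


def read_tail(path: str, max_bytes: int = 64 * 1024) -> str:
    """Read at most max_bytes from the end of the file, never the whole log."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - max_bytes))
        return fh.read().decode("utf-8", errors="replace")


def last_error_block(tail: str) -> Optional[str]:
    """Prefer the last traceback; fall back to the last line containing ERROR."""
    start = tail.rfind("Traceback (most recent call last):")
    if start == -1:
        pos = tail.rfind("ERROR")
        if pos == -1:
            return None
        start = tail.rfind("\n", 0, pos) + 1  # beginning of that line
    return tail[start:].rstrip()
```

When last_error_block returns None, a caller could retry once with a larger max_bytes before considering the full log, which is one way to read the "fails twice" condition.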

Expected Output

Return a compact report with:
  • status: pass, train_error, accuracy_regression, or unknown
  • time_window
  • evidence_summary
  • candidate_repo
  • candidate_commits
  • confidence: high, medium, or low
  • fix_direction
When evidence is weak, say so clearly instead of forcing a single-commit claim.
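
One possible shape for this report is sketched below. The field names come from the list above; the class name TriageReport, the field types, the defaults, and the JSON serialization are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field

STATUSES = {"pass", "train_error", "accuracy_regression", "unknown"}
CONFIDENCES = {"high", "medium", "low"}


@dataclass
class TriageReport:
    status: str
    time_window: str
    evidence_summary: str
    candidate_repo: str = ""
    # More than one commit may be listed when the evidence is weak.
    candidate_commits: list[str] = field(default_factory=list)
    confidence: str = "low"
    fix_direction: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the enumerations named in the report spec.
        if self.status not in STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")
        if self.confidence not in CONFIDENCES:
            raise ValueError(f"invalid confidence: {self.confidence!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

Keeping candidate_commits as a list and defaulting confidence to low makes it easy to report several suspects with low confidence instead of forcing a single-commit claim.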

References

  • Run triage_msverl_regression.py for an end-to-end local workflow.
  • Use parse_result_log.py and extract_failure_tail.py separately when validating logs by hand.
  • Use list_recent_commits.py when you need a raw recent-commit inventory without ranking.