eval-triggers

IMPORTANT: This skill is expensive (makes many LLM API calls). Only run when the user explicitly asks for it. Never run proactively.
Before running, ask the user which eval source to run, unless they already specified it in $ARGUMENTS:
  • Claude only: uses run_claude_trigger_eval.py
  • Codex only: uses run_codex_trigger_eval.py
  • Both: runs both runners sequentially
Pick a descriptive label for the run. Default to today's date if none provided.
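The labeling rule above can be sketched as a small helper that builds the run directory used in the later steps. This is only an illustration: the function name `run_dir`, the ISO `YYYY-MM-DD` date format, and the choice to use the bare date as the directory name when no label is given are all assumptions, not something the eval scripts define.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def run_dir(label: Optional[str] = None, today: Optional[date] = None) -> Path:
    """Return evals/results/adHocRuns/<date>-<label> (hypothetical helper).

    Assumption: with no label, the directory name is just today's date.
    """
    day = (today or date.today()).isoformat()  # assumed YYYY-MM-DD format
    name = f"{day}-{label}" if label else day
    return Path("evals/results/adHocRuns") / name
```

For example, `run_dir("desc-tweak") / "raw"` would give the path to pass as `--output-dir` in Step 1.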
Step 1 — Classify:
For Claude:
bash
python3 evals/scripts/run_claude_trigger_eval.py \
  --repo-root . \
  --output-dir evals/results/adHocRuns/<date>-<label>/raw \
  --source combined
For Codex:
bash
python3 evals/scripts/run_codex_trigger_eval.py \
  --repo-root . \
  --output-dir evals/results/adHocRuns/<date>-<label>/raw
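For the "Both" option, one way to run the two runners sequentially is a short Python wrapper. This is a sketch under assumptions: the names `build_commands` and `run_all` and the `target` values `"claude"`, `"codex"`, and `"both"` are hypothetical; only the script paths and flags are taken from the commands above.

```python
import subprocess
from typing import List


def build_commands(out_dir: str, target: str = "both") -> List[List[str]]:
    """Build the Step 1 command lines for the chosen eval source.

    `target` is a hypothetical selector: "claude", "codex", or "both".
    """
    claude = [
        "python3", "evals/scripts/run_claude_trigger_eval.py",
        "--repo-root", ".",
        "--output-dir", out_dir,
        "--source", "combined",
    ]
    codex = [
        "python3", "evals/scripts/run_codex_trigger_eval.py",
        "--repo-root", ".",
        "--output-dir", out_dir,
    ]
    selected = {"claude": [claude], "codex": [codex], "both": [claude, codex]}
    if target not in selected:
        raise ValueError(f"unknown target: {target!r}")
    return selected[target]


def run_all(out_dir: str, target: str = "both") -> None:
    """Run the selected runners one after another, stopping on the first failure."""
    for cmd in build_commands(out_dir, target):
        subprocess.run(cmd, check=True)  # check=True raises CalledProcessError on non-zero exit
```

Because `check=True` aborts on the first failing runner, a broken Claude run will not be followed by an equally expensive Codex run.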
Step 2 — Analyze:
bash
python3 evals/scripts/analyze_trigger_results.py \
  --results-dir evals/results/adHocRuns/<date>-<label>/raw \
  --output-json evals/results/adHocRuns/<date>-<label>/analysis.json \
  --output-markdown evals/results/adHocRuns/<date>-<label>/analysis.md
Report the key metrics: recall, precision, F1, and any false negatives/positives.
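As a reminder of how these metrics relate, here is a minimal sketch using the standard definitions. It assumes a true positive is a prompt where the skill should trigger and did, a false positive is one where it triggered but should not have, and a false negative is one where it should have triggered but did not. The function name `trigger_metrics` is hypothetical, and analyze_trigger_results.py may compute or round these values differently.

```python
from typing import Dict


def trigger_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from raw counts (standard definitions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of triggers that were correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of expected triggers that fired
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With 8 true positives, 2 false positives, and 2 false negatives, precision, recall, and F1 all come out at 0.8.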
Step 3 — Compare (optional):
If the user provides a baseline path, or there is a recent baseline in evals/results/adHocRuns/, compare:
bash
python3 evals/scripts/compare_trigger_runs.py \
  --baseline <baseline>/analysis.json \
  --candidate evals/results/adHocRuns/<date>-<label>/analysis.json \
  --output-markdown evals/results/adHocRuns/<date>-<label>/comparison.md \
  --output-json evals/results/adHocRuns/<date>-<label>/comparison.json
Summarize regressions and improvements.
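The summary step amounts to sorting metric deltas into regressions and improvements. The sketch below shows that idea on plain dictionaries. It assumes a flat mapping of metric name to score where higher is better; the real analysis.json and comparison.json schemas are not shown here, so the function name `diff_metrics`, the keys, and the `tolerance` parameter are all hypothetical.

```python
from typing import Dict, Tuple


def diff_metrics(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split metric deltas (candidate minus baseline) into regressions and improvements.

    Only metrics present in both runs are compared; higher is assumed to be better.
    """
    regressions: Dict[str, float] = {}
    improvements: Dict[str, float] = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta < -tolerance:
            regressions[name] = delta
        elif delta > tolerance:
            improvements[name] = delta
    return regressions, improvements
```

A small non-zero `tolerance` can help avoid reporting run-to-run noise as a regression, since LLM-based classification is rarely fully deterministic.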