eval-triggers

IMPORTANT: This skill is expensive (makes many LLM API calls). Only run when the user explicitly asks for it. Never run proactively.

Before running, ask the user which eval source to run, unless they already specified it in `$ARGUMENTS`:

- Claude only: uses `run_claude_trigger_eval.py`
- Codex only: uses `run_codex_trigger_eval.py`
- Both: runs both runners sequentially
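
The source choice maps to one or two runner invocations. A minimal sketch of that dispatch, assuming a hypothetical helper named `runner_commands`; the script paths and flags are the ones this document uses, and `--source combined` is passed only to the Claude runner because that is the only place this document shows it:

```python
# Script paths come from this document; the helper itself is hypothetical.
RUNNERS = {
    "claude": "evals/scripts/run_claude_trigger_eval.py",
    "codex": "evals/scripts/run_codex_trigger_eval.py",
}


def runner_commands(source, output_dir, repo_root="."):
    """Return the argv lists to run, in order, for 'claude', 'codex', or 'both'."""
    if source == "both":
        selected = ["claude", "codex"]  # sequential: Claude first, then Codex
    elif source in RUNNERS:
        selected = [source]
    else:
        raise ValueError(f"unknown eval source: {source!r}")

    commands = []
    for name in selected:
        argv = [
            "python3", RUNNERS[name],
            "--repo-root", repo_root,
            "--output-dir", output_dir,
        ]
        if name == "claude":
            # Only the Claude command in this document carries this flag.
            argv += ["--source", "combined"]
        commands.append(argv)
    return commands
```

Each returned list could be handed to `subprocess.run(argv, check=True)` one after another, so that a failure in the first runner stops the second.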

Pick a descriptive label for the run. If the user does not provide one, default to today's date.
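
One way to turn the label into the run directory used in the steps that follow. This is a sketch under two assumptions not stated in this document: the label is slugified (lowercased, with non-alphanumeric runs turned into hyphens), and a missing label yields a directory named with the date alone. The helper name `run_dir` is hypothetical.

```python
import re
from datetime import date
from typing import Optional


def run_dir(label: Optional[str] = None, today: Optional[date] = None) -> str:
    """Build evals/results/adHocRuns/<date>-<label>, or just <date> without a label."""
    stamp = (today or date.today()).isoformat()  # YYYY-MM-DD
    # Assumption: keep the directory name shell-friendly by slugifying the label.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", label or "").strip("-").lower()
    name = f"{stamp}-{slug}" if slug else stamp
    return f"evals/results/adHocRuns/{name}"
```

The `today` parameter exists only so the function can be called with a fixed date in tests.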

Step 1 — Classify:

For Claude:

```bash
python3 evals/scripts/run_claude_trigger_eval.py \
  --repo-root . \
  --output-dir evals/results/adHocRuns/<date>-<label>/raw \
  --source combined
```

For Codex:

```bash
python3 evals/scripts/run_codex_trigger_eval.py \
  --repo-root . \
  --output-dir evals/results/adHocRuns/<date>-<label>/raw
```

Step 2 — Analyze:

```bash
python3 evals/scripts/analyze_trigger_results.py \
  --results-dir evals/results/adHocRuns/<date>-<label>/raw \
  --output-json evals/results/adHocRuns/<date>-<label>/analysis.json \
  --output-markdown evals/results/adHocRuns/<date>-<label>/analysis.md
```

Report the key metrics: recall, precision, F1, and any false negatives/positives.
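
For reference, the standard definitions of these metrics can be sketched as follows. This is not taken from `analyze_trigger_results.py`, whose internals are not shown here; it assumes a "positive" is a prompt where the skill should trigger, and the function name `trigger_metrics` is hypothetical.

```python
def trigger_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from trigger counts.

    tp: skill triggered and should have
    fp: skill triggered but should not have (false positive)
    fn: skill should have triggered but did not (false negative)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Returning 0.0 when a denominator is zero is a design choice made here to avoid a division error on empty runs; other tools report such cases as undefined.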

Step 3 — Compare (optional):

If the user provides a baseline path, or there is a recent baseline in `evals/results/adHocRuns/`, compare:

```bash
python3 evals/scripts/compare_trigger_runs.py \
  --baseline <baseline>/analysis.json \
  --candidate evals/results/adHocRuns/<date>-<label>/analysis.json \
  --output-markdown evals/results/adHocRuns/<date>-<label>/comparison.md \
  --output-json evals/results/adHocRuns/<date>-<label>/comparison.json
```

Summarize regressions and improvements.
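
When writing that summary, it can help to sort metric changes into regressions and improvements. The sketch below assumes each run's metrics have already been extracted into a flat `{name: value}` dict where higher is better; the actual schema of `analysis.json` and `comparison.json` is not described in this document, and both the helper `metric_deltas` and its `tolerance` threshold are hypothetical.

```python
def metric_deltas(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Split shared metrics into regressions and improvements.

    Changes whose absolute size is within `tolerance` are treated as noise
    and reported in neither bucket.
    """
    report = {"regressions": {}, "improvements": {}}
    # Only compare metrics present in both runs.
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[name] - baseline[name]
        if delta < -tolerance:
            report["regressions"][name] = round(delta, 4)
        elif delta > tolerance:
            report["improvements"][name] = round(delta, 4)
    return report
```

For example, a candidate whose recall drops from 0.90 to 0.85 while precision rises from 0.80 to 0.88 would list recall under regressions and precision under improvements.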
