cost-benchmark
Cost Benchmark
Runs `scripts/bench.mjs` against the structural+adversarial corpus and writes per-case + summary results to `docs/benchmarks/runs/`. This is the verification gate that backs every measurable claim in `cost-booster-edit` / `cost-booster-route`.
When to use
- Before publishing a release: verify the booster win rate didn't regress.
- After expanding `bench/booster-corpus.json`: confirm new cases route correctly.
- When auditing a "claimed upstream" tag: flip it to "verified" once the bench supports it.
- On a cost question ("is Sonnet 4.6 cheaper than Opus 4.7 for these tasks?"): re-run with `BENCH_ANTHROPIC=1`.
Steps
1. Run the bench from `v3/` (where `agent-booster` resolves):

   ```bash
   ( cd v3 && node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )                        # booster only: free, ~85 ms
   ( cd v3 && BENCH_LLM_BASELINE=1 node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )   # + Gemini 2.0 Flash (cheap)
   ( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \
       node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )                               # + Sonnet 4.6 + Opus 4.7
   ```

2. Inspect the markdown summary printed to stdout. The gate metric is `winRate` (Tier 1 cases). Adversarial cases are tracked separately as `escalationRate`.
3. Persisted output lands at:
   - `docs/benchmarks/runs/latest.json`: pointer to the most recent run
   - `docs/benchmarks/runs/<ISO-timestamp>.json`: historical record
4. Read it back in subsequent skills (e.g. `cost-report` step 2 reads `latest.json` for live tier-spend numbers).
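
Reading a run back in a later skill could look roughly like the sketch below. It is only an illustration: the function name `parseRun` is hypothetical, and it assumes `winRate` and `escalationRate` sit at the top level of the persisted JSON, which this document does not confirm. In real use the text would come from something like `fs.readFileSync("docs/benchmarks/runs/latest.json", "utf8")`.

```javascript
// Hypothetical reader for a persisted bench run.
// ASSUMPTION: `winRate` and `escalationRate` are top-level numeric fields;
// the actual layout written by bench.mjs may differ.
function parseRun(jsonText) {
  const run = JSON.parse(jsonText);
  for (const key of ["winRate", "escalationRate"]) {
    if (typeof run[key] !== "number") {
      throw new Error(`run is missing numeric field "${key}"`);
    }
  }
  return { winRate: run.winRate, escalationRate: run.escalationRate };
}

// Inline payload with made-up numbers (not real bench results).
const sampleRun = parseRun('{"winRate": 0.9, "escalationRate": 0.25}');
console.log(`winRate=${sampleRun.winRate} escalationRate=${sampleRun.escalationRate}`);
```

Failing loudly on a missing field keeps a downstream skill from silently reporting numbers from a run whose shape has changed.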
Smoke gates
- `winRate ≥ 0.80` on Tier 1 cases (smoke step 23). Lower the threshold by editing `scripts/smoke.sh`.
- `escalationRate` is reported but ungated; adversarial cases are diagnostic.
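
The two gates above amount to a single threshold check plus an informational line. A minimal sketch, assuming a summary object with numeric `winRate` and `escalationRate` fields; the function `checkSmokeGate` and its return shape are hypothetical and are not taken from `scripts/smoke.sh`:

```javascript
// Hypothetical gate check mirroring the rules above:
//   - winRate must be >= threshold (0.80 by default) or the gate fails
//   - escalationRate is reported but never affects the result
function checkSmokeGate(summary, threshold = 0.8) {
  const passed = summary.winRate >= threshold;
  return {
    passed,
    exitCode: passed ? 0 : 1, // non-zero would fail a shell-based smoke step
    report: [
      `winRate=${summary.winRate.toFixed(2)} (threshold ${threshold.toFixed(2)}): ${passed ? "PASS" : "FAIL"}`,
      `escalationRate=${summary.escalationRate.toFixed(2)} (reported, ungated)`,
    ],
  };
}

// Made-up numbers for illustration only.
const gate = checkSmokeGate({ winRate: 0.75, escalationRate: 0.4 });
console.log(gate.report.join("\n"));
```

Keeping `escalationRate` out of the pass/fail decision matches its diagnostic role: a high value is worth reading, but it does not block a release.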
Env overrides
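| Env var | Default | Purpose |
|---|---|---|
| `BENCH_LLM_BASELINE` | unset | Set to `1` to add the Gemini 2.0 Flash baseline |
| | | Override the OpenAI-compat model |
| | Gemini OpenAI shim | Override endpoint |
| `BENCH_ANTHROPIC` | unset | Set to `1` to add Sonnet 4.6 + Opus 4.7 |
| | | Comma-separated Claude IDs |
| | timestamped file | Override output path |
| | unset | Suppress markdown summary |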

API keys are auto-pulled from `gcloud secrets` (`GOOGLE_AI_API_KEY`, `ANTHROPIC_API_KEY`); override with `BENCH_LLM_API_KEY` / `BENCH_ANTHROPIC_API_KEY`.
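
The override order for API keys can be pictured as a simple fallback chain. The sketch below is an assumption-laden illustration: `resolveApiKeys` is a hypothetical helper, the `secrets` argument stands in for whatever was fetched from `gcloud secrets`, and the pairing of each `BENCH_*` override with its secret is inferred from the order in which they are listed above.

```javascript
// Hypothetical key resolution: an explicit BENCH_* override wins,
// otherwise fall back to the value pulled from gcloud secrets.
// ASSUMPTION: BENCH_LLM_API_KEY pairs with GOOGLE_AI_API_KEY and
// BENCH_ANTHROPIC_API_KEY pairs with ANTHROPIC_API_KEY.
function resolveApiKeys(env, secrets) {
  return {
    llm: env.BENCH_LLM_API_KEY || secrets.GOOGLE_AI_API_KEY || null,
    anthropic: env.BENCH_ANTHROPIC_API_KEY || secrets.ANTHROPIC_API_KEY || null,
  };
}

// Placeholder values only; no real keys.
const keys = resolveApiKeys(
  { BENCH_LLM_API_KEY: "override-key" },
  { GOOGLE_AI_API_KEY: "secret-google", ANTHROPIC_API_KEY: "secret-anthropic" }
);
console.log(keys.llm, keys.anthropic);
```

Returning `null` for a key that is missing from both sources lets the caller skip that baseline rather than send an unauthenticated request.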
Cross-references
ADR-0002 §"Decision 1" / §"Riskiest assumption" · `cost-booster-edit/SKILL.md` (verification table consumes this skill's output) · `cost-report/SKILL.md` step 2 (reads `runs/latest.json`).