cost-trend
Cost Trend
The smoke gate is binary (`winRate ≥ 0.80` → pass/fail). The corpus benchmarks captured over time form a curve, and curves catch regressions the gate misses: a win rate slowly slipping from 100% to 85% is "still passing" by smoke but is a real degradation.

This skill reads every persisted run in `docs/benchmarks/runs/*.json` and reports first→last deltas plus a per-run series, flagging regressions in win rate or latency.
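To make the gate-versus-curve distinction concrete, here is a minimal sketch. The run objects and their `winRate` field are assumptions made for illustration; the actual schema of the run JSON files is not shown in this document, and `passesSmokeGate` and `winRateDelta` are hypothetical helper names.

```javascript
// Hypothetical run records, oldest first. Only a `winRate` field is assumed.
const runs = [
  { winRate: 1.0 },
  { winRate: 0.95 },
  { winRate: 0.9 },
  { winRate: 0.85 },
];

// Binary smoke gate: looks at a single run against a fixed threshold.
function passesSmokeGate(run, threshold = 0.8) {
  return run.winRate >= threshold;
}

// Trend view: compares the last run in the series against the first.
function winRateDelta(series) {
  return series[series.length - 1].winRate - series[0].winRate;
}

const latest = runs[runs.length - 1];
console.log("gate passes:", passesSmokeGate(latest));
console.log("win rate fell:", winRateDelta(runs) < 0);
```

With this series the latest run still clears the threshold while the first-to-last delta is negative, which is the case the gate alone cannot see.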
When to use
- Before a release: check that the speedup hasn't drifted.
- After expanding the corpus: verify that the win rate on the new corpus still matches what older runs recorded on the corpus they reflected.
- After upgrading `agent-booster`: surface latency / strategy changes.
Steps
- Run the trend script from the project root:

  ```bash
  node plugins/ruflo-cost-tracker/scripts/trend.mjs
  ```

  Optional env:
  - `TREND_FORMAT=json`: emit JSON instead of markdown
  - `TREND_LIMIT=10`: consider only the most recent N runs
- Inspect the drift summary: first vs last on win rate, avg latency, p99, escalation rate, speedup vs Gemini.
- Inspect the per-run series: one row per run, including Sonnet 4.6 + Opus 4.7 baseline latencies if those were enabled (`BENCH_ANTHROPIC=1` at run time).
- Regression flags: the script emits `> ⚠ Regression` callouts when:
  - Win rate dropped between first and last run
  - Avg latency rose ≥ 1.5× from first run
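The two regression rules and the `TREND_LIMIT` behavior can be sketched as a small pure function. This is not the code in `trend.mjs`: the field names `winRate` and `avgLatencyMs`, the function name `summarizeDrift`, and the flag identifiers are all assumptions chosen for illustration.

```javascript
// A sketch of first-vs-last drift detection under assumed field names.
// `runs` is ordered oldest first; `limit` mimics TREND_LIMIT (most recent N runs).
function summarizeDrift(runs, limit) {
  const series = limit ? runs.slice(-limit) : runs;
  if (series.length === 0) {
    return { runs: 0, winRateDelta: null, latencyRatio: null, flags: [] };
  }

  const first = series[0];
  const last = series[series.length - 1];
  const flags = [];

  // Rule 1: win rate dropped between the first and last run.
  if (last.winRate < first.winRate) {
    flags.push("win-rate-drop"); // hypothetical flag identifier
  }

  // Rule 2: average latency rose to at least 1.5x the first run's value.
  const latencyRatio =
    first.avgLatencyMs > 0 ? last.avgLatencyMs / first.avgLatencyMs : null;
  if (latencyRatio !== null && latencyRatio >= 1.5) {
    flags.push("latency-rise"); // hypothetical flag identifier
  }

  return {
    runs: series.length,
    winRateDelta: last.winRate - first.winRate,
    latencyRatio,
    flags,
  };
}

// Example with made-up numbers.
const sample = [
  { winRate: 1.0, avgLatencyMs: 10 },
  { winRate: 0.9, avgLatencyMs: 12 },
  { winRate: 0.85, avgLatencyMs: 16 },
];
console.log(summarizeDrift(sample));
console.log(summarizeDrift(sample, 2));
```

Note that narrowing the window changes the baseline: with a limit of 2, the "first" run is the second record, so a latency rise measured against the oldest run may no longer cross the 1.5× line.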
Cross-references
- `cost-benchmark`: the producer of the run JSONs this skill consumes
- `bench/booster-corpus.json`: the corpus version is recorded in each run, so trends across corpus versions remain interpretable
- `docs/benchmarks/runs/latest.json`: the most-recent run; smoke step 23 gates on `winRate ≥ 0.80` from this file
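Since each run records its corpus version, one way to keep a trend interpretable is to group runs by that version before comparing them. The sketch below assumes a hypothetical `corpusVersion` field; the real key name in the run JSON is not given here, and `groupByCorpusVersion` is an illustrative helper.

```javascript
// Group run records by an assumed `corpusVersion` field, preserving run order.
function groupByCorpusVersion(runs) {
  const groups = new Map();
  for (const run of runs) {
    const key = run.corpusVersion;
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(run);
  }
  return groups;
}

// Example with made-up records.
const recorded = [
  { corpusVersion: "v1", winRate: 1.0 },
  { corpusVersion: "v1", winRate: 0.95 },
  { corpusVersion: "v2", winRate: 0.9 },
];
for (const [version, group] of groupByCorpusVersion(recorded)) {
  console.log(version, group.map((r) => r.winRate));
}
```

Deltas computed within a group compare like with like; a delta across groups mixes a corpus change with any real regression.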