cost-trend

Cost Trend

The smoke gate is binary (winRate ≥ 0.80 → pass/fail). The corpus benchmarks captured over time form a curve, and curves catch regressions the gate misses: a win rate slowly creeping from 100% to 85% is "still passing" by smoke but is a real degradation.
This skill reads every persisted run in docs/benchmarks/runs/*.json and reports first→last deltas plus a per-run series, flagging regressions in win rate or latency.
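
The difference between the binary gate and the trend view can be sketched in a few lines. This is an illustrative sketch, not the code of trend.mjs: the record shape (timestamp, winRate), the sample values, and the helper names passesGate and firstLastDelta are all assumptions, and the runs are assumed to be already parsed from their JSON files.

```javascript
// Sketch only: the record shape (timestamp, winRate) is an assumption,
// not the actual schema of docs/benchmarks/runs/*.json.
const GATE_THRESHOLD = 0.8; // the smoke gate's winRate floor

// Hypothetical runs with the win rate creeping down over time.
const sampleRuns = [
  { timestamp: "2025-01-05T00:00:00Z", winRate: 1.0 },
  { timestamp: "2025-02-05T00:00:00Z", winRate: 0.95 },
  { timestamp: "2025-03-05T00:00:00Z", winRate: 0.9 },
  { timestamp: "2025-04-05T00:00:00Z", winRate: 0.85 },
];

// Binary view: does a single run clear the gate?
function passesGate(run) {
  return run.winRate >= GATE_THRESHOLD;
}

// Trend view: compare the oldest and newest run on one numeric field.
function firstLastDelta(runs, field) {
  if (runs.length === 0) throw new Error("no runs to compare");
  const sorted = [...runs].sort((a, b) =>
    a.timestamp.localeCompare(b.timestamp)
  );
  const first = sorted[0][field];
  const last = sorted[sorted.length - 1][field];
  return { first, last, delta: last - first };
}

const allPass = sampleRuns.every(passesGate);
const winRateTrend = firstLastDelta(sampleRuns, "winRate");
console.log({ allPass, winRateTrend });
```

Every sample run clears the gate, yet the first→last delta on winRate is negative, which is the kind of slow drift the trend report is meant to surface.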

When to use

  • Before a release: check that the speedup hasn't drifted.
  • After expanding the corpus: verify that the win rate older runs achieved still holds on the new corpus.
  • After upgrading agent-booster: surface latency / strategy changes.

Steps

  1. Run the trend script from the project root:
    node plugins/ruflo-cost-tracker/scripts/trend.mjs
    Optional env:
    • TREND_FORMAT=json: emit JSON instead of markdown
    • TREND_LIMIT=10: consider only the most recent N runs (here, 10)
  2. Inspect the drift summary: first vs last on win rate, avg latency, p99, escalation rate, speedup vs Gemini.
  3. Inspect the per-run series: one row per run, including Sonnet 4.6 + Opus 4.7 baseline latencies if those were enabled (BENCH_ANTHROPIC=1 at run time).
  4. Regression flags: the script emits > ⚠ Regression callouts when:
    • Win rate dropped between first and last run
    • Avg latency rose ≥ 1.5× from first run
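
The two regression conditions in step 4, together with the TREND_LIMIT window from step 1, might be expressed roughly as follows. This is a hedged sketch rather than the script's actual logic: the field names winRate and avgLatencyMs, the helper names, and the message text after "Regression" are assumptions, and runs are assumed to be ordered oldest first.

```javascript
// Sketch only: field names (winRate, avgLatencyMs) and the message text after
// "Regression" are assumptions; the two conditions come from step 4 above.
const LATENCY_REGRESSION_FACTOR = 1.5;

// Keep only the most recent N runs when TREND_LIMIT is a positive integer.
// Assumes `runs` is ordered oldest first.
function applyTrendLimit(runs, env) {
  const limit = Number.parseInt(env.TREND_LIMIT ?? "", 10);
  return Number.isInteger(limit) && limit > 0 ? runs.slice(-limit) : runs;
}

// Compare the first and last run in the window and collect callouts.
function regressionFlags(runs) {
  const flags = [];
  if (runs.length < 2) return flags;
  const first = runs[0];
  const last = runs[runs.length - 1];
  if (last.winRate < first.winRate) {
    flags.push(`> ⚠ Regression: win rate ${first.winRate} → ${last.winRate}`);
  }
  if (last.avgLatencyMs >= LATENCY_REGRESSION_FACTOR * first.avgLatencyMs) {
    flags.push(
      `> ⚠ Regression: avg latency ${first.avgLatencyMs}ms → ${last.avgLatencyMs}ms`
    );
  }
  return flags;
}

// Hypothetical usage with made-up numbers.
const exampleRuns = [
  { winRate: 1.0, avgLatencyMs: 10 },
  { winRate: 0.9, avgLatencyMs: 12 },
  { winRate: 0.85, avgLatencyMs: 15 },
];
console.log(regressionFlags(applyTrendLimit(exampleRuns, process.env)));
```

In this sketch the "first" run is the oldest run inside the TREND_LIMIT window, so narrowing the window can change which regressions are flagged; whether the real script behaves the same way is not stated in the text above.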

Cross-references

  • cost-benchmark: the producer of the run JSONs this skill consumes
  • bench/booster-corpus.json: the corpus version is recorded in each run, so trends across corpus versions remain interpretable
  • docs/benchmarks/runs/latest.json: the most-recent run; smoke step 23 gates on winRate ≥ 0.80 from this file
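
Because each run records its corpus version, one way to keep a trend interpretable is to compare runs only within the same version. The sketch below illustrates that idea under stated assumptions: corpusVersion and winRate are hypothetical field names, the helper winRateDeltaByCorpusVersion is not part of the plugin, and runs are assumed to be ordered oldest first.

```javascript
// Sketch only: `corpusVersion` and `winRate` are assumed field names.
// Groups runs by corpus version, then compares first and last within each group.
function winRateDeltaByCorpusVersion(runs) {
  const groups = new Map();
  for (const run of runs) {
    const bucket = groups.get(run.corpusVersion) ?? [];
    bucket.push(run);
    groups.set(run.corpusVersion, bucket);
  }

  const deltas = new Map();
  for (const [version, bucket] of groups) {
    const first = bucket[0].winRate;
    const last = bucket[bucket.length - 1].winRate;
    deltas.set(version, {
      runs: bucket.length,
      first,
      last,
      dropped: last < first,
    });
  }
  return deltas;
}

// Hypothetical data: the win rate is stable on v1 and slips on v2.
const versionedRuns = [
  { corpusVersion: "v1", winRate: 1.0 },
  { corpusVersion: "v1", winRate: 1.0 },
  { corpusVersion: "v2", winRate: 0.9 },
  { corpusVersion: "v2", winRate: 0.85 },
];
console.log(winRateDeltaByCorpusVersion(versionedRuns));
```

Splitting by version separates a drop caused by a harder corpus from a drop that happens while the corpus stays fixed.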