cost-trend

Cost Trend

The smoke gate is binary (`winRate ≥ 0.80` → pass/fail). The corpus benchmarks captured over time form a curve, and curves catch regressions the gate misses: a win rate slowly creeping from 100% to 85% is "still passing" by smoke but a real degradation.

This skill reads every persisted run in `docs/benchmarks/runs/*.json` and reports first→last deltas plus a per-run series, flagging regressions in win rate or latency.
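
To make the gate-versus-curve distinction concrete, here is a minimal sketch. The sample win rates and the helper names `passesSmokeGate` and `firstLastDelta` are hypothetical illustrations, not the skill's actual code; only the `winRate ≥ 0.80` threshold comes from the description above.

```javascript
// Hypothetical win rates from successive benchmark runs (illustrative data only).
const winRates = [1.0, 0.95, 0.9, 0.85];

// The smoke gate as described: a binary check against a fixed threshold.
function passesSmokeGate(winRate, threshold = 0.8) {
  return winRate >= threshold;
}

// First→last delta over a numeric series; a negative value means the metric fell.
function firstLastDelta(series) {
  if (series.length === 0) return 0;
  return series[series.length - 1] - series[0];
}

const everyRunPasses = winRates.every((w) => passesSmokeGate(w));
const delta = firstLastDelta(winRates);

console.log(`every run passes the gate: ${everyRunPasses}`);
console.log(`first→last win-rate delta: ${(delta * 100).toFixed(1)} points`);
```

Every run in this sample clears the gate, yet the first→last delta is clearly negative, which is the kind of drift a trend report is meant to surface.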

When to use

- Before a release: check that the speedup hasn't drifted.
- After expanding the corpus: verify that runs on the new corpus still hit the win rate recorded by older runs.
- After upgrading `agent-booster`: surface latency / strategy changes.

Steps

1. Run the trend script from the project root:
   ```bash
   node plugins/ruflo-cost-tracker/scripts/trend.mjs
   ```
   Optional env:
   - `TREND_FORMAT=json`: emit JSON instead of markdown
   - `TREND_LIMIT=10`: consider only the most recent N runs (here, 10)
2. Inspect the drift summary: first vs last on win rate, avg latency, p99, escalation rate, speedup vs Gemini.
3. Inspect the per-run series: one row per run, including Sonnet 4.6 + Opus 4.7 baseline latencies if those were enabled (`BENCH_ANTHROPIC=1` at run time).
4. Regression flags: the script emits `> ⚠ Regression` callouts when:
   - Win rate dropped between first and last run
   - Avg latency rose ≥ 1.5× from first run
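
The two regression rules and the run-limit option can be sketched as follows. This is an illustration under assumptions, not the contents of `trend.mjs`: the field name `avgLatencyMs`, the helpers `limitRuns` and `regressionFlags`, the sample data, and the text after `> ⚠ Regression` are all hypothetical.

```javascript
// Hypothetical run records, oldest first. `winRate` mirrors the gate's field name;
// `avgLatencyMs` is an assumed name for the average-latency metric.
const runs = [
  { id: "run-1", winRate: 1.0, avgLatencyMs: 10 },
  { id: "run-2", winRate: 0.95, avgLatencyMs: 12 },
  { id: "run-3", winRate: 0.9, avgLatencyMs: 16 },
];

// Keep only the most recent N runs, mirroring what TREND_LIMIT is described as doing.
// Anything that is not a positive integer leaves the series untouched.
function limitRuns(allRuns, limit) {
  const n = Number(limit);
  return Number.isInteger(n) && n > 0 ? allRuns.slice(-n) : allRuns;
}

// Apply the two documented rules to the first and last run in the window.
function regressionFlags(windowRuns) {
  const flags = [];
  if (windowRuns.length < 2) return flags;
  const first = windowRuns[0];
  const last = windowRuns[windowRuns.length - 1];
  // Rule 1: win rate dropped between first and last run.
  if (last.winRate < first.winRate) {
    flags.push(`> ⚠ Regression: win rate ${first.winRate} → ${last.winRate}`);
  }
  // Rule 2: average latency rose to at least 1.5× the first run's value.
  if (first.avgLatencyMs > 0 && last.avgLatencyMs >= 1.5 * first.avgLatencyMs) {
    flags.push(`> ⚠ Regression: avg latency ${first.avgLatencyMs} → ${last.avgLatencyMs}`);
  }
  return flags;
}

const limit = undefined; // stand-in for the TREND_LIMIT value, if one is set
const flags = regressionFlags(limitRuns(runs, limit));
flags.forEach((line) => console.log(line));
```

Note that both rules compare only the endpoints of the window, so narrowing the window with a limit changes which run counts as "first" and can change which flags fire.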

Cross-references

- `cost-benchmark`: the producer of the run JSONs this skill consumes
- `bench/booster-corpus.json`: the corpus version is recorded in each run, so trends across corpus versions remain interpretable
- `docs/benchmarks/runs/latest.json`: the most recent run; smoke step 23 gates on `winRate ≥ 0.80` from this file
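
One way a reader might use the recorded corpus version when interpreting a trend is to compare runs only within the same version. The sketch below assumes a field named `corpusVersion` on each run; that name, the sample data, and both helpers are hypothetical and do not describe what the trend script itself does.

```javascript
// Hypothetical run records; `corpusVersion` is an assumed field name for the
// corpus version that each run is said to record.
const recordedRuns = [
  { id: "run-1", corpusVersion: "v1", winRate: 1.0 },
  { id: "run-2", corpusVersion: "v1", winRate: 0.95 },
  { id: "run-3", corpusVersion: "v2", winRate: 0.9 },
  { id: "run-4", corpusVersion: "v2", winRate: 0.88 },
];

// Group runs by corpus version, preserving run order within each group.
function groupByCorpusVersion(runList) {
  const groups = new Map();
  for (const run of runList) {
    const key = run.corpusVersion ?? "unknown";
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(run);
  }
  return groups;
}

// Compute the first→last win-rate change inside each corpus version separately.
function winRateDeltaPerVersion(runList) {
  const deltas = {};
  for (const [version, group] of groupByCorpusVersion(runList)) {
    deltas[version] = group[group.length - 1].winRate - group[0].winRate;
  }
  return deltas;
}

const perVersion = winRateDeltaPerVersion(recordedRuns);
for (const [version, delta] of Object.entries(perVersion)) {
  console.log(`${version}: ${(delta * 100).toFixed(1)} points`);
}
```

Splitting by version separates a drop caused by a harder corpus from a drop that happens while the corpus stays fixed.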
