cc-canary

cc-canary — long-form regression writeup

Primary question: has Claude regressed on this user's work, and when?

Bundled script `scripts/compute_stats.py` does ~95% of the work in ~2.5s: scans JSONLs, runs inflection + transition-day detection, builds pre/post aggregates, cross-version comparison, hour-of-day, word frequency, three-period thinking depth, visibility transition, per-turn rates, and abnormalities, then renders a complete markdown skeleton with every table filled. Narrative slots are marked `<!-- C: ... -->`.

Default window: `60d`. Accept `7d / 14d / 30d / 60d / 90d / 180d`.
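
A minimal sketch of how the window argument could be checked before the script is invoked, assuming the six accepted values and the `60d` default listed above. `ALLOWED_WINDOWS` and `normalize_window` are hypothetical names for illustration; the bundled script may validate its input differently.

```python
# Hypothetical helper: not part of scripts/compute_stats.py.
ALLOWED_WINDOWS = ("7d", "14d", "30d", "60d", "90d", "180d")
DEFAULT_WINDOW = "60d"


def normalize_window(window=None):
    """Return a window string accepted by the skill, defaulting to 60d."""
    if window is None:
        return DEFAULT_WINDOW
    if window not in ALLOWED_WINDOWS:
        raise ValueError(
            f"unsupported window {window!r}; expected one of {ALLOWED_WINDOWS}"
        )
    return window
```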

Framing — three-bucket classification

  1. Model-side — same user/task, worse outcomes; cross-version worse; reasoning-depth dropping
  2. User-side — project-mix shift; shorter/imperative prompts; new codebase; shortcut vocab rising
  3. Ambiguous — mixed confounds, borderline effect size, either-way-explainable

Your 3-step job

1. Run the script

Bash(python3 <SKILL_DIR>/scripts/compute_stats.py --window {window} --render-md /tmp/cc-canary-skeleton-{window}.md > /dev/null 2>&1)
`<SKILL_DIR>`: `.claude/skills/cc-canary/` → fallback to `~/.claude/skills/cc-canary/`.
Flags: `--window {Nd}` (required); `--include-agents` (include subagent sessions); `--min-user-words N` (default 10).
If the script fails: report error, retry once with `--include-agents`, else stop. Never fall back to hand-computation; that's the slow path.
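
The directory fallback and the retry-once rule above can be sketched as follows. This is an illustration under assumptions, not the skill's own code: `resolve_skill_dir`, `build_command`, and `run_with_one_retry` are hypothetical names, and unlike the command shown above (which discards output) this sketch captures stderr so the error can be reported.

```python
import subprocess
import sys
from pathlib import Path


def resolve_skill_dir(cwd, home):
    """Prefer the project-local skill directory, else the per-user one."""
    local = cwd / ".claude" / "skills" / "cc-canary"
    return local if local.is_dir() else home / ".claude" / "skills" / "cc-canary"


def build_command(skill_dir, window, include_agents=False):
    """Assemble the compute_stats.py invocation using the documented flags."""
    cmd = [
        "python3", str(skill_dir / "scripts" / "compute_stats.py"),
        "--window", window,
        "--render-md", f"/tmp/cc-canary-skeleton-{window}.md",
    ]
    if include_agents:
        cmd.append("--include-agents")
    return cmd


def run_with_one_retry(skill_dir, window, runner=subprocess.run):
    """Run once; on failure report stderr, retry with --include-agents, then stop."""
    first = runner(build_command(skill_dir, window), capture_output=True, text=True)
    if first.returncode == 0:
        return first
    print(first.stderr, file=sys.stderr)  # report the error before retrying
    second = runner(
        build_command(skill_dir, window, include_agents=True),
        capture_output=True, text=True,
    )
    if second.returncode != 0:
        raise RuntimeError("compute_stats.py failed twice; stopping")
    return second
```

A caller would pass `Path.cwd()` and `Path.home()` to `resolve_skill_dir`; the `runner` parameter exists only so the retry logic can be exercised without running the real script.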

2. Read the skeleton

Read /tmp/cc-canary-skeleton-{window}.md

3. Fill every `<!-- C: ... -->` placeholder and save

Write ./cc-canary-{YYYY-MM-DD}.md
End your message with the absolute path:
Wrote /Users/.../cc-canary-{date}.md · paste-ready.
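
As an illustration of the naming convention above, the dated filename and the closing line could be built like this. `report_path` and `final_line` are hypothetical helper names, and the sketch assumes `{YYYY-MM-DD}` means the current date in ISO format.

```python
from datetime import date
from pathlib import Path


def report_path(today, cwd):
    """Absolute path of ./cc-canary-{YYYY-MM-DD}.md relative to cwd."""
    return (cwd / f"cc-canary-{today.isoformat()}.md").resolve()


def final_line(path):
    """Closing line of the message, carrying the absolute path."""
    return f"Wrote {path} · paste-ready."
```

For example, `final_line(report_path(date.today(), Path.cwd()))` yields the sentence to end the message with.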

Narrative placeholders

Each placeholder's inline comment already spells out what to write. Summary:
  • `verdict-line`: HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE + brief justification
  • `summary`: 2–4 sentences: verdict + inflection + biggest pre→post delta. No counter-evidence hedging.
  • `timeline`: 1–2 paragraphs on the daily series shape
  • `xv-para`: 1 paragraph on cross-version (if §3 present)
  • `finding-N-class` × up to 5: inline classification label
  • `finding-N-reason` × up to 5: 2–3 sentences max, evidence-first. Lead with the strongest number (cross-version Δ, §2 value, appendix rate). No signal-line restating, no rhetorical buildup.
  • `root-cause`: 3–5 paragraphs tying strongest signals together
  • `what-would-help`: 2–4 concrete bullets
  • `appendix-a1…a4, b, c, d, e, f, g, h`: 1 paragraph each (see table context in skeleton)
  • `meta-note`: 2–5 sentences, first person, honest, no claimed feelings, acknowledge the recursion

Verdict calibration

  • HOLDING: ≤1 model-side signal
  • SUSPECTED REGRESSION: 2–3 model-side signals
  • CONFIRMED REGRESSION: ≥3 model-side signals + non-empty cross-version showing decline + `session_count ≥ 15` + ≥2 models + `inflection.gap_sigma ≥ 1.0`
  • INCONCLUSIVE: `session_count < 15` OR `inflection.method == "fallback_split_half"` with overlapping confounds

Cap at SUSPECTED when: only one model; <15 sessions; single-project with project starting mid-window; inflection coincides with a visible user-side event.
All the data you need (session_count, model mix, inflection method, cross-version presence) is rendered as plain text in the skeleton.
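
One way to read the calibration rules as a decision procedure is sketched below. It rests on several assumptions that the text does not settle: INCONCLUSIVE is checked first, the "with overlapping confounds" qualifier is taken to apply only to the `fallback_split_half` condition, three or more signals without the full checklist fall back to SUSPECTED, and `fallback_split_half` is treated as a cap. `SkeletonFacts` and `verdict` are hypothetical names; the field values would be read by hand from the skeleton.

```python
from dataclasses import dataclass


@dataclass
class SkeletonFacts:
    """Hypothetical container for values read from the rendered skeleton."""
    model_side_signals: int
    cross_version_declines: bool = False
    session_count: int = 0
    model_count: int = 1
    gap_sigma: float = 0.0
    inflection_method: str = ""
    overlapping_confounds: bool = False
    single_project_started_mid_window: bool = False
    user_side_event_at_inflection: bool = False


def verdict(f):
    fallback = f.inflection_method == "fallback_split_half"
    # Assumption: INCONCLUSIVE takes precedence over every other label.
    if f.session_count < 15 or (fallback and f.overlapping_confounds):
        return "INCONCLUSIVE"
    if f.model_side_signals <= 1:
        return "HOLDING"
    # Conditions that cap the verdict at SUSPECTED.
    capped = (
        f.model_count < 2
        or fallback
        or f.single_project_started_mid_window
        or f.user_side_event_at_inflection
    )
    # session_count >= 15 is already guaranteed by the first check.
    full_checklist = (
        f.model_side_signals >= 3
        and f.cross_version_declines
        and f.model_count >= 2
        and f.gap_sigma >= 1.0
    )
    if full_checklist and not capped:
        return "CONFIRMED REGRESSION"
    return "SUSPECTED REGRESSION"
```

The ordering matters mainly for small samples: under this reading, a window with fewer than 15 sessions never reaches the signal count at all.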

Grounding example (for `finding-N-reason`)

Classification: model-side. Read:Edit dropped 9.0 → 1.0 (-89%, concerning) while cross-version shows opus-4-7 at 0.39 vs opus-4-6 at 1.00 on the same user's workload. No project-mix shift near the inflection — model is defaulting to edit-first.

Hard rules

  • Never read, grep, or glob `~/.claude/projects/**/*.jsonl`. Never run `jq` / `awk` / `wc` on session files. Script owns all that.
  • Never touch tables or numbers; they came from real data.
  • Every finding gets a classification label.
  • Hedge when cross-version is empty or `session_count < 15`.
  • Do not verdict CONFIRMED REGRESSION without the full checklist.
  • Do not save the skeleton as-is; replace every `<!-- C: ... -->` first.
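
The last rule lends itself to a quick self-check before saving. A minimal sketch, assuming placeholders always take the literal form `<!-- C: ... -->` (possibly spanning lines); `unfilled_placeholders` is a hypothetical helper, not something the bundled script provides.

```python
import re

# Matches HTML comments whose body starts with "C:", the narrative-slot marker.
PLACEHOLDER = re.compile(r"<!--\s*C:.*?-->", re.DOTALL)


def unfilled_placeholders(markdown):
    """Return every narrative placeholder still present in the draft."""
    return PLACEHOLDER.findall(markdown)
```

An empty list means the draft is safe to write; anything else names the slots that still need prose.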

Failure modes

  • Script import error → check `python3 -V` ≥ 3.8; retry once with `--include-agents`; else stop.
  • Skeleton < 5KB → likely no sessions in window. Check script error.
  • `inflection.method == fallback_split_half` → state it; cap at SUSPECTED.
  • Cross-version Δ `None` → div-by-zero when model-A value is 0; note the confound.
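
To make the last failure mode concrete, here is a sketch of a relative delta that yields `None` when the baseline is zero. The formula is an assumption (the script's actual computation is not shown here), and `relative_delta` is a hypothetical name; model A is taken to be the baseline because the text names it as the divisor.

```python
def relative_delta(model_a_value, model_b_value):
    """Relative change from model A to model B, or None if A is zero."""
    if model_a_value == 0:
        return None  # division by zero: no meaningful relative change
    return (model_b_value - model_a_value) / model_a_value
```

Applied to the grounding example's Read:Edit figures, `relative_delta(9.0, 1.0)` is about -0.889, which matches the -89% quoted there. A `None` result is not evidence of "no change"; it only means the comparison is undefined and should be flagged as a confound.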