finding-duplicate-functions
Finding Duplicate-Intent Functions
Overview
LLM-generated codebases accumulate semantic duplicates: functions that serve the same purpose but were implemented independently. Classical copy-paste detectors (jscpd) find syntactic duplicates but miss "same intent, different implementation."
This skill uses a two-phase approach: classical extraction followed by LLM-powered intent clustering.
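To make "same intent, different implementation" concrete, here is a minimal sketch of a pair that a token-based detector would likely miss. Both function names and both file paths are hypothetical and are not taken from any real codebase.

```typescript
// Hypothetical file: src/utils/text.ts
function truncate(text: string, maxLength: number): string {
  if (text.length <= maxLength) return text;
  return text.slice(0, maxLength - 1) + "…";
}

// Hypothetical file: src/components/labels.ts
// Written independently, shares almost no tokens with truncate().
function shortenLabel(label: string, limit: number): string {
  const out: string[] = [];
  for (let i = 0; i < label.length; i++) {
    // At the cut-off point, emit an ellipsis instead of the character and stop.
    if (i === limit - 1 && label.length > limit) {
      out.push("…");
      break;
    }
    out.push(label.charAt(i));
  }
  return out.join("");
}

console.log(truncate("duplicate detection", 10)); // duplicate…
console.log(shortenLabel("duplicate detection", 10)); // duplicate…
```

The two bodies look nothing alike, yet callers use them for the same purpose. That is the kind of pair the intent-clustering phase is meant to surface.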
When to Use
- Codebase has grown organically with multiple contributors (human or LLM)
- You suspect utility functions have been reimplemented multiple times
- Before major refactoring to identify consolidation opportunities
- After jscpd has been run and syntactic duplicates are already handled
Quick Reference
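| Phase | Tool | Model | Output |
|---|---|---|---|
| 1. Extract | `scripts/extract-functions.sh` | - | `catalog.json` |
| 2. Categorize | `scripts/categorize-prompt.md` | haiku | `categorized.json` |
| 3. Split | `scripts/prepare-category-analysis.sh` | - | `./categories/*.json` |
| 4. Detect | `scripts/find-duplicates-prompt.md` | opus | `./duplicates/{category}.json` |
| 5. Report | `scripts/generate-report.sh` | - | `./duplicates-report.md` |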
Process
```dot
digraph duplicate_detection {
  rankdir=TB;
  node [shape=box];
  extract [label="1. Extract function catalog\n./scripts/extract-functions.sh"];
  categorize [label="2. Categorize by domain\n(haiku subagent)"];
  split [label="3. Split into categories\n./scripts/prepare-category-analysis.sh"];
  detect [label="4. Find duplicates per category\n(opus subagent per category)"];
  report [label="5. Generate report\n./scripts/generate-report.sh"];
  review [label="6. Human review & consolidate"];
  extract -> categorize -> split -> detect -> report -> review;
}
```
Phase 1: Extract Function Catalog
```bash
./scripts/extract-functions.sh src/ -o catalog.json
```
Options:
- `-o FILE`: Output file (default: stdout)
- `-c N`: Lines of context to capture (default: 15)
- `-t GLOB`: File types (default: `*.ts,*.tsx,*.js,*.jsx`)
- `--include-tests`: Include test files (excluded by default)

Test files (`*.test.*`, `*.spec.*`, `__tests__/**`) are excluded by default since test utilities are less likely to be consolidation candidates.
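The default exclusion can be pictured as a simple path filter. The sketch below is an illustration only: `isTestFile` is a hypothetical helper, it assumes forward-slash paths, and it assumes `__tests__` may appear at any directory depth. The actual script may match these patterns differently.

```typescript
// Hypothetical helper mirroring the three default patterns from the text:
// *.test.*, *.spec.*, and __tests__/**
function isTestFile(path: string): boolean {
  const segments = path.split("/");
  const fileName = segments[segments.length - 1] ?? "";
  // __tests__/**: any file below a directory named __tests__ (assumed: at any depth)
  const inTestsDir = segments.slice(0, -1).includes("__tests__");
  // *.test.* and *.spec.*: the file name contains ".test." or ".spec."
  const hasTestInfix = /\.(test|spec)\./.test(fileName);
  return inTestsDir || hasTestInfix;
}

const files = [
  "src/utils/date.ts",
  "src/utils/date.test.ts",
  "src/__tests__/helpers.ts",
  "src/api/spec.ts",
];
const catalogInputs = files.filter((file) => !isTestFile(file));
console.log(catalogInputs);
```

Note that `src/api/spec.ts` stays in the catalog under this reading, because the pattern requires `.spec.` inside the file name rather than a file that is merely called `spec`.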
Phase 2: Categorize by Domain
Dispatch a haiku subagent using the prompt in `scripts/categorize-prompt.md`.
Insert the contents of `catalog.json` where indicated in the prompt template. Save the output as `categorized.json`.
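Filling the template is a plain string substitution. The sketch below assumes the insertion point is marked with `{{CATALOG}}`. That marker, the `buildCategorizePrompt` helper, and the sample catalog fields are all hypothetical, so check `scripts/categorize-prompt.md` for the marker it actually uses.

```typescript
// Assumption: the template marks the insertion point with "{{CATALOG}}".
const PLACEHOLDER = "{{CATALOG}}";

function buildCategorizePrompt(template: string, catalogJson: string): string {
  if (!template.includes(PLACEHOLDER)) {
    throw new Error(`Template has no ${PLACEHOLDER} marker`);
  }
  // Parse once so a truncated catalog fails here rather than inside the subagent.
  JSON.parse(catalogJson);
  // split/join sidesteps the special "$" patterns that String.replace
  // interprets in a replacement string.
  return template.split(PLACEHOLDER).join(catalogJson);
}

const prompt = buildCategorizePrompt(
  "Categorize these functions by domain:\n{{CATALOG}}",
  '[{"name":"truncate","file":"src/utils/text.ts"}]', // hypothetical catalog entry
);
console.log(prompt);
```

Validating the JSON before dispatch is a cheap guard, since a malformed catalog would otherwise only show up as a confusing model response.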
Phase 3: Split into Categories
```bash
./scripts/prepare-category-analysis.sh categorized.json ./categories
```
Creates one JSON file per category. Only categories with 3+ functions are worth analyzing.
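Conceptually, the split is a group-by followed by a size filter. The following sketch assumes each categorized entry carries a `category` field. The `CategorizedFunction` shape and the `splitByCategory` helper are hypothetical, and the real script's output format may differ.

```typescript
// Hypothetical shape of one entry in categorized.json.
interface CategorizedFunction {
  name: string;
  file: string;
  category: string;
}

// From the text: categories with fewer than 3 functions are not worth analyzing.
const MIN_FUNCTIONS = 3;

function splitByCategory(
  entries: CategorizedFunction[],
): Map<string, CategorizedFunction[]> {
  const groups = new Map<string, CategorizedFunction[]>();
  for (const entry of entries) {
    const bucket = groups.get(entry.category) ?? [];
    bucket.push(entry);
    groups.set(entry.category, bucket);
  }
  // Keep only categories large enough to be worth a detection pass.
  const kept = new Map<string, CategorizedFunction[]>();
  groups.forEach((bucket, category) => {
    if (bucket.length >= MIN_FUNCTIONS) kept.set(category, bucket);
  });
  return kept;
}

const categories = splitByCategory([
  { name: "truncate", file: "src/utils/text.ts", category: "string-formatting" },
  { name: "shortenLabel", file: "src/components/labels.ts", category: "string-formatting" },
  { name: "toKebabCase", file: "src/lib/case.ts", category: "string-formatting" },
  { name: "formatDate", file: "src/utils/date.ts", category: "date-formatting" },
]);
console.log(Array.from(categories.keys()));
```

Here only `string-formatting` survives; the single-entry `date-formatting` category has nothing to compare against.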
Phase 4: Find Duplicates (Per Category)
For each category file in `./categories/`, dispatch an opus subagent using the prompt in `scripts/find-duplicates-prompt.md`.
Save each output as `./duplicates/{category}.json`.
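The per-category fan-out can be planned before any subagent is dispatched. This sketch only maps input files to output paths. It assumes category files are named `{category}.json`, and the `DetectionTask` type and `planDetectionTasks` helper are hypothetical. How the subagent itself is invoked depends on your tooling and is left out.

```typescript
interface DetectionTask {
  category: string;
  inputFile: string; // file produced in Phase 3
  promptFile: string; // prompt template named in the text
  outputFile: string; // where the subagent's answer is saved
}

// Assumption: each file in ./categories is named "{category}.json".
function planDetectionTasks(categoryFiles: string[]): DetectionTask[] {
  return categoryFiles
    .filter((file) => file.endsWith(".json"))
    .map((file) => {
      const category = file.slice(0, -".json".length);
      return {
        category,
        inputFile: `./categories/${file}`,
        promptFile: "scripts/find-duplicates-prompt.md",
        outputFile: `./duplicates/${category}.json`,
      };
    });
}

const tasks = planDetectionTasks(["validation.json", "path-manipulation.json", "README.md"]);
for (const task of tasks) {
  console.log(`${task.inputFile} -> ${task.outputFile}`);
}
```

Because each task reads one category file and writes one output file, the tasks are independent and can be dispatched in any order.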
Phase 5: Generate Report
```bash
./scripts/generate-report.sh ./duplicates ./duplicates-report.md
```
Produces a prioritized Markdown report grouped by confidence level.
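One way to picture "grouped by confidence level" is the sketch below. The `DuplicateGroup` shape, the MEDIUM and LOW levels, and the heading text are assumptions made for illustration; the real `generate-report.sh` output may be structured differently.

```typescript
// Assumption: three levels. Only HIGH is named in this document.
type Confidence = "HIGH" | "MEDIUM" | "LOW";

// Hypothetical shape of one finding in ./duplicates/{category}.json.
interface DuplicateGroup {
  confidence: Confidence;
  functions: string[];
  survivor: string; // the implementation recommended to keep
}

const ORDER: Confidence[] = ["HIGH", "MEDIUM", "LOW"];

function renderReport(groups: DuplicateGroup[]): string {
  const lines: string[] = ["# Duplicate-Intent Functions"];
  for (const level of ORDER) {
    const matching = groups.filter((group) => group.confidence === level);
    if (matching.length === 0) continue; // skip empty sections
    lines.push("", `## ${level} confidence`);
    for (const group of matching) {
      lines.push(`- ${group.functions.join(", ")} (keep: ${group.survivor})`);
    }
  }
  return lines.join("\n");
}

const report = renderReport([
  { confidence: "LOW", functions: ["formatDate", "toIsoDay"], survivor: "formatDate" },
  { confidence: "HIGH", functions: ["truncate", "shortenLabel"], survivor: "truncate" },
]);
console.log(report);
```

Listing the highest-confidence findings first puts the safest consolidation candidates at the top of the review.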
Phase 6: Human Review
Review the report. For HIGH confidence duplicates:
- Verify the recommended survivor has tests
- Update callers to use the survivor
- Delete the duplicates
- Run tests
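Before deleting a duplicate, it can help to compare it against the survivor on the inputs its callers actually pass. The sketch below uses two hypothetical error formatters, defined inline for illustration; in a real codebase they would be imported from their modules.

```typescript
// Hypothetical survivor.
function formatError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Hypothetical duplicate slated for deletion.
function errorToString(e: unknown): string {
  if (typeof e === "object" && e !== null && "message" in e) {
    return String((e as { message: unknown }).message);
  }
  return String(e);
}

// Samples should mirror what the duplicate's call sites really pass.
const samples: unknown[] = [
  new Error("disk full"),
  "timeout",
  404,
  null,
  { message: "custom" },
];

// Any input listed here behaves differently after consolidation.
const mismatches = samples.filter(
  (sample) => formatError(sample) !== errorToString(sample),
);
console.log(mismatches);
```

In this sketch the plain object `{ message: "custom" }` is the one input where the two disagree: the survivor yields `[object Object]` while the duplicate yields `custom`. A divergence like that should be covered by a test on the survivor, or handled in it, before callers are switched over.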
High-Risk Duplicate Zones
Focus extraction on these areas first, since they accumulate duplicates fastest:
| Zone | Common Duplicates |
|---|---|
| | General utilities reimplemented |
| Validation code | Same checks written multiple ways |
| Error formatting | Error-to-string conversions |
| Path manipulation | Joining, resolving, normalizing paths |
| String formatting | Case conversion, truncation, escaping |
| Date formatting | Same formats implemented repeatedly |
| API response shaping | Similar transformations for different endpoints |
Common Mistakes
Extracting too much: Focus on exported functions and public methods. Internal helpers are less likely to be duplicated across files.
Skipping the categorization step: Going straight to duplicate detection on the full catalog produces noise. Categories focus the comparison.
Using haiku for duplicate detection: Haiku is cost-effective for categorization but misses subtle semantic duplicates. Use Opus for the actual duplicate analysis.
Consolidating without tests: Before deleting duplicates, ensure the survivor has tests covering all use cases of the deleted functions.