Code Clone Assistant
Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).
Tools
- PMD CPD v7.17.0+: Exact duplicate detection
- Semgrep v1.140.0+: Pattern-based detection
Tested: October 2025 - 30 violations detected across 3 sample files
Coverage: ~3x more violations than using either tool alone
When to Use This Skill
Use this skill when:
- Finding duplicate code in a codebase
- Detecting DRY violations
- Refactoring similar code patterns
- Identifying copy-paste code
Why Two Tools?
PMD CPD and Semgrep detect different clone types:
| Aspect | PMD CPD | Semgrep |
|---|---|---|
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within a file; cross-file analysis requires Semgrep Pro |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |
Result: Using both finds ~3x more DRY violations.
Clone Types
| Type | Description | PMD CPD | Semgrep |
|---|---|---|---|
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ | ✅ |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
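To make the first two clone types concrete, here is a minimal sketch of token-based comparison using Python's standard `tokenize` module. It only illustrates the idea and does not show how PMD CPD is implemented. The function name `normalized_tokens` and the `rename_identifiers` switch are hypothetical names chosen for this example.

```python
import io
import keyword
import tokenize

# Token kinds that carry only layout or comments, not program content.
LAYOUT = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
          tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def normalized_tokens(source, rename_identifiers=False):
    """Return the token strings of `source`, ignoring layout and comments.

    With rename_identifiers=True every non-keyword name becomes "ID",
    so snippets that differ only in naming compare equal.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT:
            continue
        is_name = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        tokens.append("ID" if rename_identifiers and is_name else tok.string)
    return tokens

original = "def total(prices):\n    return sum(prices) * 1.2\n"
reformatted = "def total( prices ):  # add tax\n    return sum(prices)*1.2\n"
renamed = "def gross(amounts):\n    return sum(amounts) * 1.2\n"

# Type-1: same tokens once whitespace and comments are ignored.
same_type1 = normalized_tokens(original) == normalized_tokens(reformatted)
# Type-2: differs on raw tokens, matches after identifier normalization.
same_type2_raw = normalized_tokens(original) == normalized_tokens(renamed)
same_type2_norm = normalized_tokens(original, True) == normalized_tokens(renamed, True)
```

Type-3 clones add or remove statements, so their token sequences differ in length and a plain equality check like this no longer applies. That is the gap the pattern-based Semgrep rules are meant to cover.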
Quick Start Workflow
```bash
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
```
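Step 3 can be sketched for the Semgrep side as follows. This is a minimal example, assuming the results file follows the standard SARIF 2.1.0 layout (`runs[].results[]` with `ruleId`, `level`, `message.text`, and a `physicalLocation`). The function name `load_sarif_findings`, the ranking table, and the sample rule IDs are hypothetical. Results that omit `level` are treated as `warning`, which is a simplification.

```python
import json

# SARIF result levels, ordered from most to least severe.
LEVEL_RANK = {"error": 0, "warning": 1, "note": 2, "none": 3}

def load_sarif_findings(sarif_text):
    """Flatten SARIF results into dicts sorted by severity, file, and line."""
    sarif = json.loads(sarif_text)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": result.get("ruleId", "unknown"),
                "level": result.get("level", "warning"),
                "file": physical.get("artifactLocation", {}).get("uri", "?"),
                "line": physical.get("region", {}).get("startLine", 0),
                "message": result.get("message", {}).get("text", ""),
            })
    findings.sort(key=lambda f: (LEVEL_RANK.get(f["level"], 99), f["file"], f["line"]))
    return findings

def _result(rule, level, uri, line):
    """Build one SARIF result entry for the sample document below."""
    return {
        "ruleId": rule, "level": level, "message": {"text": rule},
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": uri}, "region": {"startLine": line}}}],
    }

# Hypothetical sample standing in for semgrep-results.sarif.
sample = json.dumps({"version": "2.1.0", "runs": [{"results": [
    _result("duplicate-validation", "warning", "src/b.py", 40),
    _result("duplicate-error-handler", "error", "src/a.py", 12),
]}]})

findings = load_sarif_findings(sample)
```

In practice the text would come from reading `semgrep-results.sarif`. The PMD CPD markdown report needs its own parser before the two lists can be merged.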
---
Accepted Exceptions (Known Intentional Duplication)
Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.
When Duplication Is Acceptable
| Pattern | Why Acceptable | Example |
|---|---|---|
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |
How to Report Accepted Exceptions
When clone detection finds duplication that matches an accepted exception pattern:
- Report it — always show the user what was found (lines, tokens, files)
- Flag as accepted — explicitly state it matches a known exception pattern
- Explain why — cite the specific reason refactoring is not recommended
- Do NOT recommend refactoring — this is the key difference from actionable findings
Example output format:
```
Code Clone Analysis Results

PMD CPD Findings:

Clone 1: 115 lines (575 tokens) - base_bars → signals CTEs
  gen610_template.sql:33 ↔ gen710_template.sql:38
  Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
  Reason: Each generation is immutable. Shared CTEs would break
          experiment provenance and reproducibility.

Clone 2: 36 lines (478 tokens) - metrics aggregation
  gen610_template.sql:207 ↔ gen710_template.sql:244
  Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
```
Project-Level Exception Configuration
Projects can declare accepted exception patterns in their `CLAUDE.md`:
```markdown
## Code Clone Exceptions

- `sql/gen*_template.sql` - generation-per-directory experiments (immutable)
- `scripts/gen*/` - copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` - intentional duplication for test isolation
```
When this section exists in a project's `CLAUDE.md`, the code-clone-assistant should check it before classifying findings.
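One way to apply such a list when classifying findings is sketched below with the standard `fnmatch` module. Several things here are assumptions made for illustration: the `EXCEPTIONS` table is hard-coded rather than parsed from `CLAUDE.md`, patterns ending in `/` are treated as directory prefixes, and a clone counts as an accepted exception only when every file involved matches a pattern. The names `match_exception` and `classify_clone` are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns and reasons copied from the example CLAUDE.md section above.
EXCEPTIONS = [
    ("sql/gen*_template.sql", "generation-per-directory experiments (immutable)"),
    ("scripts/gen*/", "copy-and-adapt sweep scripts (no shared infrastructure)"),
    ("tests/fixtures/", "intentional duplication for test isolation"),
]

def match_exception(path):
    """Return the reason for the first pattern matching `path`, else None."""
    for pattern, reason in EXCEPTIONS:
        # Assumption: a trailing slash means "everything under this directory".
        glob = pattern + "*" if pattern.endswith("/") else pattern
        if fnmatchcase(path, glob):
            return reason
    return None

def classify_clone(files):
    """Classify one clone given the repository-relative paths it spans."""
    reasons = [match_exception(path) for path in files]
    # Assumption: every participating file must be covered by an exception.
    if reasons and all(reasons):
        return "ACCEPTED EXCEPTION", reasons[0]
    return "ACTIONABLE", None

status, reason = classify_clone(["sql/gen610_template.sql", "sql/gen710_template.sql"])
```

Requiring all files to match is the conservative choice: a clone between a fixture and production code still surfaces as actionable.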
---
Reference Documentation
For detailed information, see:
- Detection Commands - PMD CPD and Semgrep command details
- Complete Workflow - Detection, analysis, and presentation phases
- Refactoring Strategies - Approaches for addressing violations
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| PMD CPD not found | Not installed or not in PATH | |
| Semgrep timeout | Large codebase scan | Use |
| No duplicates detected | minimum-tokens too high | Lower |
| Too many false positives | minimum-tokens too low | Increase |
| Language not recognized | Wrong | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set |
| Missing clone rules file | Custom rules not created | Create |