code-clone-assistant

Code Clone Assistant

Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).

Tools

  • PMD CPD v7.17.0+: Exact duplicate detection
  • Semgrep v1.140.0+: Pattern-based detection

Tested: October 2025 (30 violations detected across 3 sample files)

Coverage: ~3x more violations than using either tool alone

When to Use This Skill

Use this skill when:
  • Finding duplicate code in a codebase
  • Detecting DRY violations
  • Refactoring similar code patterns
  • Identifying copy-paste code

Why Two Tools?

PMD CPD and Semgrep detect different clone types:

| Aspect | PMD CPD | Semgrep |
| --- | --- | --- |
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within/across files (Pro only) |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |

Result: Using both finds ~3x more DRY violations.

Clone Types

| Type | Description | PMD CPD | Semgrep |
| --- | --- | --- | --- |
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ With `--ignore-*` flags | ✅ |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
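
To make the difference between Type-1, Type-2, and Type-3 concrete, here is a minimal Python sketch that normalizes a token stream before comparing two snippets. It only illustrates the general idea behind token-based matching with identifiers and literals ignored; it is not how PMD CPD is implemented, and the function and sample names are hypothetical.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments only; ignored when comparing.
_LAYOUT = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list:
    """Return the tokens of `source` with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")    # any identifier
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")   # any literal
        else:
            out.append(tok.string)  # keywords, operators, punctuation
    return out


ORIGINAL = """\
def total(prices):
    result = 0
    for p in prices:
        result += p
    return result
"""

# Type-2 clone: same structure, identifiers renamed.
RENAMED = """\
def sum_all(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

# Type-3 clone: near-miss with an added statement.
NEAR_MISS = """\
def sum_positive(values):
    acc = 0
    for v in values:
        if v > 0:
            acc += v
    return acc
"""

type2_match = normalized_tokens(ORIGINAL) == normalized_tokens(RENAMED)
type3_match = normalized_tokens(ORIGINAL) == normalized_tokens(NEAR_MISS)
```

Here the renamed copy matches after normalization while the near-miss copy does not, which is why Type-3 clones generally call for pattern-based rules rather than exact token comparison.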

Quick Start Workflow

```bash
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
```

---

Accepted Exceptions (Known Intentional Duplication)

Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.

When Duplication Is Acceptable

| Pattern | Why Acceptable | Example |
| --- | --- | --- |
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each `gen{NNN}/` is independent |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use `sed` placeholder replacement (`__PLACEHOLDER__`), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |

How to Report Accepted Exceptions

When clone detection finds duplication that matches an accepted exception pattern:
  1. Report it — always show the user what was found (lines, tokens, files)
  2. Flag as accepted — explicitly state it matches a known exception pattern
  3. Explain why — cite the specific reason refactoring is not recommended
  4. Do NOT recommend refactoring — this is the key difference from actionable findings
Example output format:
Code Clone Analysis Results

PMD CPD Findings:
  Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
    gen610_template.sql:33 ↔ gen710_template.sql:38
    Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
    Reason: Each generation is immutable. Shared CTEs would break
            experiment provenance and reproducibility.

  Clone 2: 36 lines (478 tokens) — metrics aggregation
    gen610_template.sql:207 ↔ gen710_template.sql:244
    Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
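
A report in roughly this shape could be produced by a small renderer like the sketch below. The `Clone` dataclass, the `ACTIONABLE` status label, and the second sample finding are hypothetical, and the separators are plain ASCII, so the output only approximates the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clone:
    lines: int
    tokens: int
    summary: str
    left: str                          # "file:line" of the first occurrence
    right: str                         # "file:line" of the second occurrence
    exception: Optional[str] = None    # name of the matched exception pattern, if any
    reason: Optional[str] = None       # why refactoring is not recommended


def render_report(clones) -> str:
    """Render findings, flagging accepted exceptions without recommending refactoring."""
    out = ["Code Clone Analysis Results", "", "PMD CPD Findings:"]
    for index, clone in enumerate(clones, start=1):
        out.append(
            f"  Clone {index}: {clone.lines} lines ({clone.tokens} tokens) - {clone.summary}"
        )
        out.append(f"    {clone.left} <-> {clone.right}")
        if clone.exception:
            out.append(f"    Status: ACCEPTED EXCEPTION ({clone.exception})")
            if clone.reason:
                out.append(f"    Reason: {clone.reason}")
        else:
            out.append("    Status: ACTIONABLE")  # hypothetical label
        out.append("")
    accepted = sum(1 for clone in clones if clone.exception)
    out.append(f"Actionable Findings: {len(clones) - accepted}")
    out.append(f"Accepted Exceptions: {accepted}")
    return "\n".join(out)


EXAMPLE = [
    Clone(115, 575, "base_bars -> signals CTEs",
          "gen610_template.sql:33", "gen710_template.sql:38",
          exception="generation-per-directory experiment",
          reason="Each generation is immutable."),
    # Hypothetical actionable finding, not taken from the example above.
    Clone(12, 80, "retry loop", "client_a.py:10", "client_b.py:22"),
]

report = render_report(EXAMPLE)
```

Accepted exceptions are still listed in full, so the user sees everything that was detected; only the status and the final counts separate them from actionable findings.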

Project-Level Exception Configuration

Projects can declare accepted exception patterns in their `CLAUDE.md`:

```markdown
## Code Clone Exceptions

- `sql/gen*_template.sql`: generation-per-directory experiments (immutable)
- `scripts/gen*/`: copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/`: intentional duplication for test isolation
```

When this section exists in a project's `CLAUDE.md`, the code-clone-assistant should check it before classifying findings.
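
One way such a check could work is sketched below. The parsing rules are assumptions: the section is found by a heading named "Code Clone Exceptions", each bullet holds a glob followed by a dash or colon and a reason, and a pattern ending in `/` covers everything under that directory. The function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Bullet, optional backticks around the glob, a separator
# (em dash written as \u2014, "--", "-" or ":"), then the reason.
_BULLET = re.compile(r"[-*\u2022]\s*`?([^`\s]+)`?\s*(?:\u2014|--|-|:)\s*(.+)")


def parse_exceptions(claude_md: str) -> list:
    """Extract (glob, reason) pairs from the 'Code Clone Exceptions' section."""
    rules = []
    in_section = False
    for line in claude_md.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            title = stripped.lstrip("#").strip().lower()
            in_section = title == "code clone exceptions"
            continue
        if in_section:
            match = _BULLET.match(stripped)
            if match:
                rules.append((match.group(1), match.group(2).strip()))
    return rules


def match_exception(path: str, rules):
    """Return the reason of the first matching exception, or None if actionable."""
    for pattern, reason in rules:
        # Assumption: a trailing slash means "anything under this directory".
        glob = pattern + "*" if pattern.endswith("/") else pattern
        if fnmatchcase(path, glob):
            return reason
    return None


CLAUDE_MD = """\
# Project notes

## Code Clone Exceptions

- `sql/gen*_template.sql`: generation-per-directory experiments (immutable)
- `scripts/gen*/`: copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/`: intentional duplication for test isolation

## Other section

- `src/`: not an exception
"""

rules = parse_exceptions(CLAUDE_MD)
```

A finding whose files all return a reason from `match_exception` would be reported as an accepted exception; anything returning `None` stays actionable.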

---

Reference Documentation

For detailed information, see:
  • Detection Commands - PMD CPD and Semgrep command details
  • Complete Workflow - Detection, analysis, and presentation phases
  • Refactoring Strategies - Approaches for addressing violations

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| PMD CPD not found | Not installed or not in PATH | `brew install pmd` or download from PMD releases |
| Semgrep timeout | Large codebase scan | Use `--exclude` to limit scope |
| No duplicates detected | minimum-tokens too high | Lower `--minimum-tokens` value (try 15) |
| Too many false positives | minimum-tokens too low | Increase `--minimum-tokens` (try 30+) |
| Language not recognized | Wrong `-l` flag | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set `PMD_JAVA_OPTS=-Xmx4g` |
| Missing clone rules file | Custom rules not created | Create `clone-rules.yaml` or use default config |
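
Two of these issues (a tool missing from PATH, a missing rules file) can be caught before running anything. The sketch below is a hypothetical preflight helper using only the Python standard library; the default names `pmd`, `semgrep`, and `clone-rules.yaml` come from the commands in this document.

```python
import os
import shutil


def preflight(rules_file: str = "clone-rules.yaml", tools=("pmd", "semgrep")) -> list:
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not os.path.isfile(rules_file):
        problems.append(f"{rules_file} is missing")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(f"preflight: {problem}")
```

Running it from the project root prints one line per problem and nothing when both tools and the rules file are available.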