Detect and refactor code duplication with PMD CPD. TRIGGERS - code clones, DRY violations, duplicate code.
npx skill4agent add terrylica/cc-skills code-clone-assistant

| Aspect | PMD CPD | Semgrep |
|---|---|---|
| Detects | Exact copy-paste duplicates | Similar patterns with variations |
| Scope | Across files ✅ | Within a file; across files with Pro only |
| Matching | Token-based (ignores formatting) | Pattern-based (AST matching) |
| Rules | ❌ No custom rules | ✅ Custom rules |

| Type | Description | PMD CPD | Semgrep |
|---|---|---|---|
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ | ✅ |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
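
To make the "token-based" row concrete, here is a minimal Python sketch of why a token comparison treats reformatted code as an exact copy (Type-1) and matches renamed code (Type-2) only once identifiers are collapsed. It uses the standard-library `tokenize` module and illustrates the idea only; it is not PMD CPD's implementation, and `token_stream` and its `ignore_identifiers` parameter are hypothetical names.

```python
import io
import keyword
import tokenize

# Tokens that carry only commentary or blank-line layout.
_DROPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Structural layout tokens: keep the kind, discard the exact whitespace.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def token_stream(source, ignore_identifiers=False):
    """Return the source as a list of comparable token strings."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _DROPPED:
            continue
        if tok.type in _LAYOUT:
            tokens.append(tokenize.tok_name[tok.type])
        elif (ignore_identifiers and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            tokens.append("ID")  # collapse every identifier to one placeholder
        else:
            tokens.append(tok.string)
    return tokens


original = "def total(prices):\n    return sum(prices) * 1.2\n"
reformatted = "def total( prices ):   # with tax\n        return sum(prices)*1.2\n"
renamed = "def cost(items):\n    return sum(items) * 1.2\n"

# Type-1: spacing, indentation width, and comments do not change the tokens.
print(token_stream(original) == token_stream(reformatted))  # True
# Type-2: renamed identifiers differ until they are normalized.
print(token_stream(original) == token_stream(renamed))  # False
print(token_stream(original, True) == token_stream(renamed, True))  # True
```

A real detector compares sliding windows of such tokens across files rather than whole snippets, which is where a minimum token count comes in.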
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md
# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif
# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity
# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests

| Pattern | Why Acceptable | Example |
|---|---|---|
| Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each |
| SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |
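
Step 3 of the workflow (parse the outputs, prioritize by severity) together with the accepted patterns in this table could be sketched roughly as follows. This is a minimal illustration, not the skill's actual logic: it reads the standard SARIF 2.1.0 fields (`runs[].results[]`, `level`, `physicalLocation`), assumes a missing `level` means `warning`, and uses hypothetical glob patterns in `ACCEPTED_GLOBS`. The names `load_findings`, `is_accepted`, and `triage` are invented for this example.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical globs for paths whose clones are accepted by design.
ACCEPTED_GLOBS = ["sql/gen*_template.sql", "scripts/gen*/*", "tests/fixtures/*"]

# SARIF result levels, most severe first.
LEVEL_RANK = {"error": 0, "warning": 1, "note": 2, "none": 3}


def load_findings(sarif_text):
    """Flatten SARIF runs into simple finding records."""
    findings = []
    for run in json.loads(sarif_text).get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": result.get("ruleId", "unknown"),
                "level": result.get("level", "warning"),
                "path": physical.get("artifactLocation", {}).get("uri", ""),
                "line": physical.get("region", {}).get("startLine", 0),
            })
    return findings


def is_accepted(path):
    """True when the path matches one of the accepted-exception globs."""
    return any(fnmatchcase(path, glob) for glob in ACCEPTED_GLOBS)


def triage(findings):
    """Sort by severity, then split into (actionable, accepted)."""
    ordered = sorted(
        findings,
        key=lambda f: (LEVEL_RANK.get(f["level"], 1), f["path"], f["line"]),
    )
    actionable = [f for f in ordered if not is_accepted(f["path"])]
    accepted = [f for f in ordered if is_accepted(f["path"])]
    return actionable, accepted
```

Calling `triage(load_findings(text))` on the contents of a SARIF file yields the two lists that a report can summarize as actionable findings and accepted exceptions.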
Code Clone Analysis Results
PMD CPD Findings:
Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
gen610_template.sql:33 ↔ gen710_template.sql:38
Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
Reason: Each generation is immutable. Shared CTEs would break
experiment provenance and reproducibility.
Clone 2: 36 lines (478 tokens) — metrics aggregation
gen610_template.sql:207 ↔ gen710_template.sql:244
Status: ACCEPTED EXCEPTION (SQL template without include mechanism)
Actionable Findings: 0
Accepted Exceptions: 2

CLAUDE.md

## Code Clone Exceptions
- `sql/gen*_template.sql` - generation-per-directory experiments (immutable)
- `scripts/gen*/` - copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` - intentional duplication for test isolation

| Issue | Cause | Solution |
|---|---|---|
| PMD CPD not found | Not installed or not in PATH | |
| Semgrep timeout | Large codebase scan | Use |
| No duplicates detected | minimum-tokens too high | Lower |
| Too many false positives | minimum-tokens too low | Increase |
| Language not recognized | Wrong | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set |
| Missing clone rules file | Custom rules not created | Create |
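
The first row ("not installed or not in PATH") can be checked before running any scan. The sketch below is one possible pre-flight check using the standard-library `shutil.which`. It assumes the executables are named `pmd` and `semgrep`, as in the commands shown earlier, and `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("pmd", "semgrep")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

A wrapper script might call `missing_tools()` first and stop with a clear message listing the returned names, instead of failing partway through the workflow.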