indexion-explore

indexion explore

Analyze file similarity across a directory to find duplicates and related code.

When to Use

  • User asks to find similar or duplicate files
  • User wants to understand code overlap before refactoring
  • User asks "what files are related to X?"
  • User wants to detect copy-paste code
  • Quick scan before detailed refactoring: use as a first pass, then follow up with indexion plan refactor for actionable detail

Usage

Basic similarity matrix (default: tfidf strategy)

indexion explore <path>

List format with threshold (most useful for finding duplicates)

indexion explore --format=list --threshold=0.7 <path>

Cluster similar files together

indexion explore --format=cluster --threshold=0.6 <path>

JSON output for further processing

indexion explore --format=json --threshold=0.5 <path>

Filter by extension

indexion explore --ext=.mbt --ext=.ts <path>

Include/exclude patterns

indexion explore --include='*.ts' --exclude='*_test.ts' src/

Filter out config noise

indexion explore --format=list --threshold=0.7 \
  --include='*.mbt' --exclude='*moon.pkg*' cmd/indexion/

Function-level tree edit distance (more precise, slower)

indexion explore --strategy=apted --format=list <path>
indexion explore --strategy=tsed --format=list <path>

Hybrid strategy (auto-selects TF-IDF or APTED based on dataset size)

indexion explore --strategy=hybrid --format=list <path>
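
The --include and --exclude filters above read like shell-style globs. The following is a minimal Python sketch of that kind of filtering, assuming a path is kept when it matches any include pattern (or no include pattern is given) and matches no exclude pattern. The function name select_files and the sample paths are hypothetical; how indexion actually matches patterns may differ.

```python
from fnmatch import fnmatchcase


def select_files(paths, include=(), exclude=()):
    """Keep paths that match an include glob (if any are given) and no exclude glob.

    Assumption: patterns are shell-style globs matched against the whole path.
    """
    selected = []
    for path in paths:
        if include and not any(fnmatchcase(path, pat) for pat in include):
            continue  # an include list was given and nothing in it matched
        if any(fnmatchcase(path, pat) for pat in exclude):
            continue  # explicitly excluded
        selected.append(path)
    return selected


# Hypothetical paths, for illustration only.
paths = [
    "cmd/indexion/main.mbt",
    "cmd/indexion/moon.pkg",
    "cmd/indexion/README.md",
]

print(select_files(paths, include=["*.mbt"]))
print(select_files(paths, exclude=["*moon.pkg*"]))
```

Quoting the patterns on the command line, as the examples do, keeps the shell from expanding them before the tool sees them.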

Strategies

| Strategy | Description | Speed |
|---|---|---|
| tfidf (default) | TF-IDF token similarity | Fast |
| hybrid | Dynamic TF-IDF + APTED, auto-selects based on dataset size | Adaptive |
| ncd | Normalized Compression Distance | Fast |
| apted | All-Path Tree Edit Distance (function-level) | Slow |
| tsed | Tree Structure Edit Distance (function-level) | Slow |
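
To give a feel for what the ncd strategy measures, here is a minimal Python sketch of Normalized Compression Distance using zlib. The choice of compressor and the sample inputs are assumptions made for illustration; the source does not say which compressor indexion uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data at maximum compression level."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs, near 1 for unrelated ones."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


# Hypothetical inputs: a repetitive code-like text and an unrelated byte pattern.
x = b"fn add(a: Int, b: Int) -> Int { a + b }\n" * 20
y = bytes(range(256)) * 4

print(f"ncd(x, x) = {ncd(x, x):.3f}")
print(f"ncd(x, y) = {ncd(x, y):.3f}")
```

The idea is that concatenating two similar files compresses almost as well as one of them alone, so the distance stays small. A similarity score can be derived as 1 minus the distance.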

Output Formats

  • matrix: Full similarity matrix (default, good for small sets)
  • list: Sorted pairs above threshold (best for finding duplicates)
  • cluster: Groups of similar files
  • json: Machine-readable output
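
The difference between the list and cluster formats can be sketched in Python from a table of pairwise scores. This is only an illustration under stated assumptions: the scores are made up, the threshold is treated as inclusive, and clusters are taken to be connected components of the "similar enough" graph. The source does not describe indexion's actual clustering method.

```python
def list_pairs(scores, threshold):
    """Return (a, b, score) tuples at or above the threshold, highest score first."""
    pairs = [(a, b, s) for (a, b), s in scores.items() if s >= threshold]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


def cluster(scores, threshold):
    """Group files linked by any pair at or above the threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), s in scores.items():
        ra, rb = find(a), find(b)
        if s >= threshold and ra != rb:
            parent[ra] = rb  # merge the two groups

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)
    # Report only groups with at least two files.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical pairwise similarity scores.
scores = {
    ("a.mbt", "b.mbt"): 0.96,
    ("a.mbt", "c.mbt"): 0.40,
    ("b.mbt", "c.mbt"): 0.35,
    ("c.mbt", "d.mbt"): 0.72,
}

print(list_pairs(scores, 0.7))
print(cluster(scores, 0.6))
```

The list view answers "which pairs should I look at first?", while the cluster view answers "which files belong to the same family?".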

Relationship to Other Commands

| Task | Use |
|---|---|
| "What files are similar?" | explore --format=list |
| "Find nested for loops" | grep "for ... for" |
| "Find functions named sort" | grep --semantic=name:sort |
| "What exactly is duplicated?" | plan refactor --threshold=0.9 |
| "Find code similar to a description" | grep --semantic="similar:..." |

Workflow: explore → plan refactor

  1. Run indexion explore --format=list --threshold=0.7 <path> for a quick scan
  2. If high-similarity pairs exist, run indexion plan refactor --threshold=0.9 <path> for details
  3. Fix duplicates, then re-run both to verify
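
If the two steps are scripted, the commands can be assembled as argument lists. The Python sketch below only builds and prints the command lines from the workflow above; the helper names are hypothetical, and nothing here parses indexion's output, whose format the source does not specify.

```python
import shlex


def explore_cmd(path, threshold=0.7):
    """Argument list for the quick scan (step 1)."""
    return ["indexion", "explore", "--format=list", f"--threshold={threshold}", path]


def plan_refactor_cmd(path, threshold=0.9):
    """Argument list for the detailed follow-up (step 2)."""
    return ["indexion", "plan", "refactor", f"--threshold={threshold}", path]


def workflow(path):
    """Both commands in the order they are meant to be run."""
    return [explore_cmd(path), plan_refactor_cmd(path)]


# Hypothetical target directory.
for argv in workflow("cmd/indexion/"):
    print(shlex.join(argv))
```

Each list can be handed to subprocess.run as is. Deciding whether step 2 is needed still means reading the step 1 report.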

Dogfooding Lessons

  • moon.pkg files inflate similarity scores because they all look alike; exclude them with --exclude='*moon.pkg*' for meaningful results
  • 96%+ similarity between CLI files usually means duplicated utility functions
  • 85-95% similarity is often structural (same CLI patterns) and not always actionable
  • 100% similarity between types.mbt files is normal: type definition files share structural patterns (pub struct + getters) that inflate TF-IDF scores
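
The last point, that shared structure alone can raise a token-based score, can be reproduced with a small TF-IDF sketch in Python. The tokenizer, the smoothed IDF formula, and the MoonBit-like snippets are all assumptions chosen for illustration; indexion's own tokenization and weighting are not described in the source.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split into identifiers/numbers and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", source)


def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict) per document, using a smoothed IDF."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(tok for c in counts for tok in c)  # document frequency
    idf = {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical snippets: two type files with different names, one logic file.
types_a = "pub struct User { name : String } pub fn User::name(self : User) -> String { self.name }"
types_b = "pub struct Order { id : String } pub fn Order::id(self : Order) -> String { self.id }"
logic = "fn main { let total = compute(1, 2) println(total) }"

va, vb, vl = tfidf_vectors([types_a, types_b, logic])
print(f"types_a vs types_b: {cosine(va, vb):.2f}")
print(f"types_a vs logic:   {cosine(va, vl):.2f}")
```

The two type files share no domain names at all, yet they score well above the unrelated pair because the struct-plus-getter scaffolding dominates the token counts.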