semtools
Semtools: Semantic Search
Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.
Purpose
The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Unlike traditional text search (ripgrep) which matches exact strings, or structural search (ast-grep) which matches syntax patterns, semtools understands semantic meaning through embeddings.
Key capabilities:
- Semantic Search: Find code/text by meaning, not just keywords
- Workspace Management: Index large codebases for fast repeated searches
- Document Parsing: Convert PDFs, DOCX, PPTX to searchable text (requires API key)
Semtools excels at discovery - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.
When to Use This Skill
Use the semtools skill when you need meaning-based search:
Semantic Code Discovery:
- Finding code that implements a concept ("error handling", "data validation")
- Discovering similar functionality across different modules
- Locating examples of a pattern when you don't know exact names
- Understanding what code does without reading everything
Documentation & Knowledge:
- Searching documentation by concept, not keywords
- Finding related discussions in comments or docs
- Discovering similar issues or solutions
- Analyzing technical documents (PDFs, reports)
Use Cases:
- "Find all authentication-related code" (without knowing function names)
- "Show me error handling patterns" (regardless of specific error types)
- "Find code similar to this implementation" (semantic similarity)
- "Search research papers for 'distributed consensus'" (document search)
Choose semtools over file-search (ripgrep/ast-grep) when:
- You know the concept but not the keywords
- Exact string matching misses relevant results
- You want semantically similar code, not exact matches
- Searching across languages or mixed content
Still use file-search when:
- You know exact keywords, function names, or patterns
- You need structural code matching (ast-grep)
- Speed is critical (ripgrep is faster for exact matches)
- You're searching for specific symbols or references
Available Commands
Semtools provides three CLI commands you can use via `execute_command`:
- `search` - Semantic search across code and text files
- `workspace` - Manage workspaces for caching embeddings
- `parse` - Convert documents (PDF, DOCX, PPTX) to searchable text
All commands work out-of-the-box in your execution environment. Document parsing requires the LLAMA_CLOUD_API_KEY environment variable to be set.
Core Operations
1. Semantic Search (`search`)
Find files and code sections by semantic meaning:
```bash
# Basic semantic search
search "authentication logic" src/

# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/

# Get more results (default: 3)
search "database queries" --top-k 10 src/

# Control similarity threshold (0.0-1.0, lower = stricter)
search "API endpoints" --max-distance 0.4 src/
```
**Parameters:**
- `--n-lines N`: Show N lines of context around matches (default: 3)
- `--top-k K`: Return top K most similar matches (default: 3)
- `--max-distance D`: Maximum embedding distance (0.0-1.0, default: 0.3)
- `-i`: Case-insensitive matching
**Output format:**
```
Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47

def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None

Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...
```
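When post-processing results in a script, it can help to reduce the report to the files it names. The sketch below assumes the `Match` / `File:` / `Lines:` layout shown above; `sample_output` is illustrative text rather than a captured run, and `matched_files` is a hypothetical helper name.

```bash
# Illustrative text in the documented layout (not captured from a real run).
sample_output='Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
Match 2 (similarity: 0.18)
File: src/middleware/auth.py
Lines: 10-15'

# Hypothetical helper: read a search report on stdin and print the
# unique file paths named on its "File:" lines.
matched_files() {
  sed -n 's/^File: //p' | sort -u
}

printf '%s\n' "$sample_output" | matched_files
```

In practice the helper would sit at the end of a pipeline, for example `search "authentication logic" src/ | matched_files`.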
2. Workspace Management (`workspace`)
For large codebases, create workspaces to cache embeddings and enable fast repeated searches:
```bash
# Create/activate workspace
workspace use my-project

# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project

# Index files in workspace (workspace auto-detected from env var)
search "query" src/

# Check workspace status
workspace status

# Clean up old workspaces
workspace prune
```
**Benefits:**
- **Fast repeated searches**: Embeddings cached, no re-computation
- **Large codebases**: IVF_PQ indexing for scalability
- **Session persistence**: Maintain context across multiple searches
**When to use workspaces:**
- Searching the same codebase multiple times
- Very large projects (1000+ files)
- Interactive exploration sessions
- CI/CD pipelines with repeated searches
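One way to apply this is to wrap a batch of queries so they all share one workspace. The following is a minimal sketch: `search_many` and the `SEARCH_CMD` override are hypothetical names introduced here for illustration, and the override exists only so the function can be dry-run without Semtools installed.

```bash
# Hypothetical helper: run every query against the same directory with
# SEMTOOLS_WORKSPACE set, so cached embeddings can be reused between calls.
# Usage: search_many WORKSPACE DIR QUERY...
search_many() {
  local workspace="$1" dir="$2"
  shift 2
  local query
  for query in "$@"; do
    # SEARCH_CMD is an assumed override; by default the real `search` runs.
    SEMTOOLS_WORKSPACE="$workspace" "${SEARCH_CMD:-search}" "$query" "$dir"
  done
}
```

A call such as `search_many my-analysis src/ "retry logic" "rate limiting"` would then issue both searches under the `my-analysis` workspace.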
3. Document Parsing (`parse`) ⚠️ Requires API Key
Convert documents to searchable markdown (requires LlamaParse API key):
```bash
# Parse PDFs to markdown
parse research_papers/*.pdf

# Parse Word documents
parse reports/*.docx

# Parse presentations
parse slides/*.pptx

# Parse and pipe to search
parse docs/*.pdf | xargs search "neural networks"
```
**Supported formats:**
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)
**Configuration:**
```bash
# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."

# Via config file
cat > ~/.parse_config.json << EOF
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF
```
**Important:** Document parsing is **optional**. Semantic search works without it.
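Because parsing depends on the API key while search does not, a script can check for the key first and skip parsing gracefully. This is a sketch under stated assumptions: `have_parse_key`, `safe_parse`, and the `PARSE_CMD` override are hypothetical names, and the check only looks at the environment variable, not at `~/.parse_config.json`.

```bash
# Succeed only when the environment variable that parse needs is non-empty.
have_parse_key() {
  [ -n "${LLAMA_CLOUD_API_KEY:-}" ]
}

# Hypothetical wrapper: parse the given documents, or explain why
# parsing was skipped and return a non-zero status.
safe_parse() {
  if ! have_parse_key; then
    echo "LLAMA_CLOUD_API_KEY is not set; skipping document parsing" >&2
    return 1
  fi
  # PARSE_CMD is an assumed override for dry runs; it defaults to `parse`.
  "${PARSE_CMD:-parse}" "$@"
}
```

With the key exported, `safe_parse docs/*.pdf` behaves like a plain `parse` call; without it, the caller gets a clear message and can continue with semantic search alone.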
Workflow Patterns
Pattern 1: Concept Discovery
When you know what you're looking for conceptually but not by name:
```bash
# Step 1: Broad semantic search
search "rate limiting implementation" src/

# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10

# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/
```
Pattern 2: Similar Code Finder
When you want to find code similar to a reference implementation:
```bash
# Step 1: Extract key concepts from reference code
# [Read example_auth.py and identify key concepts]

# Step 2: Search for similar implementations
search "user authentication with JWT tokens" src/

# Step 3: Compare implementations
# [Review semantic matches to find similar approaches]
```
Pattern 3: Documentation Search
When researching concepts in documentation or comments:
```bash
# Search code comments semantically
search "thread safety guarantees" src/ --n-lines 10

# Search markdown documentation
search "deployment best practices" docs/

# Combined search
search "performance optimization" --top-k 20
```
Pattern 4: Cross-Language Search
When searching for concepts across different languages:
```bash
# Semantic search works across languages
search "connection pooling" src/

# May find:
# - Java: "ConnectionPool manager"
# - Python: "database connection reuse"
# - Go: "pool of persistent connections"
# All semantically related despite different terminology
```
Pattern 5: Document Analysis (with API key)
When analyzing PDFs or documents:
```bash
# Step 1: Parse documents to markdown
# (parse is piped through xargs elsewhere in this guide, which implies it
# prints the paths of the converted files; cat them into one file)
parse research/*.pdf | xargs cat > papers.md

# Step 2: Search converted content
search "transformer architecture" papers.md

# Step 3: Combine with code search
search "attention mechanism implementation" src/
```
Integration with file-search
Semtools and file-search (ripgrep/ast-grep) are complementary tools. Use them together for comprehensive search:
Search Strategy Matrix
| You Know | Use First | Then Use | Why |
|---|---|---|---|
| Exact keywords | ripgrep | search | Fast exact match, then find similar |
| Concept only | search | ripgrep | Find relevant code, then search specifics |
| Function name | ripgrep | search | Find definition, then find similar usage |
| Code pattern | ast-grep | search | Find structure, then find similar logic |
| Approximate idea | search | ripgrep + ast-grep | Discover, then drill down |
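The "Use First" column of the matrix can be condensed into a small lookup for scripts that route queries automatically. This is only a sketch: `first_tool` and its input labels (`keywords`, `concept`, and so on) are hypothetical names chosen here, not part of Semtools.

```bash
# Hypothetical helper: map "what you know" to the tool the matrix
# recommends trying first.
first_tool() {
  case "$1" in
    keywords|function-name) echo "ripgrep" ;;
    code-pattern)           echo "ast-grep" ;;
    concept|idea)           echo "search" ;;
    *)
      echo "unknown input: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `first_tool concept` prints `search`, after which the "Then Use" column suggests following up with ripgrep.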
Layered Search Approach
```bash
# Layer 1: Semantic discovery (what's related?)
search "user session management" --top-k 10

# Layer 2: Exact text search (what's the implementation?)
rg "SessionManager|session_store" --type py

# Layer 3: Structural search (how is it used?)
sg --pattern 'session.$METHOD($$$)' --lang python

# Layer 4: Reference tracking (where is it called?)
# [Use serena skill for symbol-level tracking]
```
Best Practices
1. Start Broad, Then Narrow
Use semantic search for discovery, then narrow with exact search:
```bash
# GOOD: Broad semantic discovery first
search "authentication" src/ --top-k 10
# [Review results to learn terminology]
rg "authenticate|verify_credentials" --type py src/

# AVOID: Starting too narrow and missing variations
rg "authenticate" --type py  # Misses "verify_credentials", "check_auth", etc.
```
2. Adjust Similarity Threshold
Tune `--max-distance` based on results:
```bash
# Too many irrelevant results? Decrease distance (more strict)
search "query" --max-distance 0.2

# Missing relevant results? Increase distance (more lenient)
search "query" --max-distance 0.5

# Default (0.3) works well for most cases
search "query"
```
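The tuning advice above can be automated by starting strict and loosening the threshold only when nothing comes back. The sketch below assumes that `search` prints nothing when no chunk falls within the threshold, which is not confirmed here; `search_widening` and the `SEARCH_CMD` override are hypothetical names used for illustration.

```bash
# Hypothetical helper: try progressively looser thresholds and stop at
# the first one that produces any output.
# Usage: search_widening QUERY DIR
search_widening() {
  local query="$1" dir="$2" distance output
  for distance in 0.2 0.3 0.4 0.5; do
    # SEARCH_CMD is an assumed override for dry runs; it defaults to `search`.
    output=$("${SEARCH_CMD:-search}" "$query" --max-distance "$distance" "$dir")
    if [ -n "$output" ]; then
      echo "# matched at --max-distance $distance"
      printf '%s\n' "$output"
      return 0
    fi
  done
  return 1
}
```

Starting at 0.2 keeps the first results precise; the loop only reaches 0.5, where false positives become more likely, if the stricter settings return nothing.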
3. Use Workspaces for Repeated Searches
For interactive exploration, always use workspaces:
```bash
# GOOD: Create workspace once, search many times
export SEMTOOLS_WORKSPACE=my-analysis
search "concept1" src/
search "concept2" src/
search "concept3" src/

# INEFFICIENT: With no workspace set, embeddings are re-computed every time
search "concept1" src/
search "concept2" src/
```
4. Combine with Context Tools
Get more context around semantic matches:
```bash
# Find semantically similar code
search "retry logic" src/ --n-lines 2

# Get more context with ripgrep
rg -C 10 "retry" src/specific_file.py

# Or read the full file
cat src/specific_file.py
```
5. Phrase Queries Conceptually
Write queries as concepts, not exact keywords:
```bash
# GOOD: Conceptual queries
search "handling network timeouts"
search "user input validation"
search "concurrent data access"

# LESS EFFECTIVE: Exact keyword queries (use ripgrep instead)
search "timeout"   # Use: rg "timeout"
search "validate"  # Use: rg "validate"
```
Understanding Semantic Distance
Semtools uses embedding vectors to measure semantic similarity:
- Distance 0.0: Identical meaning
- Distance 0.1-0.2: Very similar (synonyms, paraphrases)
- Distance 0.2-0.3: Related concepts (default threshold)
- Distance 0.3-0.4: Loosely related
- Distance 0.5+: Weakly related or unrelated
Practical guidelines:
```bash
# Strict matching (only close matches)
--max-distance 0.2

# Balanced matching (default, recommended)
--max-distance 0.3

# Lenient matching (exploratory search)
--max-distance 0.4

# Very lenient (may include false positives)
--max-distance 0.5
```
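To annotate results in a script, the distance bands listed above can be turned into labels. This is a rough sketch, not Semtools behavior: `distance_band` is a hypothetical helper, and because the list leaves the 0.0-0.1 and 0.4-0.5 ranges unspecified, the code folds them into the neighbouring bands as an assumption.

```bash
# Hypothetical helper: label a distance using the documented bands.
# Assumption: values in (0.0, 0.1) count as "very similar" and values in
# (0.4, 0.5) count as "loosely related", since the list does not cover them.
distance_band() {
  awk -v d="$1" 'BEGIN {
    if (d <= 0.0)      print "identical"
    else if (d <= 0.2) print "very similar"
    else if (d <= 0.3) print "related"
    else if (d < 0.5)  print "loosely related"
    else               print "weakly related or unrelated"
  }'
}
```

For instance, `distance_band 0.12` prints `very similar`, matching the first result in the sample output earlier in this guide.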
Local vs. Cloud Embeddings
Semantic Search (Local):
- Uses local embeddings (model2vec, potion-multilingual-128M)
- No API calls or cloud dependencies
- Fast, private, no cost
- Works offline
Document Parsing (Cloud):
- Uses LlamaParse API (cloud-based)
- Requires API key and internet connection
- Processes PDFs, DOCX, PPTX
- Usage-based pricing (check LlamaIndex pricing)
Privacy consideration: Semantic search is 100% local. Only document parsing sends data to LlamaParse API.
Performance Considerations
Speed Characteristics
Without workspace:
- First search: ~2-5 seconds (embedding computation)
- Subsequent searches: ~2-5 seconds each (re-compute embeddings)
With workspace (cached embeddings):
- First search: ~2-5 seconds (builds index)
- Subsequent searches: ~0.1-0.5 seconds (cached)
- Large codebases: IVF_PQ indexing for scalability
Comparison:
- ripgrep: 0.01-0.1 seconds (fastest, exact match)
- ast-grep: 0.1-0.5 seconds (fast, structural)
- semtools (cached): 0.1-0.5 seconds (fast, semantic)
- semtools (uncached): 2-5 seconds (slower, semantic)
Optimization Tips
```bash
# 1. Use workspaces for repeated searches
export SEMTOOLS_WORKSPACE=my-project

# 2. Limit search scope to relevant directories
search "query" src/ --not tests/

# 3. Use --top-k to control result count
search "query" --top-k 5

# 4. Pipe to head for quick preview
search "query" | head -50
```
Unix Pipeline Integration
Semtools is designed for Unix-style composition:
```bash
# Find and parse PDFs, then search
find docs/ -name "*.pdf" | xargs parse | xargs search "topic"

# Search and filter with grep
search "authentication" src/ | grep -i "jwt"

# Count matches
search "error handling" src/ | grep "Match" | wc -l

# Combine with other tools
search "API" src/ | xargs -I {} rg -l "REST" {}
```
Limitations
When NOT to Use Semtools
- Exact keyword search: Use ripgrep for known keywords
  ```bash
  # WRONG TOOL: Semantic search for exact function name
  search "authenticate_user"
  # RIGHT TOOL: Use ripgrep for exact matches
  rg "authenticate_user" --type py
  ```
- Structural code patterns: Use ast-grep for syntax matching
  ```bash
  # WRONG TOOL: Semantic search for code structure
  search "class with constructor"
  # RIGHT TOOL: Use ast-grep for structure
  sg --pattern 'class $NAME { constructor($$$) { $$$ } }'
  ```
- Symbol references: Use serena for LSP-based tracking
  ```bash
  # WRONG TOOL: Semantic search for all usages
  search "MyClass usage"
  # RIGHT TOOL: Use serena for precise references
  serena find_referencing_symbols --name 'MyClass'
  ```
- Small codebases: Overhead not worth it for <100 files
  - ripgrep is faster and simpler for small projects
Known Edge Cases
- Ambiguous queries: Vague concepts return broad results
- Technical jargon: Domain-specific terms may have lower accuracy
- Short code snippets: Limited context reduces embedding quality
- Mixed languages: Embeddings tuned for English (multilingual model used)
- Generated code: Repetitive patterns may cluster together
Troubleshooting
No Semantic Matches Found
If semantic search returns zero results:
- Verify files exist: Use ripgrep to confirm content
  ```bash
  rg "concept" src/
  ```
- Increase the distance threshold: Be more lenient
  ```bash
  search "query" --max-distance 0.5
  ```
- Rephrase query: Try different terminology
  ```bash
  search "user authentication"
  search "verify user credentials"
  search "login validation"
  ```
- Check file types: Ensure searching correct extensions
  ```bash
  search "query" src/*.py  # Target specific types
  ```
Too Many Irrelevant Results
If semantic search returns too much noise:
- Decrease the distance threshold: Be more strict
  ```bash
  search "query" --max-distance 0.2
  ```
- Limit result count: Review top matches only
  ```bash
  search "query" --top-k 3
  ```
- Narrow directory scope: Search specific paths
  ```bash
  search "query" src/specific_module/
  ```
- Refine query: Add more specific concepts
  ```bash
  # Vague
  search "data"
  # Specific
  search "data validation with regex patterns"
  ```
Document Parsing Fails
If `parse` fails:
- Verify API key is set:
  ```bash
  echo $LLAMA_CLOUD_API_KEY
  ```
- Check file format: Ensure supported format (PDF, DOCX, PPTX)
  ```bash
  file document.pdf  # Verify file type
  ```
- Check file size: Large files may timeout
  ```bash
  du -h document.pdf  # Check size
  ```
- Review parse config: Adjust timeouts if needed
  ```bash
  cat ~/.parse_config.json
  ```
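The format check in the list above can be scripted as a quick pre-flight test before calling `parse`. This is a minimal sketch that looks only at the file extension, not the file contents; `is_supported_doc` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed when the file name ends in one of the
# extensions this guide lists as supported (case-insensitive).
is_supported_doc() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *.pdf|*.docx|*.pptx) return 0 ;;
    *) return 1 ;;
  esac
}
```

A loop such as `for f in docs/*; do is_supported_doc "$f" || echo "skipping $f"; done` would then flag files that `parse` is unlikely to accept.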
Workspace Issues
If workspace commands fail:
```bash
# Check workspace status
workspace status

# Prune corrupted workspaces
workspace prune

# Recreate workspace
rm -rf ~/.semtools/workspaces/my-workspace
export SEMTOOLS_WORKSPACE=my-workspace
```