semtools

Semtools: Semantic Search

Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.

Purpose

The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Unlike traditional text search (ripgrep) which matches exact strings, or structural search (ast-grep) which matches syntax patterns, semtools understands semantic meaning through embeddings.
Key capabilities:
  1. Semantic Search: Find code/text by meaning, not just keywords
  2. Workspace Management: Index large codebases for fast repeated searches
  3. Document Parsing: Convert PDFs, DOCX, PPTX to searchable text (requires API key)
Semtools excels at discovery - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.

When to Use This Skill

Use the semtools skill when you need meaning-based search:
Semantic Code Discovery:
  • Finding code that implements a concept ("error handling", "data validation")
  • Discovering similar functionality across different modules
  • Locating examples of a pattern when you don't know exact names
  • Understanding what code does without reading everything
Documentation & Knowledge:
  • Searching documentation by concept, not keywords
  • Finding related discussions in comments or docs
  • Discovering similar issues or solutions
  • Analyzing technical documents (PDFs, reports)
Use Cases:
  • "Find all authentication-related code" (without knowing function names)
  • "Show me error handling patterns" (regardless of specific error types)
  • "Find code similar to this implementation" (semantic similarity)
  • "Search research papers for 'distributed consensus'" (document search)
Choose semtools over file-search (ripgrep/ast-grep) when:
  • You know the concept but not the keywords
  • Exact string matching misses relevant results
  • You want semantically similar code, not exact matches
  • Searching across languages or mixed content
Still use file-search when:
  • You know exact keywords, function names, or patterns
  • You need structural code matching (ast-grep)
  • Speed is critical (ripgrep is faster for exact matches)
  • You're searching for specific symbols or references

Available Commands

Semtools provides three CLI commands you can use via `execute_command`:
  • `search` - Semantic search across code and text files
  • `workspace` - Manage workspaces for caching embeddings
  • `parse` - Convert documents (PDF, DOCX, PPTX) to searchable text
All commands work out-of-the-box in your execution environment. Document parsing requires the LLAMA_CLOUD_API_KEY environment variable to be set.

Core Operations

1. Semantic Search (`search`)

Find files and code sections by semantic meaning:

```bash
# Basic semantic search
search "authentication logic" src/

# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/

# Get more results (default: 3)
search "database queries" --top-k 10 src/

# Control similarity threshold (0.0-1.0, lower = stricter)
search "API endpoints" --max-distance 0.4 src/
```

**Parameters:**
- `--n-lines N`: Show N lines of context around matches (default: 3)
- `--top-k K`: Return top K most similar matches (default: 3)
- `--max-distance D`: Maximum embedding distance (0.0-1.0, default: 0.3)
- `-i`: Case-insensitive matching
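
As a rough illustration of how these flags combine, the sketch below assembles a `search` invocation as an argument list. `build_search_command` is a hypothetical helper, not part of Semtools; it uses only the flags listed above and does not execute anything.

```python
from typing import List, Optional


def build_search_command(
    query: str,
    paths: List[str],
    n_lines: Optional[int] = None,
    top_k: Optional[int] = None,
    max_distance: Optional[float] = None,
    ignore_case: bool = False,
) -> List[str]:
    """Assemble an argv list for `search` (hypothetical helper, illustration only)."""
    if max_distance is not None and not 0.0 <= max_distance <= 1.0:
        raise ValueError("max_distance must be within 0.0-1.0")
    cmd = ["search", query]
    # Flags are only added when set, so the tool's own defaults apply otherwise.
    if n_lines is not None:
        cmd += ["--n-lines", str(n_lines)]
    if top_k is not None:
        cmd += ["--top-k", str(top_k)]
    if max_distance is not None:
        cmd += ["--max-distance", str(max_distance)]
    if ignore_case:
        cmd.append("-i")
    return cmd + list(paths)


if __name__ == "__main__":
    print(build_search_command("API endpoints", ["src/"], top_k=10, max_distance=0.4))
```

Where Semtools is installed, the resulting list could be handed to `subprocess.run`; keeping the query as a single list element avoids shell quoting problems with multi-word queries.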

**Output format:**

```
Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47

def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None

Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...
```
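
If results need post-processing, the match headers can be pulled out with a regular expression. The sketch below assumes the header layout shown in the sample above (`Match N (similarity: X)`, then `File:`, then an optional `Lines: A-B`) and tolerates either spaces or newlines between the fields. The real output may differ, and `parse_match_headers` is a hypothetical name.

```python
import re
from typing import Dict, List

# Pattern derived only from the sample output above; treat it as an assumption.
HEADER_RE = re.compile(
    r"Match\s+(?P<rank>\d+)\s+\(similarity:\s*(?P<score>\d+(?:\.\d+)?)\)\s+"
    r"File:\s*(?P<file>\S+)"
    r"(?:\s+Lines:\s*(?P<start>\d+)-(?P<end>\d+))?"
)


def parse_match_headers(output: str) -> List[Dict[str, object]]:
    """Extract rank, score, file and optional line range from each match header."""
    results: List[Dict[str, object]] = []
    for m in HEADER_RE.finditer(output):
        entry: Dict[str, object] = {
            "rank": int(m.group("rank")),
            "score": float(m.group("score")),
            "file": m.group("file"),
        }
        # The line range is optional because the second sample match omits it.
        if m.group("start") is not None:
            entry["lines"] = (int(m.group("start")), int(m.group("end")))
        results.append(entry)
    return results


if __name__ == "__main__":
    sample = (
        "Match 1 (similarity: 0.12)\nFile: src/auth/handlers.py\nLines: 42-47\n\n"
        "Match 2 (similarity: 0.18)\nFile: src/middleware/auth.py\n"
    )
    for row in parse_match_headers(sample):
        print(row)
```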

2. Workspace Management (`workspace`)

For large codebases, create workspaces to cache embeddings and enable fast repeated searches:

```bash
# Create/activate workspace
workspace use my-project

# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project

# Index files in workspace (workspace auto-detected from env var)
search "query" src/

# Check workspace status
workspace status

# Clean up old workspaces
workspace prune
```

**Benefits:**
- **Fast repeated searches**: Embeddings cached, no re-computation
- **Large codebases**: IVF_PQ indexing for scalability
- **Session persistence**: Maintain context across multiple searches

**When to use workspaces:**
- Searching the same codebase multiple times
- Very large projects (1000+ files)
- Interactive exploration sessions
- CI/CD pipelines with repeated searches
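
The caching benefit can be pictured with a small sketch. This is not how Semtools implements workspaces; it is a minimal illustration, assuming embeddings are keyed by a hash of the file content so that unchanged files are never re-embedded. `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for a real embedding model.

```python
import hashlib
from typing import Callable, Dict, List


def fake_embed(text: str) -> List[float]:
    """Placeholder for a real embedding model (assumption for illustration)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:8]]


class EmbeddingCache:
    """Cache embeddings by content hash so repeated searches skip recomputation."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.computed = 0  # how many times the embedder was actually called

    def get(self, content: str) -> List[float]:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: pay the embedding cost once and remember the result.
            self._store[key] = self._embed(content)
            self.computed += 1
        return self._store[key]


if __name__ == "__main__":
    cache = EmbeddingCache(fake_embed)
    files = {"a.py": "def login(): ...", "b.py": "def logout(): ..."}
    for _ in range(3):  # three "searches" over the same files
        for content in files.values():
            cache.get(content)
    print("embedder calls:", cache.computed)
```

Keying on content rather than file name means an edited file gets a new key and is re-embedded, while untouched files stay cached.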

3. Document Parsing (`parse`) ⚠️ Requires API Key

Convert documents to searchable markdown (requires LlamaParse API key):

```bash
# Parse PDFs to markdown
parse research_papers/*.pdf

# Parse Word documents
parse reports/*.docx

# Parse presentations
parse slides/*.pptx

# Parse and pipe to search
parse docs/*.pdf | xargs search "neural networks"
```

**Supported formats:**
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)

**Configuration:**
```bash
# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."

# Via config file
cat > ~/.parse_config.json << EOF
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF
```

**Important:** Document parsing is **optional**. Semantic search works without it.
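
The same config file could also be written programmatically. This is a sketch, assuming the three keys shown in the configuration example above are the ones `parse` reads; `write_parse_config` is a hypothetical helper, and the target path is passed in explicitly so that nothing is overwritten by accident.

```python
import json
import tempfile
from pathlib import Path


def write_parse_config(
    path: Path,
    api_key: str,
    max_concurrent_requests: int = 10,
    timeout_seconds: int = 3600,
) -> dict:
    """Write a parse config with the keys shown above and return what was written."""
    if max_concurrent_requests < 1 or timeout_seconds < 1:
        raise ValueError("max_concurrent_requests and timeout_seconds must be positive")
    config = {
        "api_key": api_key,
        "max_concurrent_requests": max_concurrent_requests,
        "timeout_seconds": timeout_seconds,
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Demo writes to a temporary directory; the real file lives at ~/.parse_config.json.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / ".parse_config.json"
        write_parse_config(target, api_key="llx-...")
        print(target.read_text(encoding="utf-8"))
```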

Workflow Patterns

Pattern 1: Concept Discovery

When you know what you're looking for conceptually but not by name:

```bash
# Step 1: Broad semantic search
search "rate limiting implementation" src/

# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10

# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/
```

Pattern 2: Similar Code Finder

When you want to find code similar to a reference implementation:

```bash
# Step 1: Extract key concepts from reference code
# [Read example_auth.py and identify key concepts]

# Step 2: Search for similar implementations
search "user authentication with JWT tokens" src/

# Step 3: Compare implementations
# [Review semantic matches to find similar approaches]
```

Pattern 3: Documentation Search

When researching concepts in documentation or comments:

```bash
# Search code comments semantically
search "thread safety guarantees" src/ --n-lines 10

# Search markdown documentation
search "deployment best practices" docs/

# Combined search
search "performance optimization" --top-k 20
```

Pattern 4: Cross-Language Search

When searching for concepts across different languages:

```bash
# Semantic search works across languages
search "connection pooling" src/

# May find:
# - Java: "ConnectionPool manager"
# - Python: "database connection reuse"
# - Go: "pool of persistent connections"
# All semantically related despite different terminology
```

Pattern 5: Document Analysis (with API key)

When analyzing PDFs or documents:

```bash
# Step 1: Parse documents to markdown
parse research/*.pdf > papers.md

# Step 2: Search converted content
search "transformer architecture" papers.md

# Step 3: Combine with code search
search "attention mechanism implementation" src/
```

Integration with file-search

Semtools and file-search (ripgrep/ast-grep) are complementary tools. Use them together for comprehensive search:

Search Strategy Matrix

| You Know | Use First | Then Use | Why |
|---|---|---|---|
| Exact keywords | ripgrep | search | Fast exact match, then find similar |
| Concept only | search | ripgrep | Find relevant code, then search specifics |
| Function name | ripgrep | search | Find definition, then find similar usage |
| Code pattern | ast-grep | search | Find structure, then find similar logic |
| Approximate idea | search | ripgrep + ast-grep | Discover, then drill down |

Layered Search Approach

```bash
# Layer 1: Semantic discovery (what's related?)
search "user session management" --top-k 10

# Layer 2: Exact text search (what's the implementation?)
rg "SessionManager|session_store" --type py

# Layer 3: Structural search (how is it used?)
sg --pattern 'session.$METHOD($$$)' --lang python

# Layer 4: Reference tracking (where is it called?)
# [Use serena skill for symbol-level tracking]
```

Best Practices

1. Start Broad, Then Narrow

Use semantic search for discovery, then narrow with exact search:

```bash
# GOOD: Broad semantic discovery first
search "authentication" src/ --top-k 10
# [Review results to learn terminology]
rg "authenticate|verify_credentials" --type py src/

# AVOID: Starting too narrow and missing variations
rg "authenticate" --type py  # Misses "verify_credentials", "check_auth", etc.
```

2. Adjust Similarity Threshold

Tune `--max-distance` based on results:

```bash
# Too many irrelevant results? Decrease distance (more strict)
search "query" --max-distance 0.2

# Missing relevant results? Increase distance (more lenient)
search "query" --max-distance 0.5

# Default (0.3) works well for most cases
search "query"
```

3. Use Workspaces for Repeated Searches

For interactive exploration, always use workspaces:

```bash
# GOOD: Create workspace once, search many times
export SEMTOOLS_WORKSPACE=my-analysis
search "concept1" src/
search "concept2" src/
search "concept3" src/

# INEFFICIENT: Re-compute embeddings every time
search "concept1" src/
search "concept2" src/
```

4. Combine with Context Tools

Get more context around semantic matches:

```bash
# Find semantically similar code
search "retry logic" src/ --n-lines 2

# Get more context with ripgrep
rg -C 10 "retry" src/specific_file.py

# Or read the full file
cat src/specific_file.py
```

5. Phrase Queries Conceptually

Write queries as concepts, not exact keywords:

```bash
# GOOD: Conceptual queries
search "handling network timeouts"
search "user input validation"
search "concurrent data access"

# LESS EFFECTIVE: Exact keyword queries (use ripgrep instead)
search "timeout"   # Use: rg "timeout"
search "validate"  # Use: rg "validate"
```

Understanding Semantic Distance

Semtools uses embedding vectors to measure semantic similarity:
  • Distance 0.0: Identical meaning
  • Distance 0.1-0.2: Very similar (synonyms, paraphrases)
  • Distance 0.2-0.3: Related concepts (default threshold)
  • Distance 0.3-0.4: Loosely related
  • Distance 0.5+: Weakly related or unrelated
Practical guidelines:
```bash
# Strict matching (only close matches)
--max-distance 0.2

# Balanced matching (default, recommended)
--max-distance 0.3

# Lenient matching (exploratory search)
--max-distance 0.4

# Very lenient (may include false positives)
--max-distance 0.5
```
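
To make the distance scale concrete, here is a minimal sketch of a distance computation. The source does not say which metric Semtools uses; cosine distance (1 minus cosine similarity) is assumed here, and the labels simply restate the ranges listed in this section. Ranges that the list leaves open (between 0.0 and 0.1, and between 0.4 and 0.5) are folded into the neighbouring bucket as a further assumption.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)


def describe_distance(d: float) -> str:
    """Map a distance to the rough labels in this section (boundaries are assumptions)."""
    if d <= 0.2:
        return "very similar"
    if d <= 0.3:
        return "related (within default threshold)"
    if d <= 0.4:
        return "loosely related"
    return "weakly related or unrelated"


def within_threshold(d: float, max_distance: float = 0.3) -> bool:
    """Assume a match is kept when its distance does not exceed the threshold."""
    return d <= max_distance


if __name__ == "__main__":
    d = cosine_distance([1.0, 1.0], [1.0, 0.0])
    print(round(d, 3), describe_distance(d), within_threshold(d))
```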

Local vs. Cloud Embeddings

Semantic Search (Local):
  • Uses local embeddings (model2vec, potion-multilingual-128M)
  • No API calls or cloud dependencies
  • Fast, private, no cost
  • Works offline
Document Parsing (Cloud):
  • Uses LlamaParse API (cloud-based)
  • Requires API key and internet connection
  • Processes PDFs, DOCX, PPTX
  • Usage-based pricing (check LlamaIndex pricing)
Privacy consideration: Semantic search is 100% local. Only document parsing sends data to LlamaParse API.

Performance Considerations

Speed Characteristics

Without workspace:
  • First search: ~2-5 seconds (embedding computation)
  • Subsequent searches: ~2-5 seconds each (re-compute embeddings)
With workspace (cached embeddings):
  • First search: ~2-5 seconds (builds index)
  • Subsequent searches: ~0.1-0.5 seconds (cached)
  • Large codebases: IVF_PQ indexing for scalability
Comparison:
  • ripgrep: 0.01-0.1 seconds (fastest, exact match)
  • ast-grep: 0.1-0.5 seconds (fast, structural)
  • semtools (cached): 0.1-0.5 seconds (fast, semantic)
  • semtools (uncached): 2-5 seconds (slower, semantic)
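
Using the ranges quoted above, the payoff from a workspace can be estimated with simple arithmetic. The sketch assumes every uncached search costs the same 2-5 seconds and every cached search 0.1-0.5 seconds, which is a simplification; `estimate_total_seconds` is a hypothetical helper.

```python
from typing import Tuple

UNCACHED = (2.0, 5.0)  # seconds per search without a workspace (from the list above)
CACHED = (0.1, 0.5)    # seconds per search once embeddings are cached


def estimate_total_seconds(n_searches: int, use_workspace: bool) -> Tuple[float, float]:
    """Return (best case, worst case) total time for n searches of one codebase."""
    if n_searches < 1:
        raise ValueError("n_searches must be at least 1")
    if not use_workspace:
        return (n_searches * UNCACHED[0], n_searches * UNCACHED[1])
    repeats = n_searches - 1  # the first search still pays the indexing cost
    return (UNCACHED[0] + repeats * CACHED[0], UNCACHED[1] + repeats * CACHED[1])


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, estimate_total_seconds(n, False), estimate_total_seconds(n, True))
```

For ten searches this works out to roughly 20-50 seconds without a workspace versus about 2.9-9.5 seconds with one; for a single search there is no difference.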

Optimization Tips

```bash
# 1. Use workspaces for repeated searches
export SEMTOOLS_WORKSPACE=my-project

# 2. Limit search scope to relevant directories
search "query" src/ --not tests/

# 3. Use --top-k to control result count
search "query" --top-k 5

# 4. Pipe to head for quick preview
search "query" | head -50
```

Unix Pipeline Integration

Semtools is designed for Unix-style composition:

```bash
# Find and parse PDFs, then search
find docs/ -name "*.pdf" | xargs parse | xargs search "topic"

# Search and filter with grep
search "authentication" src/ | grep -i "jwt"

# Count matches
search "error handling" src/ | grep "Match" | wc -l

# Combine with other tools
search "API" src/ | xargs -I {} rg -l "REST" {}
```

Limitations

When NOT to Use Semtools

  1. Exact keyword search: Use ripgrep for known keywords
    ```bash
    # WRONG TOOL: Semantic search for exact function name
    search "authenticate_user"

    # RIGHT TOOL: Use ripgrep for exact matches
    rg "authenticate_user" --type py
    ```
  2. Structural code patterns: Use ast-grep for syntax matching
    ```bash
    # WRONG TOOL: Semantic search for code structure
    search "class with constructor"

    # RIGHT TOOL: Use ast-grep for structure
    sg --pattern 'class $NAME { constructor($$$) { $$$ } }'
    ```
  3. Symbol references: Use serena for LSP-based tracking
    ```bash
    # WRONG TOOL: Semantic search for all usages
    search "MyClass usage"

    # RIGHT TOOL: Use serena for precise references
    serena find_referencing_symbols --name 'MyClass'
    ```
  4. Small codebases: Overhead not worth it for <100 files
    • ripgrep is faster and simpler for small projects

Known Edge Cases

  • Ambiguous queries: Vague concepts return broad results
  • Technical jargon: Domain-specific terms may have lower accuracy
  • Short code snippets: Limited context reduces embedding quality
  • Mixed languages: Embeddings tuned for English (multilingual model used)
  • Generated code: Repetitive patterns may cluster together

Troubleshooting

No Semantic Matches Found

If semantic search returns zero results:
  1. Verify files exist: Use ripgrep to confirm content
    ```bash
    rg "concept" src/
    ```
  2. Increase the distance threshold: Be more lenient
    ```bash
    search "query" --max-distance 0.5
    ```
  3. Rephrase query: Try different terminology
    ```bash
    search "user authentication"
    search "verify user credentials"
    search "login validation"
    ```
  4. Check file types: Ensure searching correct extensions
    ```bash
    search "query" src/*.py  # Target specific types
    ```
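
Steps 2 and 3 above can be combined into a simple retry loop: try each phrasing at the default threshold, then loosen the threshold step by step. The sketch below takes the search function as a parameter, so it makes no assumption about how Semtools is invoked; `search_with_fallback` and the threshold steps are illustrative choices.

```python
from typing import Callable, List, Optional, Sequence, Tuple

SearchFn = Callable[[str, float], List[str]]  # (query, max_distance) -> matches


def search_with_fallback(
    run_search: SearchFn,
    queries: Sequence[str],
    thresholds: Sequence[float] = (0.3, 0.4, 0.5),
) -> Optional[Tuple[str, float, List[str]]]:
    """Return the first (query, threshold, matches) that yields results, else None."""
    for max_distance in thresholds:  # strictest threshold first
        for query in queries:        # then each rephrasing
            matches = run_search(query, max_distance)
            if matches:
                return query, max_distance, matches
    return None


if __name__ == "__main__":
    # Stand-in for a real search call: only one phrasing matches, and only at 0.4+.
    def fake_search(query: str, max_distance: float) -> List[str]:
        if query == "login validation" and max_distance >= 0.4:
            return ["src/auth/handlers.py"]
        return []

    print(search_with_fallback(
        fake_search,
        ["user authentication", "verify user credentials", "login validation"],
    ))
```

Trying every phrasing before loosening the threshold keeps the results as strict as possible; swapping the two loops would favour the first phrasing instead.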

Too Many Irrelevant Results

If semantic search returns too much noise:
  1. Decrease the distance threshold: Be more strict
    ```bash
    search "query" --max-distance 0.2
    ```
  2. Limit result count: Review top matches only
    ```bash
    search "query" --top-k 3
    ```
  3. Narrow directory scope: Search specific paths
    ```bash
    search "query" src/specific_module/
    ```
  4. Refine query: Add more specific concepts
    ```bash
    # Vague
    search "data"

    # Specific
    search "data validation with regex patterns"
    ```

Document Parsing Fails

If `parse` fails:
  1. Verify API key is set:
    ```bash
    echo $LLAMA_CLOUD_API_KEY
    ```
  2. Check file format: Ensure supported format (PDF, DOCX, PPTX)
    ```bash
    file document.pdf  # Verify file type
    ```
  3. Check file size: Large files may time out
    ```bash
    du -h document.pdf  # Check size
    ```
  4. Review parse config: Adjust timeouts if needed
    ```bash
    cat ~/.parse_config.json
    ```

Workspace Issues

If workspace commands fail:

```bash
# Check workspace status
workspace status

# Prune corrupted workspaces
workspace prune

# Recreate workspace
rm -rf ~/.semtools/workspaces/my-workspace
export SEMTOOLS_WORKSPACE=my-workspace
```

Resources
