grepai-chunking
GrepAI Chunking Configuration
This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.
When to Use This Skill
- Optimizing search accuracy
- Adjusting for code style (verbose vs. concise)
- Troubleshooting search results
- Understanding how indexing works
What is Chunking?
Chunking is the process of splitting source files into smaller segments for embedding:
```
┌─────────────────────────────────────┐
│          Large Source File          │
│           (1000+ tokens)            │
└─────────────────────────────────────┘
                  ↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│  ~512   │ │  ~512   │ │  ~512   │
│ tokens  │ │ tokens  │ │ tokens  │
└─────────┘ └─────────┘ └─────────┘
                  ↓
           Each chunk gets
          its own embedding
```
Why Chunking Matters
Embedding models work best within a certain input size range, so chunk size directly affects search quality:
- Chunks too large: less precise search results
- Chunks too small: lost context, fragmented results
- Just right: a good balance of precision and context
Configuration
Basic Settings
```yaml
# .grepai/config.yaml
chunking:
  size: 512     # Tokens per chunk
  overlap: 50   # Overlap between chunks
```

Understanding Parameters
Chunk Size
The target number of tokens per chunk.
| Size | Effect |
|---|---|
| 256 | More precise, less context |
| 512 | Balanced (default) |
| 1024 | More context, less precise |
Overlap
Tokens shared between adjacent chunks. Preserves context at boundaries.
| Overlap | Effect |
|---|---|
| 0 | No overlap, may lose context at boundaries |
| 50 | Standard overlap (default) |
| 100 | More context, larger index |
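
The two parameters interact: each new chunk only advances by `size - overlap` tokens, so an overlap at or above the chunk size would never make progress. The following is a minimal Go sketch of a sanity check under that assumption. `ChunkingConfig` and `Validate` are hypothetical names that mirror the `chunking` keys; they are not GrepAI's actual types or validation rules.

```go
package main

import (
	"errors"
	"fmt"
)

// ChunkingConfig is a hypothetical mirror of the `chunking` section in
// .grepai/config.yaml; GrepAI's real types may differ.
type ChunkingConfig struct {
	Size    int // target tokens per chunk
	Overlap int // tokens shared between adjacent chunks
}

// Validate rejects settings under which a sliding window could not advance.
func (c ChunkingConfig) Validate() error {
	if c.Size <= 0 {
		return errors.New("chunking.size must be positive")
	}
	if c.Overlap < 0 {
		return errors.New("chunking.overlap must not be negative")
	}
	if c.Overlap >= c.Size {
		return errors.New("chunking.overlap must be smaller than chunking.size")
	}
	return nil
}

func main() {
	configs := []ChunkingConfig{
		{Size: 512, Overlap: 50},  // the defaults
		{Size: 256, Overlap: 256}, // window would never move forward
	}
	for _, cfg := range configs {
		fmt.Printf("%+v -> %v\n", cfg, cfg.Validate())
	}
}
```

Running a check like this before re-indexing can catch a mistyped value early, since a bad setting otherwise only shows up after a full re-index.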
Visualization
With size=512 and overlap=50:
```
File: auth.go (1000 tokens)

Chunk 1: tokens 1-512
┌────────────────────────────────────┐
│ func Login(user, pass)...          │
└────────────────────────────────────┘
                       ↘
                50 token overlap
                       ↙
Chunk 2: tokens 463-974
┌────────────────────────────────────┐
│ ...validate credentials...         │
└────────────────────────────────────┘
                       ↘
                50 token overlap
                       ↙
Chunk 3: tokens 925-1000
┌──────────────┐
│ ...return    │
└──────────────┘
```
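
The token ranges in the diagram follow from a plain sliding window. Here is a short Go sketch of that arithmetic, assuming fixed-size windows with no adjustment for logical boundaries. `Range` and `ChunkRanges` are illustrative names, not part of GrepAI.

```go
package main

import "fmt"

// Range is a 1-based, inclusive token span, matching the numbering in the diagram.
type Range struct{ Start, End int }

// ChunkRanges slides a window of `size` tokens over `total` tokens,
// advancing by size-overlap tokens per step. The last chunk is cut at `total`.
func ChunkRanges(total, size, overlap int) []Range {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // the window could not advance
	}
	var out []Range
	stride := size - overlap
	for start := 1; start <= total; start += stride {
		end := start + size - 1
		if end >= total {
			// Final chunk: stop here so no redundant trailing chunk is emitted.
			out = append(out, Range{start, total})
			break
		}
		out = append(out, Range{start, end})
	}
	return out
}

func main() {
	for i, r := range ChunkRanges(1000, 512, 50) {
		fmt.Printf("Chunk %d: tokens %d-%d\n", i+1, r.Start, r.End)
	}
}
```

With `size=512` and `overlap=50` the window advances 462 tokens at a time, which is why chunk 2 starts at token 463 and chunk 3 at token 925.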
Recommended Settings by Language
Verbose Languages (Java, C#)
```yaml
chunking:
  size: 768     # Larger to capture full methods
  overlap: 75
```

Concise Languages (Go, Python)
```yaml
chunking:
  size: 512     # Standard size
  overlap: 50
```

Very Concise (Rust, Zig)
```yaml
chunking:
  size: 384     # Smaller for precise results
  overlap: 40
```

Recommended Settings by Codebase
Small Functions (Microservices)
```yaml
chunking:
  size: 384     # Capture individual functions
  overlap: 40
```

Large Classes (Monolith)
```yaml
chunking:
  size: 768     # Capture more context
  overlap: 100
```

Mixed Codebase
```yaml
chunking:
  size: 512     # Balanced default
  overlap: 50
```

How Tokens are Counted
GrepAI uses approximate token counting:
- ~4 characters = 1 token (for English text)
- Code varies based on identifiers and syntax
Example:
```go
func calculateTotal(items []Item) float64 {
	total := 0.0
	for _, item := range items {
		total += item.Price * float64(item.Quantity)
	}
	return total
}
```

≈ 45 tokens
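
The characters-per-token rule of thumb can be written as a one-line estimate. This sketch only illustrates the approximation stated above; `EstimateTokens` is a hypothetical helper, and the counting GrepAI actually performs may differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// EstimateTokens applies the "~4 characters = 1 token" rule of thumb,
// rounding up. It is a heuristic, not a real tokenizer.
func EstimateTokens(s string) int {
	chars := utf8.RuneCountInString(s)
	return (chars + 3) / 4
}

func main() {
	src := "func add(a, b int) int { return a + b }"
	fmt.Printf("%d characters -> ~%d tokens\n",
		utf8.RuneCountInString(src), EstimateTokens(src))
}
```

An estimate like this is useful for judging whether typical functions in a codebase fit inside the configured chunk size, not for exact accounting.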
Impact on Index Size
A larger overlap or a smaller chunk size produces more chunks, and therefore a larger index:
| Size | Overlap | Chunks per 10K tokens | Index Impact (vs. 512/50) |
|---|---|---|---|
| 512 | 0 | ~20 | Smallest |
| 512 | 50 | ~22 | Standard |
| 512 | 100 | ~24 | +10% |
| 256 | 50 | ~44 | +100% |
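
The chunk counts can be approximated in closed form. The Go sketch below assumes a plain sliding window that advances by `size - overlap` tokens and ignores any adjustment for logical boundaries. `ChunkCount` is an illustrative helper, not a GrepAI function.

```go
package main

import "fmt"

// ChunkCount estimates how many chunks a sliding window produces for
// `total` tokens. It returns 0 for settings under which the window
// could not advance.
func ChunkCount(total, size, overlap int) int {
	if total <= 0 || size <= 0 || overlap < 0 || overlap >= size {
		return 0
	}
	if total <= size {
		return 1
	}
	stride := size - overlap
	// One chunk for the first window, plus one per stride needed to
	// cover the remaining tokens (rounded up).
	return 1 + (total-size+stride-1)/stride
}

func main() {
	base := ChunkCount(10000, 512, 50)
	for _, c := range [][2]int{{512, 0}, {512, 50}, {512, 100}, {256, 50}} {
		n := ChunkCount(10000, c[0], c[1])
		fmt.Printf("size=%d overlap=%d: %d chunks (%+.0f%% vs 512/50)\n",
			c[0], c[1], n, 100*float64(n-base)/float64(base))
	}
}
```

Under this model the first two rows come out at 20 and 22 chunks, matching the table. The last two rows come out at 25 and 49, somewhat above the table's rounded estimates, so treat both sets of numbers as approximations rather than exact index sizes.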
Impact on Search Quality
Too Small Chunks (size: 128)
```
Query: "authentication middleware"
Result: "...c.AbortWithStatus(401)..."
(Fragment, missing context)
```

Just Right (size: 512)
```
Query: "authentication middleware"
Result: "func AuthMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        token := c.GetHeader("Authorization")
        if token == "" {
            c.AbortWithStatus(401)
            return
        }
        // validate token...
    }
}"
(Complete function with context)
```

Too Large Chunks (size: 2048)
```
Query: "authentication middleware"
Result: "// Multiple unrelated functions...
func AuthMiddleware()... (your match)
func LoggingMiddleware()...
func CORSMiddleware()..."
(Too much noise)
```

Experimentation
Testing Different Settings
- Try smaller chunks for more precise results:

  ```yaml
  chunking:
    size: 384
    overlap: 40
  ```

- Re-index:

  ```bash
  rm .grepai/index.gob
  grepai watch
  ```

- Test with searches:

  ```bash
  grepai search "your query"
  ```

- Adjust and repeat until satisfied.
Comparing Results
Before changing settings, save a search result:

```bash
grepai search "authentication" > before.txt
```

After changing settings and re-indexing:

```bash
grepai search "authentication" > after.txt
diff before.txt after.txt
```

Chunk Boundaries
GrepAI tries to split at logical boundaries:
- Empty lines (function/class boundaries)
- Closing braces
- Statement ends
This means actual chunk sizes may vary slightly from the target.
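
One way to picture boundary-aware splitting is to look backwards from the target split point for the best nearby boundary. The Go sketch below works on lines rather than tokens to stay short. `boundaryRank`, `SnapSplit`, the ranking order, and the search window are all assumptions made for illustration; this is not GrepAI's actual algorithm.

```go
package main

import (
	"fmt"
	"strings"
)

// boundaryRank scores how suitable it is to end a chunk after this line.
// Higher is better; 0 means the line is not a logical boundary.
func boundaryRank(line string) int {
	t := strings.TrimSpace(line)
	switch {
	case t == "":
		return 3 // empty line: usually separates functions or classes
	case t == "}" || t == "};":
		return 2 // closing brace
	case strings.HasSuffix(t, ";"):
		return 1 // end of a statement
	default:
		return 0
	}
}

// SnapSplit returns how many lines to place in the chunk. It inspects up to
// `window` positions at or before `target` and picks the best-ranked one.
// Ties go to the position closest to target; with no boundary it returns target.
func SnapSplit(lines []string, target, window int) int {
	if target > len(lines) {
		target = len(lines)
	}
	if target < 0 {
		target = 0
	}
	best, bestRank := target, 0
	for n := target; n > 0 && n > target-window; n-- {
		if r := boundaryRank(lines[n-1]); r > bestRank {
			best, bestRank = n, r
		}
	}
	return best
}

func main() {
	lines := []string{
		"func a() {",
		"    x := 1",
		"}",
		"",
		"func b() {",
		"    return",
		"}",
	}
	n := SnapSplit(lines, 5, 4)
	fmt.Printf("target split after line 5, snapped to after line %d\n", n)
}
```

In this example the target would cut `func b()` in half, so the split moves back to the empty line between the two functions, leaving a slightly shorter chunk than requested.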
Best Practices
- Start with defaults: 512/50 works well for most codebases
- Adjust based on code style: Verbose = larger, concise = smaller
- Test with real queries: See what your searches return
- Re-index after changes: Embeddings must be regenerated before new settings take effect
- Consider overlap: Don't set to 0 unless index size is critical
Common Issues
❌ Problem: Search results are too fragmented
✅ Solution: Increase chunk size:

```yaml
chunking:
  size: 768
```

❌ Problem: Search results have too much irrelevant context
✅ Solution: Decrease chunk size:

```yaml
chunking:
  size: 384
```

❌ Problem: Results miss related code at function boundaries
✅ Solution: Increase overlap:

```yaml
chunking:
  overlap: 100
```

❌ Problem: Index is too large
✅ Solutions:
- Decrease overlap
- Increase chunk size
- Add more ignore patterns
Output Format
Chunking status:
```
✅ Chunking Configuration

Size: 512 tokens
Overlap: 50 tokens

Index Statistics:
- Total files: 245
- Total chunks: 1,234
- Avg chunks/file: 5.0
- Avg chunk size: 478 tokens

Recommendations:
- Current settings are balanced
- Consider size: 384 for more precise results
- Consider size: 768 for more context
```