grepai-chunking

GrepAI Chunking Configuration

This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.

When to Use This Skill

  • Optimizing search accuracy
  • Adjusting for code style (verbose vs. concise)
  • Troubleshooting search results
  • Understanding how indexing works

What is Chunking?

Chunking is the process of splitting source files into smaller segments for embedding:
┌─────────────────────────────────────┐
│         Large Source File           │
│         (1000+ tokens)              │
└─────────────────────────────────────┘
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ ~512    │ │ ~512    │ │ ~512    │
│ tokens  │ │ tokens  │ │ tokens  │
└─────────┘ └─────────┘ └─────────┘
          Each chunk gets
          its own embedding

Why Chunking Matters

Embedding models have optimal input sizes:
  • Too large chunks: Less precise search results
  • Too small chunks: Lost context, fragmented results
  • Just right: Good balance of precision and context

Configuration

Basic Settings

yaml
# .grepai/config.yaml
chunking:
  size: 512      # Tokens per chunk
  overlap: 50    # Overlap between chunks

Understanding Parameters

Chunk Size

The target number of tokens per chunk.
| Size | Effect |
| 256 | More precise, less context |
| 512 | Balanced (default) |
| 1024 | More context, less precise |

Overlap

Tokens shared between adjacent chunks. Preserves context at boundaries.
| Overlap | Effect |
| 0 | No overlap, may lose context at boundaries |
| 50 | Standard overlap (default) |
| 100 | More context, larger index |
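
The two parameters constrain each other: with a sliding window, the overlap has to be smaller than the chunk size, or the window never moves forward. The sketch below shows one way to sanity-check a pair of values before re-indexing. The `ChunkingConfig` type and its `Validate` method are hypothetical names used for illustration; whether GrepAI performs an equivalent check is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ChunkingConfig mirrors the two keys under `chunking:` in .grepai/config.yaml.
// Hypothetical type for illustration, not GrepAI's own.
type ChunkingConfig struct {
	Size    int // target tokens per chunk
	Overlap int // tokens shared between adjacent chunks
}

// Validate rejects settings under which a sliding window cannot advance.
func (c ChunkingConfig) Validate() error {
	if c.Size <= 0 {
		return errors.New("size must be positive")
	}
	if c.Overlap < 0 {
		return errors.New("overlap must not be negative")
	}
	if c.Overlap >= c.Size {
		return errors.New("overlap must be smaller than size")
	}
	return nil
}

func main() {
	configs := []ChunkingConfig{
		{Size: 512, Overlap: 50},  // the defaults: valid
		{Size: 256, Overlap: 256}, // window would never advance: rejected
	}
	for _, c := range configs {
		fmt.Printf("%+v -> %v\n", c, c.Validate())
	}
}
```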

Visualization

With size=512 and overlap=50:
File: auth.go (1000 tokens)

Chunk 1: tokens 1-512
         ┌────────────────────────────────────┐
         │ func Login(user, pass)...          │
         └────────────────────────────────────┘
                              50 token overlap
Chunk 2: tokens 463-974
         ┌────────────────────────────────────┐
         │ ...validate credentials...         │
         └────────────────────────────────────┘
                              50 token overlap
Chunk 3: tokens 925-1000
         ┌──────────────┐
         │ ...return    │
         └──────────────┘
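
The token ranges in the diagram follow from a fixed stride of size minus overlap (512 - 50 = 462). As a minimal sketch, assuming a plain sliding window with no boundary adjustment, the ranges can be reproduced like this. `Span` and `ChunkSpans` are hypothetical names, not part of GrepAI.

```go
package main

import "fmt"

// Span is a 1-based, inclusive token range, matching the diagram's numbering.
type Span struct{ Start, End int }

// ChunkSpans returns fixed-stride chunk ranges for a file of `total` tokens.
// It assumes size > 0 and 0 <= overlap < size.
func ChunkSpans(total, size, overlap int) []Span {
	var spans []Span
	stride := size - overlap // how far each new chunk starts after the previous one
	for start := 1; start <= total; start += stride {
		end := start + size - 1
		if end > total {
			end = total // the last chunk is cut short at end of file
		}
		spans = append(spans, Span{start, end})
		if end == total {
			break // the file is fully covered
		}
	}
	return spans
}

func main() {
	for i, s := range ChunkSpans(1000, 512, 50) {
		fmt.Printf("Chunk %d: tokens %d-%d\n", i+1, s.Start, s.End)
	}
}
```

For a 1000-token file this yields 1-512, 463-974, and 925-1000, the same three ranges drawn above.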

Recommended Settings by Language

Verbose Languages (Java, C#)

yaml
chunking:
  size: 768    # Larger to capture full methods
  overlap: 75

Concise Languages (Go, Python)

yaml
chunking:
  size: 512    # Standard size
  overlap: 50

Very Concise (Rust, Zig)

yaml
chunking:
  size: 384    # Smaller for precise results
  overlap: 40

Recommended Settings by Codebase

Small Functions (Microservices)

yaml
chunking:
  size: 384    # Capture individual functions
  overlap: 40

Large Classes (Monolith)

yaml
chunking:
  size: 768    # Capture more context
  overlap: 100

Mixed Codebase

yaml
chunking:
  size: 512    # Balanced default
  overlap: 50

How Tokens are Counted

GrepAI uses approximate token counting:
  • ~4 characters = 1 token (for English text)
  • Code varies based on identifiers and syntax
Example:
go
func calculateTotal(items []Item) float64 {
    total := 0.0
    for _, item := range items {
        total += item.Price * float64(item.Quantity)
    }
    return total
}
≈ 45 tokens
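
To get a feel for the rule of thumb, the estimate can be sketched as a character count divided by four. `EstimateTokens` is a hypothetical helper written for this illustration; GrepAI's actual counter may treat identifiers, whitespace, and punctuation differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// EstimateTokens applies the ~4 characters per token rule of thumb,
// rounding up so that any non-empty input counts as at least one token.
// Hypothetical helper, not GrepAI's actual counter.
func EstimateTokens(s string) int {
	chars := utf8.RuneCountInString(s) // count characters, not bytes
	return (chars + 3) / 4
}

func main() {
	snippet := "total += item.Price * float64(item.Quantity)"
	fmt.Printf("%d characters, about %d tokens\n",
		utf8.RuneCountInString(snippet), EstimateTokens(snippet))
}
```

An estimate like this is only useful for judging whether a typical function in your codebase fits inside one chunk at a given `size`.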

Impact on Index Size

Larger overlap or smaller chunks = more chunks = larger index:
| Size | Overlap | Chunks per 10K tokens | Index Impact |
| 512 | 0 | ~20 | Smallest |
| 512 | 50 | ~22 | Standard |
| 512 | 100 | ~24 | +10% |
| 256 | 50 | ~49 | +120% |
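
Rough chunk counts of this kind can be estimated before re-indexing. The sketch below assumes a fixed stride of size minus overlap and ignores the boundary adjustments GrepAI makes, so real counts will differ somewhat. `ChunkCount` is a hypothetical helper, not a GrepAI function.

```go
package main

import "fmt"

// ChunkCount estimates how many chunks a fixed-stride splitter produces
// for `total` tokens. It assumes size > 0 and 0 <= overlap < size.
func ChunkCount(total, size, overlap int) int {
	if total <= 0 {
		return 0
	}
	if total <= size {
		return 1
	}
	stride := size - overlap
	// One full first chunk, then one more chunk per stride (rounded up)
	// for the tokens that remain.
	return 1 + (total-size+stride-1)/stride
}

func main() {
	settings := [][2]int{{512, 0}, {512, 50}, {512, 100}, {256, 50}}
	for _, s := range settings {
		fmt.Printf("size=%d overlap=%d -> %d chunks per 10K tokens\n",
			s[0], s[1], ChunkCount(10000, s[0], s[1]))
	}
}
```

Since each chunk gets its own embedding, the chunk count is a reasonable first-order proxy for index size.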

Impact on Search Quality

Too Small Chunks (size: 128)

Query: "authentication middleware"

Result: "...c.AbortWithStatus(401)..."
        (Fragment, missing context)

Just Right (size: 512)

Query: "authentication middleware"

Result: "func AuthMiddleware() gin.HandlerFunc {
            return func(c *gin.Context) {
                token := c.GetHeader("Authorization")
                if token == "" {
                    c.AbortWithStatus(401)
                    return
                }
                // validate token...
            }
        }"
        (Complete function with context)

Too Large Chunks (size: 2048)

Query: "authentication middleware"

Result: "// Multiple unrelated functions...
        func AuthMiddleware()... (your match)
        func LoggingMiddleware()...
        func CORSMiddleware()..."
        (Too much noise)

Experimentation

Testing Different Settings

  1. Try smaller chunks for more precise results:
yaml
chunking:
  size: 384
  overlap: 40
  2. Re-index:
bash
rm .grepai/index.gob
grepai watch
  3. Test with searches:
bash
grepai search "your query"
  4. Adjust and repeat until satisfied.

Comparing Results

Before changing settings, save a search result:
bash
grepai search "authentication" > before.txt
After changing settings and re-indexing:
bash
grepai search "authentication" > after.txt
diff before.txt after.txt

Chunk Boundaries

GrepAI tries to split at logical boundaries:
  1. Empty lines (function/class boundaries)
  2. Closing braces
  3. Statement ends
This means actual chunk sizes may vary slightly from the target.
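
One way to picture this preference order is as a score per line, with the splitter looking back a few lines from the target cut point for the best-scoring boundary. The sketch below is an illustration under that assumption, not GrepAI's implementation. `boundaryScore`, `BestSplit`, and the statement-end heuristic (a trailing `;` or `)`) are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// boundaryScore ranks how attractive it is to end a chunk on this line.
// Higher is better; the ordering follows the list above.
func boundaryScore(line string) int {
	t := strings.TrimSpace(line)
	switch {
	case t == "":
		return 3 // empty line
	case t == "}" || t == "};":
		return 2 // closing brace
	case strings.HasSuffix(t, ";") || strings.HasSuffix(t, ")"):
		return 1 // rough guess at a statement end
	default:
		return 0
	}
}

// BestSplit returns the index of the line to end the chunk on. It searches
// up to `window` lines ending at `target` and keeps the highest-scoring one,
// preferring the line closest to `target` on ties. Returns -1 for no lines.
func BestSplit(lines []string, target, window int) int {
	if target >= len(lines) {
		target = len(lines) - 1
	}
	best, bestScore := target, -1
	for i := target; i >= 0 && i > target-window; i-- {
		if s := boundaryScore(lines[i]); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	lines := []string{"func a() {", "    return 1", "}", "", "func b() {", "    x := 2"}
	// The target cut is line 5; the empty line at index 3 is the better boundary.
	fmt.Println(BestSplit(lines, 5, 4))
}
```

Moving the cut back from line 5 to line 3 shortens the chunk by two lines, which is the kind of small deviation from the target size described above.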

Best Practices

  1. Start with defaults: 512/50 works well for most codebases
  2. Adjust based on code style: Verbose = larger, concise = smaller
  3. Test with real queries: See what your searches return
  4. Re-index after changes: Must regenerate embeddings
  5. Consider overlap: Don't set to 0 unless index size is critical

Common Issues

Problem: Search results are too fragmented
✅ Solution: Increase chunk size:
yaml
chunking:
  size: 768

Problem: Search results have too much irrelevant context
✅ Solution: Decrease chunk size:
yaml
chunking:
  size: 384

Problem: Results miss related code at function boundaries
✅ Solution: Increase overlap:
yaml
chunking:
  overlap: 100

Problem: Index is too large
✅ Solutions:
  • Decrease overlap
  • Increase chunk size
  • Add more ignore patterns

Output Format

Chunking status:
✅ Chunking Configuration

   Size: 512 tokens
   Overlap: 50 tokens

   Index Statistics:
   - Total files: 245
   - Total chunks: 1,234
   - Avg chunks/file: 5.0
   - Avg chunk size: 478 tokens

   Recommendations:
   - Current settings are balanced
   - Consider size: 384 for more precise results
   - Consider size: 768 for more context