grepai-chunking

GrepAI Chunking Configuration

This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.

When to Use This Skill

Optimizing search accuracy
Adjusting for code style (verbose vs. concise)
Troubleshooting search results
Understanding how indexing works

What is Chunking?

Chunking is the process of splitting source files into smaller segments for embedding:

┌─────────────────────────────────────┐
│         Large Source File           │
│         (1000+ tokens)              │
└─────────────────────────────────────┘
                  ↓
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │
│ ~512    │ │ ~512    │ │ ~512    │
│ tokens  │ │ tokens  │ │ tokens  │
└─────────┘ └─────────┘ └─────────┘
                  ↓
          Each chunk gets
          its own embedding

Why Chunking Matters

Embedding models work best within a limited input size, so chunk size is a trade-off between precision and context:

Too large chunks: Less precise search results
Too small chunks: Lost context, fragmented results
Just right: Good balance of precision and context

Configuration

Basic Settings

yaml

# .grepai/config.yaml
chunking:
  size: 512      # Tokens per chunk
  overlap: 50    # Overlap between chunks
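
The two settings only make sense together when the overlap is smaller than the chunk size, since otherwise a sliding window would never advance. The following is a minimal sketch of that constraint; `ChunkingConfig` and its `stride` property are hypothetical names used for illustration, and the source does not say whether GrepAI performs this check itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical mirror of the `chunking:` section (not GrepAI's own type)."""

    size: int = 512    # target tokens per chunk
    overlap: int = 50  # tokens shared between adjacent chunks

    def __post_init__(self):
        if self.size <= 0:
            raise ValueError("size must be positive")
        if not 0 <= self.overlap < self.size:
            raise ValueError("overlap must be >= 0 and smaller than size")

    @property
    def stride(self) -> int:
        """How many new tokens each successive chunk covers."""
        return self.size - self.overlap


print(ChunkingConfig().stride)  # 462
```

With the defaults, each chunk after the first contributes 462 new tokens and repeats 50 from its predecessor.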

Understanding Parameters

Chunk Size

The target number of tokens per chunk.

Size	Effect
256	More precise, less context
512	Balanced (default)
1024	More context, less precise

Overlap

Tokens shared between adjacent chunks. Preserves context at boundaries.

Overlap	Effect
0	No overlap, may lose context at boundaries
50	Standard overlap (default)
100	More context, larger index

Visualization

With size=512 and overlap=50:

File: auth.go (1000 tokens)

Chunk 1: tokens 1-512
         ┌────────────────────────────────────┐
         │ func Login(user, pass)...          │
         └────────────────────────────────────┘
                                    ↘
                              50 token overlap
                                    ↙
Chunk 2: tokens 463-974
         ┌────────────────────────────────────┐
         │ ...validate credentials...         │
         └────────────────────────────────────┘
                                    ↘
                              50 token overlap
                                    ↙
Chunk 3: tokens 925-1000
         ┌──────────────┐
         │ ...return    │
         └──────────────┘
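
The token ranges in the diagram follow from a plain sliding window: each chunk starts `size - overlap` tokens after the previous one. Here is a small sketch that reproduces them, assuming fixed-size windows with no boundary adjustment; `chunk_ranges` is an illustrative helper, not a GrepAI function.

```python
def chunk_ranges(total_tokens, size=512, overlap=50):
    """Return 1-based inclusive (start, end) token ranges for a sliding window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if total_tokens <= 0:
        return []
    stride = size - overlap  # distance between consecutive chunk starts
    ranges = []
    start = 1
    while True:
        end = min(start + size - 1, total_tokens)
        ranges.append((start, end))
        if end == total_tokens:  # the last chunk reached the end of the file
            break
        start += stride
    return ranges


print(chunk_ranges(1000))  # [(1, 512), (463, 974), (925, 1000)]
```

The last chunk is shorter than the target because it only covers what remains of the file.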

Recommended Settings by Language

Verbose Languages (Java, C#)

yaml

chunking:
  size: 768    # Larger to capture full methods
  overlap: 75

Concise Languages (Go, Python)

yaml

chunking:
  size: 512    # Standard size
  overlap: 50

Very Concise (Rust, Zig)

yaml

chunking:
  size: 384    # Smaller for precise results
  overlap: 40

Recommended Settings by Codebase

Small Functions (Microservices)

yaml

chunking:
  size: 384    # Capture individual functions
  overlap: 40

Large Classes (Monolith)

yaml

chunking:
  size: 768    # Capture more context
  overlap: 100

Mixed Codebase

yaml

chunking:
  size: 512    # Balanced default
  overlap: 50

How Tokens are Counted

GrepAI uses approximate token counting:

~4 characters = 1 token (for English text)
Code varies based on identifiers and syntax

Example:

func calculateTotal(items []Item) float64 {
    total := 0.0
    for _, item := range items {
        total += item.Price * float64(item.Quantity)
    }
    return total
}

≈ 45 tokens
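
The 4-characters-per-token rule of thumb can be written as a one-line estimator. This is a sketch of the heuristic only: `estimate_tokens` is a hypothetical helper, GrepAI's actual counter is not shown in this document, and a real tokenizer may give a noticeably different number for code.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text; code often differs


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


signature = "func calculateTotal(items []Item) float64 {"
print(estimate_tokens(signature))
```

Because identifiers, punctuation, and indentation tokenize differently from prose, treat the result as a sizing aid rather than an exact count.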

Impact on Index Size

A larger overlap or a smaller chunk size produces more chunks, and therefore a larger index:

Size	Overlap	Chunks per 10K tokens	Index Impact
512	0	~20	Smallest
512	50	~22	Standard
512	100	~24	+10%
256	50	~44	+100%
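
The chunk counts above can be approximated with a fixed-stride model, where every chunk after the first adds `size - overlap` new tokens. The sketch below assumes exactly that and ignores boundary-aware splitting; `estimate_chunks` is an illustrative helper, not part of GrepAI.

```python
import math


def estimate_chunks(total_tokens, size, overlap):
    """Estimate chunk count for a fixed-stride sliding window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= size:
        return 1
    stride = size - overlap
    # One full first chunk, then one chunk per stride over the remainder.
    return math.ceil((total_tokens - size) / stride) + 1


for size, overlap in [(512, 0), (512, 50), (512, 100), (256, 50)]:
    print(size, overlap, estimate_chunks(10_000, size, overlap))
# 512 0 20
# 512 50 22
# 512 100 25
# 256 50 49
```

This model lands slightly above the table for the last two rows (25 and 49 rather than ~24 and ~44), so both sets of figures are best read as rough estimates.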

Impact on Search Quality

Too Small Chunks (size: 128)

Query: "authentication middleware"

Result: "...c.AbortWithStatus(401)..."
        (Fragment, missing context)

Just Right (size: 512)

Query: "authentication middleware"

Result: "func AuthMiddleware() gin.HandlerFunc {
            return func(c *gin.Context) {
                token := c.GetHeader("Authorization")
                if token == "" {
                    c.AbortWithStatus(401)
                    return
                }
                // validate token...
            }
        }"
        (Complete function with context)

Too Large Chunks (size: 2048)

Query: "authentication middleware"

Result: "// Multiple unrelated functions...
        func AuthMiddleware()... (your match)
        func LoggingMiddleware()...
        func CORSMiddleware()..."
        (Too much noise)

Experimentation

Testing Different Settings

Try smaller chunks for more precise results:

yaml

chunking:
  size: 384
  overlap: 40

Re-index:

bash

rm .grepai/index.gob
grepai watch

Test with searches:

bash

grepai search "your query"

Adjust and repeat until satisfied.

Comparing Results

Before changing settings, save a search result:

bash

grepai search "authentication" > before.txt

After changing settings and re-indexing:

bash

grepai search "authentication" > after.txt
diff before.txt after.txt
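
If `diff` is not available, the same comparison can be done with Python's standard `difflib` module. This is a minimal sketch that takes the two saved outputs as strings; `diff_results` is a hypothetical helper name, and the file names are simply the ones used above.

```python
import difflib


def diff_results(before: str, after: str) -> list[str]:
    """Return unified-diff lines between two saved search outputs."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before.txt",
            tofile="after.txt",
            lineterm="",
        )
    )


# Usage sketch: read the two files saved by the commands above.
# with open("before.txt") as b, open("after.txt") as a:
#     print("\n".join(diff_results(b.read(), a.read())))
```

An empty list means the two settings returned identical output for that query.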

Chunk Boundaries

GrepAI tries to split at logical boundaries:

Empty lines (function/class boundaries)
Closing braces
Statement ends

This means actual chunk sizes may vary slightly from the target.
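
To see why boundary-aware splitting makes chunk sizes drift from the target, consider this simplified sketch. It is not GrepAI's actual algorithm: it treats only blank lines and lone closing braces as boundaries, ignores overlap, uses the 4-characters-per-token heuristic, and the 80% / 120% thresholds are arbitrary assumptions chosen for illustration.

```python
def split_at_boundaries(lines, target_tokens=512, chars_per_token=4):
    """Group lines into chunks, preferring cuts at blank lines or closing braces."""
    chunks, current, current_tokens = [], [], 0.0
    for line in lines:
        current.append(line)
        current_tokens += len(line) / chars_per_token
        at_boundary = line.strip() in ("", "}")
        nearly_full = current_tokens >= 0.8 * target_tokens
        too_full = current_tokens >= 1.2 * target_tokens
        # Cut early at a logical boundary, or force a cut if none shows up.
        if (at_boundary and nearly_full) or too_full:
            chunks.append(current)
            current, current_tokens = [], 0.0
    if current:  # keep whatever is left as the final chunk
        chunks.append(current)
    return chunks
```

A chunk may close a little before the target when a boundary appears, or a little after it when none does, which is the variation described above.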

Best Practices

Start with defaults: 512/50 works well for most codebases
Adjust based on code style: Verbose = larger, concise = smaller
Test with real queries: See what your searches return
Re-index after changes: Must regenerate embeddings
Consider overlap: Don't set to 0 unless index size is critical

Common Issues

❌ Problem: Search results are too fragmented
✅ Solution: Increase chunk size:

yaml

chunking:
  size: 768

❌ Problem: Search results have too much irrelevant context
✅ Solution: Decrease chunk size:

yaml

chunking:
  size: 384

❌ Problem: Results miss related code at function boundaries
✅ Solution: Increase overlap:

yaml

chunking:
  overlap: 100

❌ Problem: Index is too large
✅ Solutions:

Decrease overlap
Increase chunk size
Add more ignore patterns

Output Format

Chunking status:

✅ Chunking Configuration

   Size: 512 tokens
   Overlap: 50 tokens

   Index Statistics:
   - Total files: 245
   - Total chunks: 1,234
   - Avg chunks/file: 5.0
   - Avg chunk size: 478 tokens

   Recommendations:
   - Current settings are balanced
   - Consider size: 384 for more precise results
   - Consider size: 768 for more context
