Chunking Strategies
Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers:
- Implementing document chunking for RAG
- Optimizing chunk size and overlap
- Comparing different chunking strategies
- Benchmarking chunking performance
- Processing different document types (markdown, code, PDFs)
- Evaluating retrieval quality with different chunk strategies
Key Resources:
- scripts/chunk-fixed-size.py - Fixed-size chunking implementation
- scripts/chunk-semantic.py - Semantic chunking with paragraph preservation
- scripts/chunk-recursive.py - Recursive chunking for hierarchical documents
- scripts/benchmark-chunking.py - Benchmark and compare chunking strategies
- templates/chunking-config.yaml - Chunking configuration template
- templates/custom-splitter.py - Template for custom chunking logic
- examples/chunk-markdown.py - Markdown-specific chunking
- examples/chunk-code.py - Source code chunking
- examples/chunk-pdf.py - PDF document chunking
Chunking Strategy Overview

Strategy Selection Guide
Fixed-Size Chunking:
- Best for: Uniform documents, simple content, consistent structure
- Pros: Fast, predictable, simple implementation
- Cons: May split semantic units, no context awareness
- Use when: Speed matters more than semantic coherence
Semantic Chunking:
- Best for: Natural language documents, articles, books
- Pros: Preserves semantic boundaries, better context
- Cons: Slower, variable chunk sizes
- Use when: Content has clear paragraph/section structure
Recursive Chunking:
- Best for: Hierarchical documents, technical docs, code
- Pros: Preserves structure, handles nested content
- Cons: Most complex, requires structure detection
- Use when: Documents have clear hierarchical organization
Sentence-Based Chunking:
- Best for: Q&A pairs, chatbots, precise retrieval
- Pros: Natural boundaries, good for citations
- Cons: Small chunks may lack context
- Use when: Need precise attribution and citations
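The selection guide above can be condensed into a small dispatch table; a minimal sketch, where the content-type labels and the semantic default are illustrative assumptions rather than anything defined by the scripts:

```python
# Hypothetical lookup encoding the strategy selection guide.
# The content-type labels are illustrative, not a fixed taxonomy.
STRATEGY_BY_CONTENT = {
    "uniform": "fixed_size",      # speed over semantic coherence
    "article": "semantic",        # clear paragraph/section structure
    "hierarchical": "recursive",  # technical docs, nested content
    "qa": "sentence",             # precise attribution and citations
}

def select_strategy(content_type: str) -> str:
    """Map a coarse content type to a chunking strategy name."""
    # Semantic is a reasonable default per the guide's recommendations.
    return STRATEGY_BY_CONTENT.get(content_type, "semantic")
```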
Implementation Scripts
1. Fixed-Size Chunking
Script: scripts/chunk-fixed-size.py

Usage:
```bash
python scripts/chunk-fixed-size.py \
  --input document.txt \
  --chunk-size 1000 \
  --overlap 200 \
  --output chunks.json
```

Parameters:
- chunk-size: Number of characters per chunk (default: 1000)
- overlap: Character overlap between chunks (default: 200)
- split-on: Split on sentences, words, or characters (default: sentences)
Best Practices:
- Use 500-1000 character chunks for most RAG applications
- Set overlap to 10-20% of chunk size
- Split on sentences for better coherence
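The windowing arithmetic behind fixed-size chunking can be sketched in a few lines (a character-level simplification; the actual chunk-fixed-size.py also supports sentence and word boundaries):

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into fixed-size character windows with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # stride between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"text": text[start:start + chunk_size], "start_char": start})
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = fixed_size_chunks("abcdefghij" * 250, chunk_size=1000, overlap=200)
# 2500 chars -> 3 chunks; consecutive chunks share 200 characters.
```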
2. Semantic Chunking
Script: scripts/chunk-semantic.py

Usage:
```bash
python scripts/chunk-semantic.py \
  --input document.txt \
  --max-chunk-size 1500 \
  --output chunks.json
```

How it works:
- Detects natural boundaries (paragraphs, headings, line breaks)
- Groups content while respecting max chunk size
- Preserves semantic units (paragraphs stay together)
- Adds context headers for nested sections
Best for: Articles, blog posts, documentation, books
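The grouping step can be sketched as a greedy packer over paragraphs (a simplification of chunk-semantic.py that ignores heading detection and context headers):

```python
def semantic_chunks(text: str, max_chunk_size: int = 1500):
    """Group whole paragraphs until the size budget is reached."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)  # flush: the paragraph stays whole
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single paragraph larger than the budget becomes its own
    # oversized chunk; it is never split mid-paragraph.
    return chunks
```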
3. Recursive Chunking
Script: scripts/chunk-recursive.py

Usage:
```bash
python scripts/chunk-recursive.py \
  --input document.md \
  --chunk-size 1000 \
  --separators '["\n## ", "\n### ", "\n\n", "\n", " "]' \
  --output chunks.json
```

How it works:
- Tries to split on the first separator (e.g., ## headings)
- If chunks are still too large, recursively splits on the next separator
- Continues until all chunks are within the size limit
- Preserves hierarchical context

Separator hierarchy examples:
- Markdown: `["\n## ", "\n### ", "\n\n", "\n", " "]`
- Python: `["\nclass ", "\ndef ", "\n\n", "\n", " "]`
- General: `["\n\n", "\n", ". ", " "]`
Best for: Structured documents, source code, technical manuals
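The descent through the separator list can be sketched as follows (a minimal version that, unlike production splitters, does not merge small neighboring pieces back together):

```python
def recursive_split(text, separators, chunk_size=1000):
    """Split on the first separator; recurse into oversized pieces with the rest."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator so structural markers like "## " survive.
    pieces = [parts[0]] + [sep + p for p in parts[1:]]
    pieces = [p for p in pieces if p.strip()]
    chunks = []
    for piece in pieces:
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, rest, chunk_size))
        else:
            chunks.append(piece)
    return chunks

doc = "\n## Intro\n" + "alpha " * 60 + "\n## Usage\n" + "beta " * 60
sections = recursive_split(doc, ["\n## ", "\n\n", "\n", " "], chunk_size=500)
# Each "## " section comes back as its own chunk when it fits the budget.
```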
4. Benchmark Chunking Strategies
Script: scripts/benchmark-chunking.py

Usage:
```bash
python scripts/benchmark-chunking.py \
  --input document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500 \
  --output benchmark-results.json
```

Metrics Evaluated:
- Processing time: Speed of chunking
- Chunk count: Total chunks generated
- Chunk size variance: Consistency of chunk sizes
- Context preservation: Semantic unit integrity (scored)
- Retrieval quality: Simulated query performance
Output:
```json
{
  "fixed-1000": {
    "time_ms": 45,
    "chunk_count": 127,
    "avg_size": 982,
    "size_variance": 12.3,
    "context_score": 0.72
  },
  "semantic-1000": {
    "time_ms": 156,
    "chunk_count": 114,
    "avg_size": 1087,
    "size_variance": 234.5,
    "context_score": 0.91
  }
}
```

Configuration Template
Template: templates/chunking-config.yaml

Complete configuration:
```yaml
chunking:
  # Global defaults
  default_strategy: semantic
  default_chunk_size: 1000
  default_overlap: 200

  # Strategy-specific configs
  strategies:
    fixed_size:
      chunk_size: 1000
      overlap: 200
      split_on: sentence  # sentence, word, character

    semantic:
      max_chunk_size: 1500
      min_chunk_size: 200
      preserve_paragraphs: true
      add_headers: true  # Include section headers

    recursive:
      chunk_size: 1000
      overlap: 100
      separators:
        markdown: ["\n## ", "\n### ", "\n\n", "\n", " "]
        code: ["\nclass ", "\ndef ", "\n\n", "\n", " "]
        text: ["\n\n", ". ", " "]

  # Document type mappings
  document_types:
    ".md": semantic
    ".py": recursive
    ".txt": fixed_size
    ".pdf": semantic
```

Custom Splitter Template
Template: templates/custom-splitter.py

Create your own chunking logic:
```python
from typing import Dict, List


class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        metadata = metadata or {}
        chunks = []
        # Your custom chunking logic here
        # Example: split on a custom pattern
        sections = self._split_sections(text)
        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })
        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic
        raise NotImplementedError
```

Document-Specific Examples
Markdown Chunking
Example: examples/chunk-markdown.py

Features:
- Preserves heading hierarchy
- Keeps code blocks together
- Maintains list structure
- Adds parent section context to chunks

Usage:
```bash
python examples/chunk-markdown.py README.md --output readme-chunks.json
```

Code Chunking
Example: examples/chunk-code.py

Features:
- Splits on class and function boundaries
- Preserves complete functions
- Includes docstrings with implementations
- Language-aware separator selection

Supported languages: Python, JavaScript, TypeScript, Java, Go

Usage:
```bash
python examples/chunk-code.py src/main.py --language python --output code-chunks.json
```

PDF Chunking
Example: examples/chunk-pdf.py

Features:
- Extracts text from PDF
- Preserves page boundaries
- Maintains formatting clues
- Handles multi-column layouts

Dependencies: pypdf, pdfminer.six

Usage:
```bash
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json
```

Optimization Guidelines
Chunk Size Selection
General recommendations:

| Content Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Q&A / FAQs | 200-400 | 50 | Sentence |
| Articles | 500-1000 | 100-200 | Semantic |
| Documentation | 1000-1500 | 200-300 | Recursive |
| Books | 1000-2000 | 300-400 | Semantic |
| Source code | 500-1000 | 100 | Recursive |

Test with your data: Use benchmark-chunking.py to find optimal settings.

Overlap Strategies
Why overlap matters:
- Prevents information loss at boundaries
- Improves retrieval of cross-boundary information
- Balances redundancy vs. coverage
Overlap guidelines:
- 10-15%: Minimal overlap for speed
- 15-20%: Standard overlap for most use cases
- 20-30%: High overlap for critical applications
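As quick arithmetic, the percentage guideline translates to characters like this (an illustrative helper only; the scripts themselves take absolute --overlap values):

```python
def overlap_chars(chunk_size: int, pct: float = 0.15) -> int:
    """Overlap in characters, following the 10-30% guideline above."""
    if not 0.10 <= pct <= 0.30:
        raise ValueError("guideline range is 10-30% of chunk size")
    return round(chunk_size * pct)

overlap_chars(1000)        # standard overlap: 150 characters
overlap_chars(1000, 0.20)  # high end of standard: 200 characters
```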
Performance Optimization

Fast chunking (large documents):
```bash
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
```

Quality chunking (smaller documents):
```bash
# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
```

Batch processing:
```bash
# Process multiple files
for file in documents/*.txt; do
  python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename $file .txt).json"
done
```

Evaluation Workflow
Step 1: Benchmark Strategies

```bash
python scripts/benchmark-chunking.py \
  --input sample-document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500
```

Step 2: Analyze Results
Review metrics:
- Processing time (prefer < 100ms per document)
- Context preservation score (target > 0.85)
- Chunk size variance (lower is more predictable)
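The thresholds above can be applied mechanically to the benchmark output; a sketch, with the results dict mirroring the JSON format shown earlier (only the fields used here are included):

```python
# Mirrors the benchmark-results.json structure from the earlier example.
results = {
    "fixed-1000": {"time_ms": 45, "context_score": 0.72},
    "semantic-1000": {"time_ms": 156, "context_score": 0.91},
}

def pick_strategy(results, max_ms=100, min_context=0.85):
    """Prefer configs meeting both thresholds; else best context score overall."""
    passing = {name: m for name, m in results.items()
               if m["time_ms"] < max_ms and m["context_score"] > min_context}
    pool = passing or results
    return max(pool, key=lambda name: pool[name]["context_score"])
```

In this sample neither config passes both thresholds, so the fallback picks semantic-1000 on context score alone; in practice you would widen the chunk-size sweep until something passes.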
Step 3: A/B Test Retrieval
Compare retrieval quality:
- Chunk same corpus with different strategies
- Run identical test queries against each
- Measure precision@k and recall@k
- Select strategy with best retrieval metrics
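precision@k and recall@k for a single query can be computed directly; a minimal sketch (the chunk IDs are made up for illustration):

```python
def precision_recall_at_k(retrieved, relevant, k=5):
    """precision@k and recall@k for one query's ranked retrieval results."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall_at_k(["c3", "c7", "c1", "c9", "c2"], {"c1", "c3", "c8"}, k=5)
# 2 of the top 5 are relevant: precision 2/5, recall 2/3
```

Average these over your full query set before comparing strategies.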
Step 4: Production Deployment
Use configuration file:
```python
import yaml

from chunking_strategies import get_chunker

with open('chunking-config.yaml') as f:
    config = yaml.safe_load(f)

chunker = get_chunker(config['chunking']['default_strategy'], config)
chunks = chunker.chunk(document_text)  # document_text: raw text loaded elsewhere
```

Common Issues & Solutions
Issue: Chunks too small/large
- Adjust the chunk_size parameter
- Check document structure (may need a different strategy)
- Verify separator selection for recursive chunking
Issue: Lost context at boundaries
- Increase overlap
- Switch to semantic chunking
- Add parent context to metadata
Issue: Slow processing
- Use fixed-size chunking for large batches
- Reduce overlap
- Process files in parallel
Issue: Poor retrieval quality
- Benchmark different strategies
- Increase chunk size
- Try hybrid approach (semantic + fixed fallback)
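The hybrid fallback mentioned above can be sketched as semantic-first splitting with a hard fixed-size cut for oversized paragraphs (a simplified illustration, not the packaged scripts):

```python
def hybrid_chunks(text, max_chunk_size=1500, fixed_size=1000, overlap=100):
    """Semantic-first chunking with a fixed-size fallback for oversized paragraphs."""
    chunks = []
    for para in (p for p in text.split("\n\n") if p.strip()):
        if len(para) <= max_chunk_size:
            chunks.append(para)
        else:
            # Fallback: hard-split paragraphs that exceed the budget.
            step = fixed_size - overlap
            chunks.extend(para[i:i + fixed_size] for i in range(0, len(para), step))
    return chunks
```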
Dependencies
Core libraries:
```bash
pip install tiktoken  # Token counting
pip install nltk      # Sentence splitting
pip install spacy     # Advanced NLP (optional)
```

For PDF support:
```bash
pip install pypdf pdfminer.six
```

For benchmarking:
```bash
pip install pandas numpy scikit-learn
```

Best Practices Summary
- Start with semantic chunking for most documents
- Use recursive chunking for structured/hierarchical content
- Benchmark on your actual data before production
- Set overlap to 15-20% of chunk size
- Include metadata (source, page, section) in chunks
- Test retrieval quality, not just chunking speed
- Use appropriate chunk size for your embedding model token limit
- Document your chunking strategy choice and parameters
Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom
Output Format: JSON with text and metadata
Version: 1.0.0