
Chunking Strategies

Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers:
  • Implementing document chunking for RAG
  • Optimizing chunk size and overlap
  • Comparing different chunking strategies
  • Benchmarking chunking performance
  • Processing different document types (markdown, code, PDFs)
  • Evaluating retrieval quality with different chunk strategies
Key Resources:
  • scripts/chunk-fixed-size.py
    - Fixed-size chunking implementation
  • scripts/chunk-semantic.py
    - Semantic chunking with paragraph preservation
  • scripts/chunk-recursive.py
    - Recursive chunking for hierarchical documents
  • scripts/benchmark-chunking.py
    - Benchmark and compare chunking strategies
  • templates/chunking-config.yaml
    - Chunking configuration template
  • templates/custom-splitter.py
    - Template for custom chunking logic
  • examples/chunk-markdown.py
    - Markdown-specific chunking
  • examples/chunk-code.py
    - Source code chunking
  • examples/chunk-pdf.py
    - PDF document chunking

Chunking Strategy Overview

Strategy Selection Guide

Fixed-Size Chunking:
  • Best for: Uniform documents, simple content, consistent structure
  • Pros: Fast, predictable, simple implementation
  • Cons: May split semantic units, no context awareness
  • Use when: Speed matters more than semantic coherence
Semantic Chunking:
  • Best for: Natural language documents, articles, books
  • Pros: Preserves semantic boundaries, better context
  • Cons: Slower, variable chunk sizes
  • Use when: Content has clear paragraph/section structure
Recursive Chunking:
  • Best for: Hierarchical documents, technical docs, code
  • Pros: Preserves structure, handles nested content
  • Cons: Most complex, requires structure detection
  • Use when: Documents have clear hierarchical organization
Sentence-Based Chunking:
  • Best for: Q&A pairs, chatbots, precise retrieval
  • Pros: Natural boundaries, good for citations
  • Cons: Small chunks may lack context
  • Use when: Need precise attribution and citations
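The guide above can be collapsed into a small decision helper. A minimal sketch; the strategy names and category sets are illustrative, not a fixed API:

```python
def choose_strategy(doc_type: str, needs_citations: bool = False) -> str:
    """Map a document category to a chunking strategy per the guide above."""
    if needs_citations:
        return "sentence"      # precise attribution and citations
    if doc_type in {"markdown", "code", "technical_doc"}:
        return "recursive"     # clear hierarchical organization
    if doc_type in {"article", "book", "blog"}:
        return "semantic"      # paragraph/section structure matters
    return "fixed_size"        # uniform/simple content: speed wins
```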

Implementation Scripts

1. Fixed-Size Chunking

Script:
scripts/chunk-fixed-size.py
Usage:
bash
python scripts/chunk-fixed-size.py \
  --input document.txt \
  --chunk-size 1000 \
  --overlap 200 \
  --output chunks.json
Parameters:
  • chunk-size: Number of characters per chunk (default: 1000)
  • overlap: Character overlap between chunks (default: 200)
  • split-on: Split on sentences, words, or characters (default: sentences)
Best Practices:
  • Use 500-1000 character chunks for most RAG applications
  • Set overlap to 10-20% of chunk size
  • Split on sentences for better coherence
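The core loop can be sketched in a few lines. This is a simplified stand-in for scripts/chunk-fixed-size.py, assuming sentence splitting on terminal punctuation and character-based overlap; sentences longer than chunk_size pass through unsplit:

```python
import re
from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Pack sentences greedily up to chunk_size characters, then start the
    next chunk with the last `overlap` characters carried over."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > chunk_size:
            chunks.append(current)
            # carry the tail of the previous chunk forward as overlap
            current = current[-overlap:] if overlap else ""
        current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```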

2. Semantic Chunking

Script:
scripts/chunk-semantic.py
Usage:
bash
python scripts/chunk-semantic.py \
  --input document.txt \
  --max-chunk-size 1500 \
  --output chunks.json
How it works:
  1. Detects natural boundaries (paragraphs, headings, line breaks)
  2. Groups content while respecting max chunk size
  3. Preserves semantic units (paragraphs stay together)
  4. Adds context headers for nested sections
Best for: Articles, blog posts, documentation, books
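A paragraph-grouping sketch of the idea. The actual script also handles headings and context headers; this version only respects blank-line boundaries:

```python
from typing import List

def semantic_chunks(text: str, max_chunk_size: int = 1500) -> List[str]:
    """Group whole paragraphs into chunks under max_chunk_size; a paragraph
    is kept intact and only overflows if it alone exceeds the limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)   # flush before the paragraph that overflows
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```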

3. Recursive Chunking

Script:
scripts/chunk-recursive.py
Usage:
bash
python scripts/chunk-recursive.py \
  --input document.md \
  --chunk-size 1000 \
  --separators '["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]' \
  --output chunks.json
How it works:
  1. Tries to split on first separator (e.g., ## headings)
  2. If chunks still too large, recursively splits on next separator
  3. Continues until all chunks are within size limit
  4. Preserves hierarchical context
Separator hierarchy examples:
  • Markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
  • Python: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
  • General: ["\\n\\n", "\\n", ". ", " "]
Best for: Structured documents, source code, technical manuals
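The recursion can be sketched as follows. A simplified stand-in for scripts/chunk-recursive.py: it drops the separators from the output and does not merge undersized pieces, both of which a production splitter would handle:

```python
from typing import List, Sequence

def recursive_chunks(text: str, chunk_size: int = 1000,
                     separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split on the first separator; re-split any piece still over
    chunk_size using the remaining separators."""
    if len(text) <= chunk_size or not separators:
        return [text]
    pieces = [p for p in text.split(separators[0]) if p]
    if len(pieces) == 1:              # separator absent: fall through to the next one
        return recursive_chunks(text, chunk_size, separators[1:])
    out: List[str] = []
    for piece in pieces:
        out.extend(recursive_chunks(piece, chunk_size, separators[1:]))
    return out
```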

4. Benchmark Chunking Strategies

Script:
scripts/benchmark-chunking.py
Usage:
bash
python scripts/benchmark-chunking.py \
  --input document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500 \
  --output benchmark-results.json
Metrics Evaluated:
  • Processing time: Speed of chunking
  • Chunk count: Total chunks generated
  • Chunk size variance: Consistency of chunk sizes
  • Context preservation: Semantic unit integrity (scored)
  • Retrieval quality: Simulated query performance
Output:
json
{
  "fixed-1000": {
    "time_ms": 45,
    "chunk_count": 127,
    "avg_size": 982,
    "size_variance": 12.3,
    "context_score": 0.72
  },
  "semantic-1000": {
    "time_ms": 156,
    "chunk_count": 114,
    "avg_size": 1087,
    "size_variance": 234.5,
    "context_score": 0.91
  }
}
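The size-based metrics in that report are straightforward to compute. A sketch covering time, count, average size, and variance (context and retrieval scores need labeled data and are omitted):

```python
import time
from typing import Callable, Dict, List

def benchmark(chunker: Callable[[str], List[str]], text: str) -> Dict[str, float]:
    """Time a chunker callable and summarize its output chunk sizes."""
    start = time.perf_counter()
    chunks = chunker(text)
    elapsed_ms = (time.perf_counter() - start) * 1000
    sizes = [len(c) for c in chunks]
    avg = sum(sizes) / len(sizes)
    variance = sum((s - avg) ** 2 for s in sizes) / len(sizes)
    return {
        "time_ms": round(elapsed_ms, 2),
        "chunk_count": len(chunks),
        "avg_size": round(avg, 1),
        "size_variance": round(variance, 1),
    }

# Example: a trivial fixed-width chunker over a 1000-character document
result = benchmark(lambda t: [t[i:i + 100] for i in range(0, len(t), 100)], "x" * 1000)
```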

Configuration Template

Template:
templates/chunking-config.yaml
Complete configuration:
yaml
chunking:
  # Global defaults
  default_strategy: semantic
  default_chunk_size: 1000
  default_overlap: 200

  # Strategy-specific configs
  strategies:
    fixed_size:
      chunk_size: 1000
      overlap: 200
      split_on: sentence  # sentence, word, character

    semantic:
      max_chunk_size: 1500
      min_chunk_size: 200
      preserve_paragraphs: true
      add_headers: true  # Include section headers

    recursive:
      chunk_size: 1000
      overlap: 100
      separators:
        markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
        code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
        text: ["\\n\\n", ". ", " "]

  # Document type mappings
  document_types:
    ".md": semantic
    ".py": recursive
    ".txt": fixed_size
    ".pdf": semantic
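The document_types mapping is what lets one config file drive many input types. A sketch of the dispatch, with the parsed YAML inlined as a dict (yaml.safe_load would produce the same structure):

```python
import pathlib

# Parsed form of the relevant sections of chunking-config.yaml
CONFIG = {
    "chunking": {
        "default_strategy": "semantic",
        "document_types": {
            ".md": "semantic",
            ".py": "recursive",
            ".txt": "fixed_size",
            ".pdf": "semantic",
        },
    }
}

def strategy_for(path: str, config: dict = CONFIG) -> str:
    """Resolve a file to its chunking strategy by extension, falling back
    to the global default for unmapped types."""
    chunking = config["chunking"]
    ext = pathlib.Path(path).suffix.lower()
    return chunking["document_types"].get(ext, chunking["default_strategy"])
```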

Custom Splitter Template

Template:
templates/custom-splitter.py
Create your own chunking logic:
python
from typing import List, Dict
import re

class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        chunks = []

        # Your custom chunking logic here
        # Example: Split on custom pattern
        sections = self._split_sections(text)

        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": (metadata or {}).get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })

        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic
        pass
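A concrete instance of the template's contract, written standalone for illustration: it splits on blank lines and emits the chunk/metadata shape documented above.

```python
from typing import Dict, List

class ParagraphChunker:
    """Example splitter following the CustomChunker contract."""

    def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
        metadata = metadata or {}
        sections = [p.strip() for p in text.split("\n\n") if p.strip()]
        return [
            {
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section),
                },
            }
            for i, section in enumerate(sections)
        ]

chunks = ParagraphChunker().chunk("Intro paragraph.\n\nBody paragraph.", {"source": "demo.txt"})
```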

Document-Specific Examples

Markdown Chunking

Example:
examples/chunk-markdown.py
Features:
  • Preserves heading hierarchy
  • Keeps code blocks together
  • Maintains list structure
  • Adds parent section context to chunks
Usage:
bash
python examples/chunk-markdown.py README.md --output readme-chunks.json

Code Chunking

Example:
examples/chunk-code.py
Features:
  • Splits on class and function boundaries
  • Preserves complete functions
  • Includes docstrings with implementations
  • Language-aware separator selection
Supported languages: Python, JavaScript, TypeScript, Java, Go
Usage:
bash
python examples/chunk-code.py src/main.py --language python --output code-chunks.json

PDF Chunking

Example:
examples/chunk-pdf.py
Features:
  • Extracts text from PDF
  • Preserves page boundaries
  • Maintains formatting clues
  • Handles multi-column layouts
Dependencies: pypdf, pdfminer.six
Usage:
bash
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json

Optimization Guidelines

Chunk Size Selection

General recommendations:
Content Type     Chunk Size   Overlap    Strategy
Q&A / FAQs       200-400      50         Sentence
Articles         500-1000     100-200    Semantic
Documentation    1000-1500    200-300    Recursive
Books            1000-2000    300-400    Semantic
Source code      500-1000     100        Recursive
Test with your data: Use benchmark-chunking.py to find optimal settings.
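Chunk size should also respect the embedding model's token limit. A rough check using the common ~4-characters-per-token approximation for English (a heuristic only; use tiktoken for exact counts):

```python
def fits_token_limit(chunk_size_chars: int, model_token_limit: int,
                     chars_per_token: float = 4.0) -> bool:
    """Approximate whether a character-based chunk size stays within a
    model's token limit; the ratio is a heuristic, not an exact count."""
    return chunk_size_chars / chars_per_token <= model_token_limit
```

For example, 1000-character chunks (~250 tokens) fit a 512-token model, while 4000-character chunks (~1000 tokens) do not.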

Overlap Strategies

Why overlap matters:
  • Prevents information loss at boundaries
  • Improves retrieval of cross-boundary information
  • Balances redundancy vs. coverage
Overlap guidelines:
  • 10-15%: Minimal overlap for speed
  • 15-20%: Standard overlap for most use cases
  • 20-30%: High overlap for critical applications
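Overlap trades index size for boundary coverage. With stride = chunk_size − overlap, the chunk count for a document follows directly; a small helper to see the effect:

```python
import math

def overlap_plan(doc_len: int, chunk_size: int, overlap_pct: float) -> dict:
    """Derive overlap characters, stride, and resulting chunk count for a
    sliding-window chunker."""
    overlap = int(chunk_size * overlap_pct)
    stride = chunk_size - overlap
    n_chunks = max(1, math.ceil((doc_len - overlap) / stride))
    return {"overlap_chars": overlap, "stride": stride, "chunks": n_chunks}

# 10,000 characters at chunk_size=1000 with 20% overlap -> 13 chunks
plan = overlap_plan(10_000, 1000, 0.20)
```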

Performance Optimization

Fast chunking (large documents):
bash
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
Quality chunking (smaller documents):
bash
# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
Batch processing:
bash
# Process multiple files
for file in documents/*.txt; do
  python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename "$file" .txt).json"
done

Evaluation Workflow

Step 1: Benchmark Strategies

bash
python scripts/benchmark-chunking.py \
  --input sample-document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500

Step 2: Analyze Results

Review metrics:
  • Processing time (prefer < 100ms per document)
  • Context preservation score (target > 0.85)
  • Chunk size variance (lower is more predictable)

Step 3: A/B Test Retrieval

Compare retrieval quality:
  1. Chunk same corpus with different strategies
  2. Run identical test queries against each
  3. Measure precision@k and recall@k
  4. Select strategy with best retrieval metrics
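The two metrics in step 3 are simple set computations per query. A minimal sketch, where retrieved is the ranked chunk-id list returned for a query and relevant is the labeled ground-truth set:

```python
from typing import List, Set, Tuple

def precision_recall_at_k(retrieved: List[str], relevant: Set[str],
                          k: int) -> Tuple[float, float]:
    """precision@k = relevant hits in the top k / k;
    recall@k = relevant hits in the top k / total relevant."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical query result: 2 of the top 3 retrieved chunks are relevant
p, r = precision_recall_at_k(["c3", "c7", "c1", "c9"], {"c3", "c1", "c5"}, k=3)
```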

Step 4: Production Deployment

Use configuration file:
python
import yaml
from chunking_strategies import get_chunker

config = yaml.safe_load(open('chunking-config.yaml'))
chunker = get_chunker(config['chunking']['default_strategy'], config)

chunks = chunker.chunk(document_text)

Common Issues & Solutions

Issue: Chunks too small/large
  • Adjust the chunk_size parameter
  • Check document structure (may need different strategy)
  • Verify separator selection for recursive chunking
Issue: Lost context at boundaries
  • Increase overlap
  • Switch to semantic chunking
  • Add parent context to metadata
Issue: Slow processing
  • Use fixed-size chunking for large batches
  • Reduce overlap
  • Process files in parallel
Issue: Poor retrieval quality
  • Benchmark different strategies
  • Increase chunk size
  • Try hybrid approach (semantic + fixed fallback)

Dependencies

Core libraries:
bash
pip install tiktoken  # Token counting
pip install nltk      # Sentence splitting
pip install spacy     # Advanced NLP (optional)
For PDF support:
bash
pip install pypdf pdfminer.six
For benchmarking:
bash
pip install pandas numpy scikit-learn

Best Practices Summary

  1. Start with semantic chunking for most documents
  2. Use recursive chunking for structured/hierarchical content
  3. Benchmark on your actual data before production
  4. Set overlap to 15-20% of chunk size
  5. Include metadata (source, page, section) in chunks
  6. Test retrieval quality, not just chunking speed
  7. Use appropriate chunk size for your embedding model token limit
  8. Document your chunking strategy choice and parameters

Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom
Output Format: JSON with text and metadata
Version: 1.0.0