
Chunking Strategies

Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers:
  • Implementing document chunking for RAG
  • Optimizing chunk size and overlap
  • Comparing different chunking strategies
  • Benchmarking chunking performance
  • Processing different document types (markdown, code, PDFs)
  • Evaluating retrieval quality with different chunk strategies
Key Resources:
  • scripts/chunk-fixed-size.py
    - Fixed-size chunking implementation
  • scripts/chunk-semantic.py
    - Semantic chunking with paragraph preservation
  • scripts/chunk-recursive.py
    - Recursive chunking for hierarchical documents
  • scripts/benchmark-chunking.py
    - Benchmark and compare chunking strategies
  • templates/chunking-config.yaml
    - Chunking configuration template
  • templates/custom-splitter.py
    - Template for custom chunking logic
  • examples/chunk-markdown.py
    - Markdown-specific chunking
  • examples/chunk-code.py
    - Source code chunking
  • examples/chunk-pdf.py
    - PDF document chunking

Chunking Strategy Overview

Strategy Selection Guide

Fixed-Size Chunking:
  • Best for: Uniform documents, simple content, consistent structure
  • Pros: Fast, predictable, simple implementation
  • Cons: May split semantic units, no context awareness
  • Use when: Speed matters more than semantic coherence
Semantic Chunking:
  • Best for: Natural language documents, articles, books
  • Pros: Preserves semantic boundaries, better context
  • Cons: Slower, variable chunk sizes
  • Use when: Content has clear paragraph/section structure
Recursive Chunking:
  • Best for: Hierarchical documents, technical docs, code
  • Pros: Preserves structure, handles nested content
  • Cons: Most complex, requires structure detection
  • Use when: Documents have clear hierarchical organization
Sentence-Based Chunking:
  • Best for: Q&A pairs, chatbots, precise retrieval
  • Pros: Natural boundaries, good for citations
  • Cons: Small chunks may lack context
  • Use when: Need precise attribution and citations
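The guide above can be collapsed into a small decision helper. A minimal sketch; the strategy names and category sets are illustrative, not a fixed API:

```python
def choose_strategy(doc_type: str, needs_citations: bool = False) -> str:
    """Map a document category to a chunking strategy per the guide above."""
    if needs_citations:
        return "sentence"      # precise attribution and citations
    if doc_type in {"markdown", "code", "technical_doc"}:
        return "recursive"     # clear hierarchical organization
    if doc_type in {"article", "book", "blog"}:
        return "semantic"      # paragraph/section structure matters
    return "fixed_size"        # uniform/simple content: speed wins
```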

Implementation Scripts

1. Fixed-Size Chunking

Script:
scripts/chunk-fixed-size.py
Usage:
bash
python scripts/chunk-fixed-size.py \
  --input document.txt \
  --chunk-size 1000 \
  --overlap 200 \
  --output chunks.json
Parameters:
  • chunk-size: Number of characters per chunk (default: 1000)
  • overlap: Character overlap between chunks (default: 200)
  • split-on: Split on sentences, words, or characters (default: sentences)
Best Practices:
  • Use 500-1000 character chunks for most RAG applications
  • Set overlap to 10-20% of chunk size
  • Split on sentences for better coherence
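The core loop can be sketched in a few lines. This is a simplified stand-in for scripts/chunk-fixed-size.py, assuming sentence splitting on terminal punctuation and character-based overlap; sentences longer than chunk_size pass through unsplit:

```python
import re
from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Pack sentences greedily up to chunk_size characters, then start the
    next chunk with the last `overlap` characters carried over."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > chunk_size:
            chunks.append(current)
            # carry the tail of the previous chunk forward as overlap
            current = current[-overlap:] if overlap else ""
        current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```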

2. Semantic Chunking

Script:
scripts/chunk-semantic.py
Usage:
bash
python scripts/chunk-semantic.py \
  --input document.txt \
  --max-chunk-size 1500 \
  --output chunks.json
How it works:
  1. Detects natural boundaries (paragraphs, headings, line breaks)
  2. Groups content while respecting max chunk size
  3. Preserves semantic units (paragraphs stay together)
  4. Adds context headers for nested sections
Best for: Articles, blog posts, documentation, books
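A paragraph-grouping sketch of the idea. The actual script also handles headings and context headers; this version only respects blank-line boundaries:

```python
from typing import List

def semantic_chunks(text: str, max_chunk_size: int = 1500) -> List[str]:
    """Group whole paragraphs into chunks under max_chunk_size; a paragraph
    is kept intact and only overflows if it alone exceeds the limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)   # flush before the paragraph that overflows
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```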

3. Recursive Chunking

Script:
scripts/chunk-recursive.py
Usage:
bash
python scripts/chunk-recursive.py \
  --input document.md \
  --chunk-size 1000 \
  --separators '["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]' \
  --output chunks.json
How it works:
  1. Tries to split on first separator (e.g., ## headings)
  2. If chunks still too large, recursively splits on next separator
  3. Continues until all chunks are within size limit
  4. Preserves hierarchical context
Separator hierarchy examples:
  • Markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
  • Python: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
  • General: ["\\n\\n", "\\n", ". ", " "]
Best for: Structured documents, source code, technical manuals
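The recursion can be sketched as follows. A simplified stand-in for scripts/chunk-recursive.py: it drops the separators from the output and does not merge undersized pieces, both of which a production splitter would handle:

```python
from typing import List, Sequence

def recursive_chunks(text: str, chunk_size: int = 1000,
                     separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split on the first separator; re-split any piece still over
    chunk_size using the remaining separators."""
    if len(text) <= chunk_size or not separators:
        return [text]
    pieces = [p for p in text.split(separators[0]) if p]
    if len(pieces) == 1:              # separator absent: fall through to the next one
        return recursive_chunks(text, chunk_size, separators[1:])
    out: List[str] = []
    for piece in pieces:
        out.extend(recursive_chunks(piece, chunk_size, separators[1:]))
    return out
```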

4. Benchmark Chunking Strategies

Script:
scripts/benchmark-chunking.py
Usage:
bash
python scripts/benchmark-chunking.py \
  --input document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500 \
  --output benchmark-results.json
Metrics Evaluated:
  • Processing time: Speed of chunking
  • Chunk count: Total chunks generated
  • Chunk size variance: Consistency of chunk sizes
  • Context preservation: Semantic unit integrity (scored)
  • Retrieval quality: Simulated query performance
Output:
json
{
  "fixed-1000": {
    "time_ms": 45,
    "chunk_count": 127,
    "avg_size": 982,
    "size_variance": 12.3,
    "context_score": 0.72
  },
  "semantic-1000": {
    "time_ms": 156,
    "chunk_count": 114,
    "avg_size": 1087,
    "size_variance": 234.5,
    "context_score": 0.91
  }
}
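The size-based metrics in that report are straightforward to compute. A sketch covering time, count, average size, and variance (context and retrieval scores need labeled data and are omitted):

```python
import time
from typing import Callable, Dict, List

def benchmark(chunker: Callable[[str], List[str]], text: str) -> Dict[str, float]:
    """Time a chunker callable and summarize its output chunk sizes."""
    start = time.perf_counter()
    chunks = chunker(text)
    elapsed_ms = (time.perf_counter() - start) * 1000
    sizes = [len(c) for c in chunks]
    avg = sum(sizes) / len(sizes)
    variance = sum((s - avg) ** 2 for s in sizes) / len(sizes)
    return {
        "time_ms": round(elapsed_ms, 2),
        "chunk_count": len(chunks),
        "avg_size": round(avg, 1),
        "size_variance": round(variance, 1),
    }

# Example: a trivial fixed-width chunker over a 1000-character document
result = benchmark(lambda t: [t[i:i + 100] for i in range(0, len(t), 100)], "x" * 1000)
```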

Configuration Template

Template:
templates/chunking-config.yaml
Complete configuration:
yaml
chunking:
  # Global defaults
  default_strategy: semantic
  default_chunk_size: 1000
  default_overlap: 200

  # Strategy-specific configs
  strategies:
    fixed_size:
      chunk_size: 1000
      overlap: 200
      split_on: sentence  # sentence, word, character

    semantic:
      max_chunk_size: 1500
      min_chunk_size: 200
      preserve_paragraphs: true
      add_headers: true  # Include section headers

    recursive:
      chunk_size: 1000
      overlap: 100
      separators:
        markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
        code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
        text: ["\\n\\n", ". ", " "]

  # Document type mappings
  document_types:
    ".md": semantic
    ".py": recursive
    ".txt": fixed_size
    ".pdf": semantic
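The document_types mapping is what lets one config file drive many input types. A sketch of the dispatch, with the parsed YAML inlined as a dict (yaml.safe_load would produce the same structure):

```python
import pathlib

# Parsed form of the relevant sections of chunking-config.yaml
CONFIG = {
    "chunking": {
        "default_strategy": "semantic",
        "document_types": {
            ".md": "semantic",
            ".py": "recursive",
            ".txt": "fixed_size",
            ".pdf": "semantic",
        },
    }
}

def strategy_for(path: str, config: dict = CONFIG) -> str:
    """Resolve a file to its chunking strategy by extension, falling back
    to the global default for unmapped types."""
    chunking = config["chunking"]
    ext = pathlib.Path(path).suffix.lower()
    return chunking["document_types"].get(ext, chunking["default_strategy"])
```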

Custom Splitter Template

Template:
templates/custom-splitter.py
Create your own chunking logic:
python
from typing import List, Dict
import re

class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        chunks = []

        # Your custom chunking logic here
        # Example: Split on custom pattern
        sections = self._split_sections(text)

        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": (metadata or {}).get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })

        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic
        pass
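A concrete instance of the template's contract, written standalone for illustration: it splits on blank lines and emits the chunk/metadata shape documented above.

```python
from typing import Dict, List

class ParagraphChunker:
    """Example splitter following the CustomChunker contract."""

    def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
        metadata = metadata or {}
        sections = [p.strip() for p in text.split("\n\n") if p.strip()]
        return [
            {
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section),
                },
            }
            for i, section in enumerate(sections)
        ]

chunks = ParagraphChunker().chunk("Intro paragraph.\n\nBody paragraph.", {"source": "demo.txt"})
```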

Document-Specific Examples

Markdown Chunking

Example:
examples/chunk-markdown.py
Features:
  • Preserves heading hierarchy
  • Keeps code blocks together
  • Maintains list structure
  • Adds parent section context to chunks
Usage:
bash
python examples/chunk-markdown.py README.md --output readme-chunks.json

Code Chunking

Example:
examples/chunk-code.py
Features:
  • Splits on class and function boundaries
  • Preserves complete functions
  • Includes docstrings with implementations
  • Language-aware separator selection
Supported languages: Python, JavaScript, TypeScript, Java, Go
Usage:
bash
python examples/chunk-code.py src/main.py --language python --output code-chunks.json

PDF Chunking

Example:
examples/chunk-pdf.py
Features:
  • Extracts text from PDF
  • Preserves page boundaries
  • Maintains formatting clues
  • Handles multi-column layouts
Dependencies: pypdf, pdfminer.six
Usage:
bash
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json

Optimization Guidelines

Chunk Size Selection

General recommendations:
Content Type     Chunk Size   Overlap    Strategy
Q&A / FAQs       200-400      50         Sentence
Articles         500-1000     100-200    Semantic
Documentation    1000-1500    200-300    Recursive
Books            1000-2000    300-400    Semantic
Source code      500-1000     100        Recursive
Test with your data: Use benchmark-chunking.py to find optimal settings.
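Chunk size should also respect the embedding model's token limit. A rough check using the common ~4-characters-per-token approximation for English (a heuristic only; use tiktoken for exact counts):

```python
def fits_token_limit(chunk_size_chars: int, model_token_limit: int,
                     chars_per_token: float = 4.0) -> bool:
    """Approximate whether a character-based chunk size stays within a
    model's token limit; the ratio is a heuristic, not an exact count."""
    return chunk_size_chars / chars_per_token <= model_token_limit
```

For example, 1000-character chunks (~250 tokens) fit a 512-token model, while 4000-character chunks (~1000 tokens) do not.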

Overlap Strategies

Why overlap matters:
  • Prevents information loss at boundaries
  • Improves retrieval of cross-boundary information
  • Balances redundancy vs. coverage
Overlap guidelines:
  • 10-15%: Minimal overlap for speed
  • 15-20%: Standard overlap for most use cases
  • 20-30%: High overlap for critical applications
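Overlap trades index size for boundary coverage. With stride = chunk_size − overlap, the chunk count for a document follows directly; a small helper to see the effect:

```python
import math

def overlap_plan(doc_len: int, chunk_size: int, overlap_pct: float) -> dict:
    """Derive overlap characters, stride, and resulting chunk count for a
    sliding-window chunker."""
    overlap = int(chunk_size * overlap_pct)
    stride = chunk_size - overlap
    n_chunks = max(1, math.ceil((doc_len - overlap) / stride))
    return {"overlap_chars": overlap, "stride": stride, "chunks": n_chunks}

# 10,000 characters at chunk_size=1000 with 20% overlap -> 13 chunks
plan = overlap_plan(10_000, 1000, 0.20)
```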

Performance Optimization

Fast chunking (large documents):
bash
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
Quality chunking (smaller documents):
bash
# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
Batch processing:
bash
# Process multiple files
for file in documents/*.txt; do
  python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename "$file" .txt).json"
done

Evaluation Workflow

Step 1: Benchmark Strategies

bash
python scripts/benchmark-chunking.py \
  --input sample-document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500

Step 2: Analyze Results

Review metrics:
  • Processing time (prefer < 100ms per document)
  • Context preservation score (target > 0.85)
  • Chunk size variance (lower is more predictable)

Step 3: A/B Test Retrieval

Compare retrieval quality:
  1. Chunk same corpus with different strategies
  2. Run identical test queries against each
  3. Measure precision@k and recall@k
  4. Select strategy with best retrieval metrics
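The two metrics in step 3 are simple set computations per query. A minimal sketch, where retrieved is the ranked chunk-id list returned for a query and relevant is the labeled ground-truth set:

```python
from typing import List, Set, Tuple

def precision_recall_at_k(retrieved: List[str], relevant: Set[str],
                          k: int) -> Tuple[float, float]:
    """precision@k = relevant hits in the top k / k;
    recall@k = relevant hits in the top k / total relevant."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical query result: 2 of the top 3 retrieved chunks are relevant
p, r = precision_recall_at_k(["c3", "c7", "c1", "c9"], {"c3", "c1", "c5"}, k=3)
```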

Step 4: Production Deployment

Use configuration file:
python
import yaml
from chunking_strategies import get_chunker

config = yaml.safe_load(open('chunking-config.yaml'))
chunker = get_chunker(config['chunking']['default_strategy'], config)

chunks = chunker.chunk(document_text)

Common Issues & Solutions

Issue: Chunks too small/large
  • Adjust the chunk_size parameter
  • Check document structure (may need different strategy)
  • Verify separator selection for recursive chunking
Issue: Lost context at boundaries
  • Increase overlap
  • Switch to semantic chunking
  • Add parent context to metadata
Issue: Slow processing
  • Use fixed-size chunking for large batches
  • Reduce overlap
  • Process files in parallel
Issue: Poor retrieval quality
  • Benchmark different strategies
  • Increase chunk size
  • Try hybrid approach (semantic + fixed fallback)

Dependencies

Core libraries:
bash
pip install tiktoken  # Token counting
pip install nltk      # Sentence splitting
pip install spacy     # Advanced NLP (optional)
For PDF support:
bash
pip install pypdf pdfminer.six
For benchmarking:
bash
pip install pandas numpy scikit-learn

Best Practices Summary

  1. Start with semantic chunking for most documents
  2. Use recursive chunking for structured/hierarchical content
  3. Benchmark on your actual data before production
  4. Set overlap to 15-20% of chunk size
  5. Include metadata (source, page, section) in chunks
  6. Test retrieval quality, not just chunking speed
  7. Use appropriate chunk size for your embedding model token limit
  8. Document your chunking strategy choice and parameters

Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom
Output Format: JSON with text and metadata
Version: 1.0.0