
Chunking Strategy for RAG Systems


Overview


Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.

When to Use


Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.

Instructions


Choose Chunking Strategy


Select appropriate chunking strategy based on document type and use case:
  1. Fixed-Size Chunking (Level 1)
    • Use for simple documents without clear structure
    • Start with 512 tokens and 10-20% overlap
    • Adjust size based on query type: 256 for factoid, 1024 for analytical
  2. Recursive Character Chunking (Level 2)
    • Use for documents with clear structural boundaries
    • Implement hierarchical separators: paragraphs → sentences → words
    • Customize separators for document types (HTML, Markdown)
  3. Structure-Aware Chunking (Level 3)
    • Use for structured documents (Markdown, code, tables, PDFs)
    • Preserve semantic units: functions, sections, table blocks
    • Validate structure preservation post-splitting
  4. Semantic Chunking (Level 4)
    • Use for complex documents with thematic shifts
    • Implement embedding-based boundary detection
    • Configure similarity threshold (0.8) and buffer size (3-5 sentences)
  5. Advanced Methods (Level 5)
    • Use Late Chunking for long-context embedding models
    • Apply Contextual Retrieval for high-precision requirements
    • Monitor computational costs vs. retrieval improvements
Reference detailed strategy implementations in references/strategies.md.
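The hierarchical-separator idea behind strategy 2 can be sketched from scratch in a few lines. This is an illustrative sketch, not the implementation from references/strategies.md; the separator list and character-based length are simplifying assumptions (a production splitter would measure length in tokens):

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator whose pieces can be packed under max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                # A single piece may still be too long: recurse on finer separators
                if len(piece) <= max_len:
                    current = piece
                else:
                    chunks.extend(recursive_split(piece, max_len, separators))
                    current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: hard-cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraph breaks are tried before sentence and word breaks, chunk boundaries fall on the most natural boundary available at each level, which is exactly why this method suits documents with clear structure.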

Implement Chunking Pipeline


Follow these steps to implement effective chunking:
  1. Pre-process documents
    • Analyze document structure and content types
    • Identify multi-modal content (tables, images, code)
    • Assess information density and complexity
  2. Select strategy parameters
    • Choose chunk size based on embedding model context window
    • Set overlap percentage (10-20% for most cases)
    • Configure strategy-specific parameters
  3. Process and validate
    • Apply chosen chunking strategy
    • Validate semantic coherence of chunks
    • Test with representative documents
  4. Evaluate and iterate
    • Measure retrieval precision and recall
    • Monitor processing latency and resource usage
    • Optimize based on specific use case requirements
Reference detailed implementation guidelines in references/implementation.md.
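Step 2's size-and-overlap choice can be sketched with a simple sliding token window. Whitespace splitting stands in for a real model tokenizer here, and the 15% default overlap is just the midpoint of the 10-20% guideline:

```python
def fixed_size_chunks(text, chunk_size=512, overlap_pct=0.15):
    """Slide a token window over the text with fractional overlap between windows."""
    tokens = text.split()  # stand-in for the embedding model's tokenizer
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means information straddling a window edge appears whole in at least one chunk, which is the pipeline's guard against boundary-crossing loss.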

Evaluate Performance


Use these metrics to evaluate chunking effectiveness:
  • Retrieval Precision: Fraction of retrieved chunks that are relevant
  • Retrieval Recall: Fraction of relevant chunks that are retrieved
  • End-to-End Accuracy: Quality of final RAG responses
  • Processing Time: Latency impact on overall system
  • Resource Usage: Memory and computational costs
Reference detailed evaluation framework in references/evaluation.md.
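The first two metrics reduce to set arithmetic over chunk IDs. This toy helper scores a single query; a real evaluation (as in references/evaluation.md) would average over a labeled query set:

```python
def retrieval_metrics(retrieved, relevant):
    """Precision and recall over sets of chunk IDs for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```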

Examples


Basic Fixed-Size Chunking


```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)
chunks = splitter.split_documents(documents)
```

Structure-Aware Code Chunking


```python
import ast

def chunk_python_code(code):
    """Split Python code into one chunk per top-level function or class."""
    tree = ast.parse(code)
    chunks = []

    # Iterate over top-level definitions only: ast.walk would also visit
    # methods nested inside classes, emitting their source a second time.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))

    return chunks
```

Semantic Chunking with Embeddings


```python
def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text at semantic boundaries detected via embedding similarity.

    Assumes three helpers supplied elsewhere: split_into_sentences (e.g. an
    NLTK or spaCy sentence splitter), generate_embeddings (any sentence
    embedding model), and cosine_similarity.
    """
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = generate_embeddings(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        # A drop in adjacent-sentence similarity marks a topic boundary
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks
```

Best Practices


Core Principles


  • Balance context preservation with retrieval precision
  • Maintain semantic coherence within chunks
  • Optimize for embedding model constraints
  • Preserve document structure when beneficial

Implementation Guidelines


  • Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
  • Test thoroughly with representative documents
  • Monitor both accuracy metrics and computational costs
  • Iterate based on specific document characteristics

Common Pitfalls to Avoid


  • Over-chunking: Creating too many small, context-poor chunks
  • Under-chunking: Missing relevant information due to oversized chunks
  • Ignoring document structure and semantic boundaries
  • Using one-size-fits-all approach for diverse content types
  • Neglecting overlap for boundary-crossing information

Constraints


Resource Considerations


  • Semantic and contextual methods require significant computational resources
  • Late chunking needs long-context embedding models
  • Complex strategies increase processing latency
  • Monitor memory usage for large document processing

Quality Requirements


  • Validate chunk semantic coherence post-processing
  • Test with domain-specific documents before deployment
  • Ensure chunks maintain standalone meaning where possible
  • Implement proper error handling for edge cases

References


Reference detailed documentation in the references/ folder:
  • strategies.md - Detailed strategy implementations
  • implementation.md - Complete implementation guidelines
  • evaluation.md - Performance evaluation framework
  • tools.md - Recommended libraries and frameworks
  • research.md - Key research papers and findings
  • advanced-strategies.md - 11 comprehensive chunking methods
  • semantic-methods.md - Semantic and contextual approaches
  • visualization-tools.md - Evaluation and visualization tools