
Chunking Strategy for RAG Systems


Overview


Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.

When to Use


Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.

Instructions


Choose Chunking Strategy


Select appropriate chunking strategy based on document type and use case:
  1. Fixed-Size Chunking (Level 1)
    • Use for simple documents without clear structure
    • Start with 512 tokens and 10-20% overlap
    • Adjust size based on query type: 256 for factoid, 1024 for analytical
  2. Recursive Character Chunking (Level 2)
    • Use for documents with clear structural boundaries
    • Implement hierarchical separators: paragraphs → sentences → words
    • Customize separators for document types (HTML, Markdown)
  3. Structure-Aware Chunking (Level 3)
    • Use for structured documents (Markdown, code, tables, PDFs)
    • Preserve semantic units: functions, sections, table blocks
    • Validate structure preservation post-splitting
  4. Semantic Chunking (Level 4)
    • Use for complex documents with thematic shifts
    • Implement embedding-based boundary detection
    • Configure similarity threshold (0.8) and buffer size (3-5 sentences)
  5. Advanced Methods (Level 5)
    • Use Late Chunking for long-context embedding models
    • Apply Contextual Retrieval for high-precision requirements
    • Monitor computational costs vs. retrieval improvements
Reference detailed strategy implementations in references/strategies.md.
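The hierarchical-separator idea behind strategy 2 can be sketched from scratch in a few lines. This is an illustrative sketch, not the implementation from references/strategies.md; the separator list and character-based length are simplifying assumptions (a production splitter would measure length in tokens):

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator whose pieces can be packed under max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_len:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                # A single piece may still be too long: recurse on finer separators
                if len(piece) <= max_len:
                    current = piece
                else:
                    chunks.extend(recursive_split(piece, max_len, separators))
                    current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: hard-cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraph breaks are tried before sentence and word breaks, chunk boundaries fall on the most natural boundary available at each level, which is exactly why this method suits documents with clear structure.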

Implement Chunking Pipeline


Follow these steps to implement effective chunking:
  1. Pre-process documents
    • Analyze document structure and content types
    • Identify multi-modal content (tables, images, code)
    • Assess information density and complexity
  2. Select strategy parameters
    • Choose chunk size based on embedding model context window
    • Set overlap percentage (10-20% for most cases)
    • Configure strategy-specific parameters
  3. Process and validate
    • Apply chosen chunking strategy
    • Validate semantic coherence of chunks
    • Test with representative documents
  4. Evaluate and iterate
    • Measure retrieval precision and recall
    • Monitor processing latency and resource usage
    • Optimize based on specific use case requirements
Reference detailed implementation guidelines in references/implementation.md.
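Step 2's size-and-overlap choice can be sketched with a simple sliding token window. Whitespace splitting stands in for a real model tokenizer here, and the 15% default overlap is just the midpoint of the 10-20% guideline:

```python
def fixed_size_chunks(text, chunk_size=512, overlap_pct=0.15):
    """Slide a token window over the text with fractional overlap between windows."""
    tokens = text.split()  # stand-in for the embedding model's tokenizer
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means information straddling a window edge appears whole in at least one chunk, which is the pipeline's guard against boundary-crossing loss.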

Evaluate Performance


Use these metrics to evaluate chunking effectiveness:
  • Retrieval Precision: Fraction of retrieved chunks that are relevant
  • Retrieval Recall: Fraction of relevant chunks that are retrieved
  • End-to-End Accuracy: Quality of final RAG responses
  • Processing Time: Latency impact on overall system
  • Resource Usage: Memory and computational costs
Reference detailed evaluation framework in references/evaluation.md.
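The first two metrics reduce to set arithmetic over chunk IDs. This toy helper scores a single query; a real evaluation (as in references/evaluation.md) would average over a labeled query set:

```python
def retrieval_metrics(retrieved, relevant):
    """Precision and recall over sets of chunk IDs for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```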

Examples


Basic Fixed-Size Chunking


```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)
chunks = splitter.split_documents(documents)
```

Structure-Aware Code Chunking


```python
import ast

def chunk_python_code(code):
    """Split Python code into one chunk per top-level function or class."""
    tree = ast.parse(code)
    chunks = []

    # Iterate over top-level definitions only: ast.walk would also visit
    # methods nested inside classes, emitting their source a second time.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))

    return chunks
```

Semantic Chunking with Embeddings


```python
def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text at semantic boundaries detected via embedding similarity.

    Assumes three helpers supplied elsewhere: split_into_sentences (e.g. an
    NLTK or spaCy sentence splitter), generate_embeddings (any sentence
    embedding model), and cosine_similarity.
    """
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = generate_embeddings(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        # A drop in adjacent-sentence similarity marks a topic boundary
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks
```

Best Practices


Core Principles


  • Balance context preservation with retrieval precision
  • Maintain semantic coherence within chunks
  • Optimize for embedding model constraints
  • Preserve document structure when beneficial

Implementation Guidelines


  • Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
  • Test thoroughly with representative documents
  • Monitor both accuracy metrics and computational costs
  • Iterate based on specific document characteristics

Common Pitfalls to Avoid


  • Over-chunking: Creating too many small, context-poor chunks
  • Under-chunking: Missing relevant information due to oversized chunks
  • Ignoring document structure and semantic boundaries
  • Using one-size-fits-all approach for diverse content types
  • Neglecting overlap for boundary-crossing information

Constraints


Resource Considerations


  • Semantic and contextual methods require significant computational resources
  • Late chunking needs long-context embedding models
  • Complex strategies increase processing latency
  • Monitor memory usage for large document processing

Quality Requirements


  • Validate chunk semantic coherence post-processing
  • Test with domain-specific documents before deployment
  • Ensure chunks maintain standalone meaning where possible
  • Implement proper error handling for edge cases

References


Reference detailed documentation in the references/ folder:
  • strategies.md - Detailed strategy implementations
  • implementation.md - Complete implementation guidelines
  • evaluation.md - Performance evaluation framework
  • tools.md - Recommended libraries and frameworks
  • research.md - Key research papers and findings
  • advanced-strategies.md - 11 comprehensive chunking methods
  • semantic-methods.md - Semantic and contextual approaches
  • visualization-tools.md - Evaluation and visualization tools