Chunking Strategy for RAG Systems
Overview
Implement optimal chunking strategies for Retrieval-Augmented Generation (RAG) systems and document processing pipelines. This skill provides a comprehensive framework for breaking large documents into smaller, semantically meaningful segments that preserve context while enabling efficient retrieval and search.
When to Use
Use this skill when building RAG systems, optimizing vector search performance, implementing document processing pipelines, handling multi-modal content, or performance-tuning existing RAG systems with poor retrieval quality.
Instructions
Choose Chunking Strategy
Select an appropriate chunking strategy based on document type and use case:

- Fixed-Size Chunking (Level 1)
  - Use for simple documents without clear structure
  - Start with 512 tokens and 10-20% overlap
  - Adjust size based on query type: 256 for factoid, 1024 for analytical
- Recursive Character Chunking (Level 2)
  - Use for documents with clear structural boundaries
  - Implement hierarchical separators: paragraphs → sentences → words
  - Customize separators for document types (HTML, Markdown)
- Structure-Aware Chunking (Level 3)
  - Use for structured documents (Markdown, code, tables, PDFs)
  - Preserve semantic units: functions, sections, table blocks
  - Validate structure preservation after splitting
- Semantic Chunking (Level 4)
  - Use for complex documents with thematic shifts
  - Implement embedding-based boundary detection
  - Configure similarity threshold (0.8) and buffer size (3-5 sentences)
- Advanced Methods (Level 5)
  - Use Late Chunking for long-context embedding models
  - Apply Contextual Retrieval for high-precision requirements
  - Monitor computational costs vs. retrieval improvements
Reference detailed strategy implementations in references/strategies.md.
Implement Chunking Pipeline
Follow these steps to implement effective chunking:

1. Pre-process documents
   - Analyze document structure and content types
   - Identify multi-modal content (tables, images, code)
   - Assess information density and complexity
2. Select strategy parameters
   - Choose chunk size based on the embedding model's context window
   - Set overlap percentage (10-20% for most cases)
   - Configure strategy-specific parameters
3. Process and validate
   - Apply the chosen chunking strategy
   - Validate semantic coherence of chunks
   - Test with representative documents
4. Evaluate and iterate
   - Measure retrieval precision and recall
   - Monitor processing latency and resource usage
   - Optimize based on specific use case requirements
Reference detailed implementation guidelines in references/implementation.md.
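The steps above can be sketched as a minimal pipeline skeleton. The length thresholds and the pluggable `chunk_fn` are illustrative assumptions, not fixed requirements:

```python
def chunking_pipeline(documents, chunk_fn, min_len=50, max_len=2000):
    """Minimal pipeline: pre-process, chunk, validate, collect stats."""
    results = []
    stats = {"docs": 0, "chunks": 0, "dropped": 0}
    for doc in documents:
        text = doc.strip()  # pre-process: trim surrounding whitespace
        if not text:
            continue  # skip empty documents
        stats["docs"] += 1
        for chunk in chunk_fn(text):
            if len(chunk) < min_len:  # too short to stand alone
                stats["dropped"] += 1
                continue
            if len(chunk) > max_len:  # exceeds embedding context budget
                raise ValueError("chunk exceeds embedding context budget")
            results.append(chunk)
            stats["chunks"] += 1
    return results, stats
```

Any splitter can be passed as `chunk_fn`, e.g. `lambda t: t.split("\n\n")` for a paragraph baseline, making it easy to A/B-test strategies against the same validation rules.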
Evaluate Performance
Use these metrics to evaluate chunking effectiveness:
- Retrieval Precision: Fraction of retrieved chunks that are relevant
- Retrieval Recall: Fraction of relevant chunks that are retrieved
- End-to-End Accuracy: Quality of final RAG responses
- Processing Time: Latency impact on overall system
- Resource Usage: Memory and computational costs
Reference detailed evaluation framework in references/evaluation.md.
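For the first two metrics, the computation over chunk IDs is straightforward; ground-truth relevance labels are assumed to exist for each test query:

```python
def retrieval_precision_recall(retrieved, relevant):
    """Compute retrieval precision and recall from chunk-ID collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```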
Examples
Basic Fixed-Size Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure for factoid queries: small chunks, ~10% overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len
)
chunks = splitter.split_documents(documents)
```

Structure-Aware Code Chunking
```python
import ast

def chunk_python_code(code):
    """Split Python code into semantic chunks"""
    tree = ast.parse(code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks
```

Semantic Chunking with Embeddings
```python
def semantic_chunk(text, similarity_threshold=0.8):
    """Chunk text based on semantic boundaries"""
    # Assumes helpers split_into_sentences, generate_embeddings,
    # and cosine_similarity are defined elsewhere
    sentences = split_into_sentences(text)
    embeddings = generate_embeddings(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < similarity_threshold:
            # A drop in similarity marks a thematic boundary
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```

Best Practices
Core Principles
- Balance context preservation with retrieval precision
- Maintain semantic coherence within chunks
- Optimize for embedding model constraints
- Preserve document structure when beneficial
Implementation Guidelines
- Start simple with fixed-size chunking (512 tokens, 10-20% overlap)
- Test thoroughly with representative documents
- Monitor both accuracy metrics and computational costs
- Iterate based on specific document characteristics
Common Pitfalls to Avoid
- Over-chunking: Creating too many small, context-poor chunks
- Under-chunking: Missing relevant information due to oversized chunks
- Ignoring document structure and semantic boundaries
- Using one-size-fits-all approach for diverse content types
- Neglecting overlap for boundary-crossing information
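To illustrate the last pitfall, a minimal sliding-window chunker whose overlap guarantees that content spanning a chunk boundary appears intact in at least one chunk (the size and overlap values are illustrative):

```python
def sliding_chunks(tokens, size=512, overlap=64):
    """Fixed-size chunks with overlap so that any span of up to
    `overlap` tokens crossing a boundary survives whole in one chunk."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```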
Constraints
Resource Considerations
- Semantic and contextual methods require significant computational resources
- Late chunking needs long-context embedding models
- Complex strategies increase processing latency
- Monitor memory usage for large document processing
Quality Requirements
- Validate chunk semantic coherence post-processing
- Test with domain-specific documents before deployment
- Ensure chunks maintain standalone meaning where possible
- Implement proper error handling for edge cases
References
Reference detailed documentation in the references/ folder:
- strategies.md - Detailed strategy implementations
- implementation.md - Complete implementation guidelines
- evaluation.md - Performance evaluation framework
- tools.md - Recommended libraries and frameworks
- research.md - Key research papers and findings
- advanced-strategies.md - 11 comprehensive chunking methods
- semantic-methods.md - Semantic and contextual approaches
- visualization-tools.md - Evaluation and visualization tools