Document chunking implementations and benchmarking tools for RAG pipelines including fixed-size, semantic, recursive, and sentence-based strategies. Use when implementing document processing, optimizing chunk sizes, comparing chunking approaches, benchmarking retrieval performance, or when user mentions chunking, text splitting, document segmentation, RAG optimization, or chunk evaluation.
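The strategies differ mainly in how chunk boundaries are chosen. The simplest of them, fixed-size splitting with overlap, can be sketched in a few lines (a minimal illustration only — the function name and character-based windows are assumptions, not the packaged implementation):

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Slide a chunk_size window over the text, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("x" * 2500)
# 4 chunks; each adjacent pair shares 200 characters
```

The overlap means each boundary sentence appears in two chunks, which is what makes retrieval robust to unlucky split points.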
## Installation

```bash
npx skill4agent add vanman2024/ai-dev-marketplace chunking-strategies
```

## Files

- `scripts/chunk-fixed-size.py`
- `scripts/chunk-semantic.py`
- `scripts/chunk-recursive.py`
- `scripts/benchmark-chunking.py`
- `templates/chunking-config.yaml`
- `templates/custom-splitter.py`
- `examples/chunk-markdown.py`
- `examples/chunk-code.py`
- `examples/chunk-pdf.py`

## scripts/chunk-fixed-size.py

```bash
python scripts/chunk-fixed-size.py \
  --input document.txt \
  --chunk-size 1000 \
  --overlap 200 \
  --output chunks.json
```

Options: `--chunk-size`, `--overlap`, `--split-on`.

## scripts/chunk-semantic.py

```bash
python scripts/chunk-semantic.py \
  --input document.txt \
  --max-chunk-size 1500 \
  --output chunks.json
```

## scripts/chunk-recursive.py

```bash
python scripts/chunk-recursive.py \
  --input document.md \
  --chunk-size 1000 \
  --separators '["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]' \
  --output chunks.json
```

Suggested separator lists:

- Markdown: `["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]`
- Code: `["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]`
- Plain text: `["\\n\\n", "\\n", ". ", " "]`

## scripts/benchmark-chunking.py

```bash
python scripts/benchmark-chunking.py \
  --input document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500 \
  --output benchmark-results.json
```

Sample output:

```json
{
  "fixed-1000": {
    "time_ms": 45,
    "chunk_count": 127,
    "avg_size": 982,
    "size_variance": 12.3,
    "context_score": 0.72
  },
  "semantic-1000": {
    "time_ms": 156,
    "chunk_count": 114,
    "avg_size": 1087,
    "size_variance": 234.5,
    "context_score": 0.91
  }
}
```

## templates/chunking-config.yaml

```yaml
chunking:
  # Global defaults
  default_strategy: semantic
  default_chunk_size: 1000
  default_overlap: 200

  # Strategy-specific configs
  strategies:
    fixed_size:
      chunk_size: 1000
      overlap: 200
      split_on: sentence  # sentence, word, character
    semantic:
      max_chunk_size: 1500
      min_chunk_size: 200
      preserve_paragraphs: true
      add_headers: true  # Include section headers
    recursive:
      chunk_size: 1000
      overlap: 100
      separators:
        markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
        code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
        text: ["\\n\\n", ". ", " "]

  # Document type mappings
  document_types:
    ".md": semantic
    ".py": recursive
    ".txt": fixed_size
    ".pdf": semantic
```

## templates/custom-splitter.py

```python
from typing import List, Dict
import re


class CustomChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
        """
        Implement custom chunking logic here.

        Returns:
            List of chunks with metadata:
            [
                {
                    "text": "chunk content",
                    "metadata": {
                        "chunk_id": 0,
                        "source": "document.txt",
                        "start_char": 0,
                        "end_char": 1000
                    }
                }
            ]
        """
        metadata = metadata or {}  # guard against metadata=None
        chunks = []
        # Your custom chunking logic here
        # Example: split on a custom pattern
        sections = self._split_sections(text)
        for i, section in enumerate(sections):
            chunks.append({
                "text": section,
                "metadata": {
                    "chunk_id": i,
                    "source": metadata.get("source", "unknown"),
                    "chunk_size": len(section)
                }
            })
        return chunks

    def _split_sections(self, text: str) -> List[str]:
        # Implement your splitting logic; this placeholder splits on blank lines
        return re.split(r"\n\s*\n", text)
```

## examples/chunk-markdown.py

```bash
python examples/chunk-markdown.py README.md --output readme-chunks.json
```

## examples/chunk-code.py

```bash
python examples/chunk-code.py src/main.py --language python --output code-chunks.json
```

## examples/chunk-pdf.py

Requires `pypdf` and `pdfminer.six`.

```bash
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json
```

## Recommended settings by content type

| Content Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Q&A / FAQs | 200-400 | 50 | Sentence |
| Articles | 500-1000 | 100-200 | Semantic |
| Documentation | 1000-1500 | 200-300 | Recursive |
| Books | 1000-2000 | 300-400 | Semantic |
| Source code | 500-1000 | 100 | Recursive |
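The recursive strategy recommended above for documentation and source code tries separators in order, descending to finer ones only when a piece is still too large. A minimal sketch of that behavior (assumed logic; the shipped `chunk-recursive.py` may differ in details such as overlap handling):

```python
def recursive_split(text: str, separators: list[str], chunk_size: int = 1000) -> list[str]:
    """Split on the first separator; recurse with finer separators into oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    chunks = []
    for i, piece in enumerate(pieces):
        if i > 0:
            piece = sep + piece  # keep the separator (e.g. a heading marker) with its section
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, rest, chunk_size))
        else:
            chunks.append(piece)
    return chunks
```

Because each separator is re-attached to its piece, concatenating the chunks reproduces the original text exactly, and headings stay with the sections they introduce.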
The table's values are starting points; validate them against your own corpus with `benchmark-chunking.py`.

## Quick recipes

```bash
# Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000

# Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500

# Process multiple files
for file in documents/*.txt; do
  python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename "$file" .txt).json"
done
```

Compare strategies on a representative sample:

```bash
python scripts/benchmark-chunking.py \
  --input sample-document.txt \
  --strategies fixed,semantic,recursive \
  --chunk-sizes 500,1000,1500
```

Load settings from the config template:

```python
import yaml

from chunking_strategies import get_chunker

config = yaml.safe_load(open('chunking-config.yaml'))
chunker = get_chunker(config['chunking']['default_strategy'], config)
chunks = chunker.chunk(document_text)  # document_text: your loaded document
```

Override `chunk_size` and the other defaults in `chunking-config.yaml` as needed.

## Dependencies

```bash
pip install tiktoken  # Token counting
pip install nltk      # Sentence splitting
pip install spacy     # Advanced NLP (optional)

# PDF support
pip install pypdf pdfminer.six

# Benchmarking
pip install pandas numpy scikit-learn
```
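The size statistics in the benchmark output can be reproduced with the standard library. This sketch assumes the metric definitions (in particular, that `size_variance` is reported as a standard deviation rather than a raw variance — the script may define it differently):

```python
import statistics

def chunk_stats(chunks: list[str]) -> dict:
    """Summarize chunk sizes the way benchmark-chunking.py reports them (assumed definitions)."""
    sizes = [len(c) for c in chunks]
    return {
        "chunk_count": len(sizes),
        "avg_size": statistics.mean(sizes),
        "size_variance": statistics.pstdev(sizes),  # assumption: std dev, not variance
    }

print(chunk_stats(["a" * 1000, "b" * 990, "c" * 950]))
```

A low `size_variance` (as in the `fixed-1000` sample) indicates uniform chunks; semantic chunking trades that uniformity for the higher `context_score`.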