addon-rag-ingestion-pipeline
Add-on: Multi-Format RAG Ingestion Pipeline
Use this skill when an existing project needs RAG ingestion/retrieval across multiple document formats.
Compatibility
- Works with `architect-python-uv-batch`.
- Works with `architect-python-uv-fastapi-sqlalchemy`.
- Can back a Next.js app via a Python worker service.
Inputs
Collect:
- `SOURCE_FORMATS`: one or more of `pdf`, `markdown`, `txt`, `html`, `csv`.
- `EMBED_PROVIDER`: `openai` or `sentence-transformers`.
- `VECTOR_STORE`: `pgvector`, `chroma`, or existing vector layer.
- `CHUNK_SIZE`: default `1000`.
- `CHUNK_OVERLAP`: default `150`.
- `TOP_K`: default `5`.
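The inputs above could be captured in a small validated settings object. This is a minimal sketch, not part of the skill itself: the names `RagSettings` and `parse_formats` are hypothetical, and the overlap check is an assumption about what counts as a sane configuration.

```python
from dataclasses import dataclass

# Allowed values, taken from the input list above.
ALLOWED_FORMATS = {"pdf", "markdown", "txt", "html", "csv"}
ALLOWED_EMBED_PROVIDERS = {"openai", "sentence-transformers"}


def parse_formats(raw: str) -> tuple[str, ...]:
    """Parse a comma-separated SOURCE_FORMATS value such as "pdf,markdown"."""
    formats = tuple(part.strip().lower() for part in raw.split(",") if part.strip())
    unknown = sorted(set(formats) - ALLOWED_FORMATS)
    if not formats or unknown:
        raise ValueError(f"unsupported or empty SOURCE_FORMATS: {unknown or raw!r}")
    return formats


@dataclass(frozen=True)
class RagSettings:
    """Hypothetical container for the collected inputs."""

    source_formats: tuple[str, ...]
    embed_provider: str
    vector_store: str  # "pgvector", "chroma", or the name of an existing layer
    chunk_size: int = 1000
    chunk_overlap: int = 150
    top_k: int = 5

    def __post_init__(self) -> None:
        if self.embed_provider not in ALLOWED_EMBED_PROVIDERS:
            raise ValueError(f"unknown EMBED_PROVIDER: {self.embed_provider!r}")
        # Assumption: an overlap as large as the chunk itself is a misconfiguration.
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
        if self.top_k < 1:
            raise ValueError("TOP_K must be at least 1")
```

`VECTOR_STORE` is left as a free-form string here because the input list allows an existing vector layer in addition to `pgvector` and `chroma`.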
Integration Workflow
- Add dependencies (Python worker path):

  ```bash
  uv add pypdf markdown-it-py beautifulsoup4 pandas langchain-text-splitters
  ```

  - If `EMBED_PROVIDER=openai`: `uv add openai`
  - If `EMBED_PROVIDER=sentence-transformers`: `uv add sentence-transformers`
  - If `VECTOR_STORE=chroma`: `uv add chromadb`
- Add modules:

  ```text
  src/{{MODULE_NAME}}/rag/
    loaders/pdf_loader.py
    loaders/markdown_loader.py
    loaders/text_loader.py
    loaders/html_loader.py
    loaders/csv_loader.py
    normalize.py
    chunking.py
    embeddings.py
    indexer.py
    retriever.py
  ```

- Use a normalized document contract:
  - `document_id`
  - `source_path`
  - `source_type`
  - `content`
  - `metadata` (filename/page/section/checksum/ingested_at/model_version)
- Implement ingestion entrypoint:

  ```bash
  uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown,txt
  ```

- Implement retrieval entrypoint:

  ```bash
  uv run {{PROJECT_NAME}} rag-query --q "question" --top-k 5
  ```

- Ensure both commands are wired in the project CLI/script entrypoint.
  - `rag-query` depends on an existing index from `rag-ingest`; do not run these validation commands in parallel.
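One way to wire the two commands is with `argparse` subparsers. This is a sketch under assumptions: the project may already use another CLI framework, the names `build_parser` and `main` are hypothetical, and the handler bodies are placeholders for the real ingestion and retrieval calls.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the rag-ingest and rag-query subcommands."""
    parser = argparse.ArgumentParser(prog="project-cli")  # placeholder program name
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("rag-ingest", help="load, chunk, embed, and index documents")
    ingest.add_argument("--source", required=True, help="directory to scan")
    ingest.add_argument(
        "--formats",
        required=True,
        # Split "pdf,markdown,txt" into ["pdf", "markdown", "txt"].
        type=lambda raw: [f.strip() for f in raw.split(",") if f.strip()],
        help="comma-separated list of source formats",
    )

    query = sub.add_parser("rag-query", help="retrieve the top-k chunks for a question")
    query.add_argument("--q", required=True, help="question text")
    query.add_argument("--top-k", type=int, default=5, help="number of chunks to return")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Dispatch to the ingestion or retrieval handler and return an exit code."""
    args = build_parser().parse_args(argv)
    if args.command == "rag-ingest":
        # Placeholder: call the project's ingestion pipeline here.
        print(f"ingest {args.source} formats={args.formats}")
    else:
        # Placeholder: call the project's retriever here.
        print(f"query {args.q!r} top_k={args.top_k}")
    return 0
```

Keeping both subcommands behind one `main` makes it straightforward to register a single script entrypoint for the project.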
Loader Notes
- PDF: extract per page and keep `page_number` in metadata.
- Markdown: keep heading hierarchy and section anchors in metadata.
- Text: decode with an encoding fallback (`utf-8`, then `latin-1`).
- HTML: strip script/style tags and preserve title/headings where possible.
- CSV: convert rows into stable textual records with row identifiers.
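Two of the loader notes above, the text encoding fallback and the CSV row conversion, can be sketched with the standard library alone. The function names `decode_text` and `csv_to_records` are hypothetical, the record layout is an assumption, and the CSV helper assumes a header row and well-formed rows.

```python
import csv
import io


def decode_text(data: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8, falling back to Latin-1; return (text, encoding used)."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this branch cannot fail.
        return data.decode("latin-1"), "latin-1"


def csv_to_records(csv_text: str) -> list[dict]:
    """Turn each CSV row into a text record with a row identifier."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row_id, row in enumerate(reader, start=1):
        # Keep the header order so the same row always yields the same text.
        content = "; ".join(f"{key}: {value}" for key, value in row.items())
        records.append({"content": content, "metadata": {"row_id": row_id}})
    return records
```

The row identifier here is the 1-based position of the data row, which stays stable only while the file's row order is unchanged; a natural key column, if the data has one, would be a sturdier choice.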
Minimal Defaults
normalize.py

```python
"""Text normalization helpers for RAG ingestion."""
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize Unicode, line endings, and whitespace in raw text."""
    # NFKC folds compatibility characters (e.g. full-width forms) into canonical ones.
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\r\n", "\n")
    # Collapse runs of spaces/tabs, then cap blank-line runs at one empty line.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

chunking.py

```python
"""Chunking helpers for RAG ingestion."""
from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph and sentence breaks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)
```

Guardrails
- Documentation contract for generated code:
  - Python: write module docstrings and docstrings for public classes, methods, and functions.
  - Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
  - Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
  - Apply this contract even when using the template snippets in this document; expand templates as needed.
- Deduplicate ingestion by checksum to keep re-runs idempotent.
- Store the embedding model/version with each indexed chunk so stale vectors can be identified when the model changes.
- Never interpolate user queries into raw SQL vector search; pass them as bound parameters.
- Keep ingestion async/offline for large corpora; do not block request-response paths.
- Preserve citation metadata for retrieval (`source_path`, section, page, row id).
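The checksum guardrail above can be sketched as follows, using the fields of the normalized document contract from the integration workflow. The class `NormalizedDocument`, the helpers `content_checksum` and `filter_new_documents`, and the choice of SHA-256 are all assumptions for illustration; in a real pipeline the set of seen checksums would be a lookup against the index rather than an in-memory set.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class NormalizedDocument:
    """Hypothetical container mirroring the normalized document contract."""

    document_id: str
    source_path: str
    source_type: str
    content: str
    metadata: dict = field(default_factory=dict)


def content_checksum(content: str) -> str:
    """Return the SHA-256 hex digest of the document text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def filter_new_documents(docs, seen_checksums: set[str]) -> list[NormalizedDocument]:
    """Return only documents whose checksum has not been indexed yet.

    `seen_checksums` stands in for a lookup against the index; it is
    updated in place so duplicates within one batch are also skipped.
    """
    fresh = []
    for doc in docs:
        checksum = content_checksum(doc.content)
        if checksum in seen_checksums:
            continue
        seen_checksums.add(checksum)
        # Record the checksum so later runs can recognize this document.
        doc.metadata["checksum"] = checksum
        fresh.append(doc)
    return fresh
```

Running the filter twice over the same batch returns nothing the second time, which is the idempotency the guardrail asks for.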
Validation Checklist
- Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown
uv run {{PROJECT_NAME}} rag-query --q "smoke test" --top-k 5
uv run pytest -q
```

Decision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.