ai-data-engineering


AI Data Engineering


Purpose


Build data infrastructure for AI/ML systems including RAG pipelines, feature stores, and embedding generation. Provides architecture patterns, orchestration workflows, and evaluation metrics for production AI applications.

When to Use


Use this skill when:
  • Building RAG (Retrieval-Augmented Generation) pipelines
  • Implementing semantic search or vector databases
  • Setting up ML feature stores for real-time serving
  • Creating embedding generation pipelines
  • Evaluating RAG quality with RAGAS metrics
  • Orchestrating data workflows for AI systems
  • Integrating with frontend skills (ai-chat, search-filter)
Skip this skill if:
  • Building traditional CRUD applications (use databases-relational)
  • Simple key-value storage (use databases-nosql)
  • No AI/ML components in the application

RAG Pipeline Architecture


RAG pipelines have 5 distinct stages. Understanding this architecture is critical for production implementations.
```
┌─────────────────────────────────────────────────────────────┐
│                    RAG Pipeline (5 Stages)                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. INGESTION → Load documents (PDF, DOCX, Markdown)        │
│  2. INDEXING → Chunk (512 tokens) + Embed + Store           │
│  3. RETRIEVAL → Query embedding + Vector search + Filters   │
│  4. GENERATION → Context injection + LLM streaming          │
│  5. EVALUATION → RAGAS metrics (faithfulness, relevancy)    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
For complete RAG architecture with implementation patterns, see:
  • references/rag-architecture.md
    - Detailed 5-stage breakdown
  • examples/langchain-rag/basic_rag.py
    - Working implementation
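
The five stages can be walked through end to end in miniature. This is a deliberately toy sketch with stand-in components (a string list for ingestion, word overlap instead of vector search, a template instead of an LLM); the function names `ingest`, `index_docs`, `retrieve`, `generate`, and `faithful` are hypothetical, not from any library:

```python
def ingest():
    # 1. INGESTION: stand-in for loading PDFs/DOCX/Markdown
    return ["Paris is the capital of France.", "Berlin is the capital of Germany."]

def index_docs(docs):
    # 2. INDEXING: pair each "chunk" with a toy bag-of-words "embedding"
    return [(d, set(d.lower().strip(".").split())) for d in docs]

def retrieve(query, index, k=1):
    # 3. RETRIEVAL: rank chunks by word overlap with the query
    q = set(query.lower().strip("?").split())
    return sorted(index, key=lambda pair: len(q & pair[1]), reverse=True)[:k]

def generate(query, hits):
    # 4. GENERATION: inject retrieved context (a template stands in for the LLM)
    return f"From context: {hits[0][0]}"

def faithful(answer, hits):
    # 5. EVALUATION: crude faithfulness proxy -- answer quotes its context
    return any(chunk in answer for chunk, _ in hits)

hits = retrieve("What is the capital of France?", index_docs(ingest()))
answer = generate("What is the capital of France?", hits)
```

Each stand-in maps one-to-one onto the production component named in the diagram, which is the useful mental model when swapping in LangChain, Qdrant, and an LLM.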

Chunking Strategies


Chunking is the most critical decision for RAG quality. Poor chunking breaks retrieval.
Default Recommendation:
  • Size: 512 tokens
  • Overlap: 50-100 tokens
  • Method: Fixed token-based
Why these values:
  • Too small (<256 tokens): Loses context, requires many retrievals
  • Too large (>1024 tokens): Includes irrelevant content, hits token limits
  • Overlap prevents information loss at chunk boundaries
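The size/overlap arithmetic is worth seeing once. A minimal sketch (splitting a pre-tokenized list; production code would count model tokens with a real tokenizer such as tiktoken):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Each window starts (size - overlap) tokens after the previous one,
    so consecutive chunks share exactly `overlap` tokens at the boundary."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
# 1200 tokens -> 3 chunks starting at offsets 0, 462, and 924
```

The shared boundary tokens are what prevent a sentence that straddles two chunks from being lost to retrieval.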
Alternative strategies for special cases:

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

# Code-aware chunking (preserves functions/classes)
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=50,
)

# Semantic chunking (splits on meaning, not tokens)
semantic_splitter = SemanticChunker(
    embeddings=embeddings,  # any LangChain embeddings instance
    breakpoint_threshold_type="percentile",  # split at semantic boundaries
)
```

**See:** `references/chunking-strategies.md` for complete decision framework

Embedding Generation


Embedding quality directly impacts retrieval accuracy. Voyage AI is currently best-in-class.
Primary Recommendation: Voyage AI voyage-3
  • Dimensions: 1024
  • MTEB Score: 69.0 (highest as of Dec 2025)
  • Cost: $$$ but 9.74% better than OpenAI
  • Use for: Production systems requiring best retrieval quality
Cost-Effective Alternative: OpenAI text-embedding-3-small
  • Dimensions: 1536
  • MTEB Score: 62.3
  • Cost: $ (5x cheaper than voyage-3)
  • Use for: Development, prototyping, cost-sensitive applications
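Retrieval compares the query embedding against stored embeddings, most commonly by cosine similarity. A minimal sketch of the math (tiny vectors for illustration; real embeddings have 1024 or 1536 dimensions, which is also why vectors from two different models can never be compared against each other):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); 1.0 = same direction, 0.0 = orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]
doc_vec = [0.8, 0.2, 0.1]
score = cosine_similarity(query_vec, doc_vec)  # close to 1.0: strong match
```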
Implementation:

```python
from langchain_voyageai import VoyageAIEmbeddings
from langchain_openai import OpenAIEmbeddings

# Production (best quality)
embeddings = VoyageAIEmbeddings(
    model="voyage-3",
    voyage_api_key="your-api-key",
)

# Development (cost-effective)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key="your-api-key",
)
```

**See:** `references/embedding-strategies.md` for complete provider comparison

RAGAS Evaluation Metrics


Traditional metrics (BLEU, ROUGE) don't measure RAG quality. RAGAS provides LLM-as-judge evaluation.
4 Core Metrics:

| Metric | Measures | Good Score |
|---|---|---|
| Faithfulness | Factual consistency with retrieved context | > 0.8 |
| Answer Relevancy | Does the answer address the user's question? | > 0.7 |
| Context Precision | Are retrieved chunks actually relevant? | > 0.6 |
| Context Recall | Were all necessary chunks retrieved? | > 0.7 |
Quick evaluation script:

```bash
# Run RAGAS evaluation (TOKEN-FREE script execution)
python scripts/evaluate_rag.py --dataset eval_data.json --output results.json
```

**Manual implementation:**

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France's capital is Paris."]],
    "ground_truth": ["Paris"]
})

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(f"Faithfulness: {result['faithfulness']}")
print(f"Answer Relevancy: {result['answer_relevancy']}")
```

**See:** `references/evaluation-metrics.md` for complete RAGAS implementation guide

Feature Stores


Feature stores solve the "training-serving skew" problem by providing consistent feature computation.
Primary Recommendation: Feast - Open source, works with any backend (PostgreSQL, Redis, DynamoDB, S3, BigQuery, Snowflake)
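
The skew problem in miniature: if training and serving each compute a feature in their own code path, the two definitions drift apart; a feature store forces a single shared definition. A toy illustration (the `total_orders` function is hypothetical, standing in for a registered feature view):

```python
# One feature definition, registered once and reused by both paths
def total_orders(order_history):
    return len(order_history)

orders = {"user_1001": ["o1", "o2", "o3"]}

training_value = total_orders(orders["user_1001"])  # offline: building the training set
serving_value = total_orders(orders["user_1001"])   # online: real-time inference
# Identical values by construction -- no training-serving skew
```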
Basic usage:

```python
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")

# Online serving (low-latency)
features = store.get_online_features(
    features=["user_features:total_orders"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```

**See:** `references/feature-stores.md` for complete Feast setup and alternatives (Tecton, Hopsworks)

LangChain Orchestration


LangChain is the primary framework for LLM orchestration with the largest ecosystem (24,215+ API reference snippets).
Context7 Library ID:
/websites/langchain_oss_python_langchain
(Trust: High, Snippets: 435)
Basic RAG Chain:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_voyageai import VoyageAIEmbeddings

# Setup retriever
vectorstore = QdrantVectorStore(
    client=qdrant_client,
    collection_name="documents",
    embedding=VoyageAIEmbeddings(model="voyage-3"),
)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})

# Build chain
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever, "question": lambda x: x}
    | prompt
    | ChatOpenAI()
    | StrOutputParser()
)

# Stream response
for chunk in chain.stream("What is the capital of France?"):
    print(chunk, end="", flush=True)
```

**See:** `references/langchain-patterns.md` - Complete LangChain 0.3+ patterns with streaming and hybrid search

Orchestration Tools


Modern AI pipelines require workflow orchestration beyond cron jobs.
Primary Recommendation: Dagster (for ML/AI pipelines) - Asset-centric design, best lineage tracking, perfect for RAG
Example: Embedding Pipeline

```python
from dagster import asset
from langchain_voyageai import VoyageAIEmbeddings

@asset
def raw_documents():
    """Load documents from S3."""
    return documents

@asset
def chunked_documents(raw_documents):
    """Split into 512-token chunks with 50-token overlap."""
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
    return splitter.split_documents(raw_documents)

@asset
def embedded_documents(chunked_documents):
    """Generate embeddings with Voyage AI."""
    embeddings = VoyageAIEmbeddings(model="voyage-3")
    return embeddings.embed_documents([doc.page_content for doc in chunked_documents])
```

**See:** `references/orchestration-tools.md` for complete Dagster patterns and alternatives (Prefect, Airflow 3.0, dbt)

Integration with Frontend Skills


ai-chat Skill → RAG Backend


The ai-chat skill consumes RAG pipeline outputs for streaming responses.
Backend API (FastAPI):

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

app = FastAPI()

@app.post("/api/rag/stream")
async def stream_rag(query: str):
    async def generate():
        chain = RetrievalQA.from_chain_type(
            llm=OpenAI(streaming=True),
            retriever=vectorstore.as_retriever(),
        )
        async for chunk in chain.astream(query):
            yield chunk
    return StreamingResponse(generate(), media_type="text/plain")
```

**See:** `references/rag-architecture.md` for complete frontend integration patterns

search-filter Skill → Semantic Search


The search-filter skill uses semantic search backends for vector similarity.
Backend (Qdrant + Voyage AI):

```python
from fastapi import FastAPI
from qdrant_client import QdrantClient
from langchain_voyageai import VoyageAIEmbeddings

app = FastAPI()

@app.post("/api/search/semantic")
async def semantic_search(query: str, filters: dict):
    query_vector = VoyageAIEmbeddings(model="voyage-3").embed_query(query)
    results = QdrantClient().search(
        collection_name="documents",
        query_vector=query_vector,
        query_filter=filters,
        limit=10,
    )
    return {"results": results}
```

Data Versioning


Primary Recommendation: LakeFS (acquired DVC team November 2025)
Git-like operations on data lakes: branch, commit, merge, time travel. Works with S3/Azure/GCS.
```python
import lakefs

branch = lakefs.Branch("main").create("experiment-voyage-3")
branch.commit("Updated embeddings to voyage-3")
branch.merge_into("main")
```

**See:** `references/data-versioning.md` for complete LakeFS setup

Quick Start Workflow


**1. Set up vector database:**

```bash
# Run Qdrant setup script (TOKEN-FREE execution)
python scripts/setup_qdrant.py --collection docs --dimension 1024
```

**2. Chunk and embed documents:**

```bash
# Chunk documents (TOKEN-FREE execution)
python scripts/chunk_documents.py \
    --input data/documents/ \
    --chunk-size 512 \
    --overlap 50 \
    --output data/chunks/
```

**3. Implement RAG pipeline:**

See `examples/langchain-rag/basic_rag.py` for complete working example.

**4. Evaluate with RAGAS:**

```bash
# Run evaluation (TOKEN-FREE execution)
python scripts/evaluate_rag.py \
    --dataset data/eval_qa.json \
    --output results/ragas_metrics.json
```

**5. Deploy with orchestration:**

See `examples/dagster-pipelines/embedding_pipeline.py` for production deployment.

Dependencies


Required Python packages:

```bash
# Core RAG
pip install langchain langchain-core langchain-openai langchain-voyageai langchain-qdrant

# Vector database
pip install qdrant-client

# Evaluation
pip install ragas datasets

# Feature stores
pip install feast

# Orchestration
pip install dagster dagster-webserver

# Data versioning
pip install lakefs-client
```

**Optional for alternatives:**

```bash
# LlamaIndex (alternative to LangChain)
pip install llama-index

# dbt (SQL transformations)
pip install dbt-core dbt-postgres

# Prefect (alternative orchestration)
pip install prefect
```

Troubleshooting


Common Issues:
1. Poor retrieval quality - Check chunk size (try 512 tokens), increase overlap (50-100), try hybrid search, re-rank with Cohere
2. Slow embedding generation - Batch documents (100-1000), use async APIs, cache with Redis, use smaller model for dev
3. High LLM costs - Reduce retrieved chunks (k=3), use cheaper re-ranking models, cache frequent queries
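
The query-caching idea from item 3 can be sketched with the standard library (a hypothetical `cached_answer` wrapper; production systems would use Redis with a TTL instead of an in-process cache):

```python
from functools import lru_cache

llm_calls = 0  # stands in for billable LLM + retrieval round trips

@lru_cache(maxsize=1024)
def cached_answer(query: str) -> str:
    global llm_calls
    llm_calls += 1
    return f"answer to: {query}"

cached_answer("What is RAG?")
cached_answer("What is RAG?")  # repeat query served from cache; no second call
```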
**See:** `references/rag-architecture.md` for complete troubleshooting guide

Best Practices


Chunking: Default to 512 tokens with 50-token overlap. Use semantic chunking for complex documents. Preserve code structure for source code.
Embeddings: Use Voyage AI voyage-3 for production, OpenAI text-embedding-3-small for development. Never mix embedding models (re-embed everything if changing).
Evaluation: Run RAGAS metrics on every pipeline change. Maintain test dataset of 50+ question-answer pairs. Track metrics over time.
Orchestration: Use Dagster for ML/AI pipelines, dbt for SQL transformations only. Version control all pipeline code.
Frontend Integration: Always stream LLM responses. Implement retry logic. Show citations/sources to users. Handle empty results gracefully.
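
The retry logic mentioned under Frontend Integration is typically exponential backoff. A minimal sketch (the `with_retries` helper is hypothetical; real services would also cap total delay and add jitter):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn(); on failure wait base_delay, then 2x, 4x... before retrying."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

state = {"calls": 0}

def flaky():
    # Fails twice (simulating transient network/LLM errors), then succeeds
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky)
```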

Additional Resources


Reference Documentation:
  • references/rag-architecture.md
    - Complete RAG pipeline guide
  • references/chunking-strategies.md
    - Decision framework for chunking
  • references/embedding-strategies.md
    - Embedding model comparison
  • references/langchain-patterns.md
    - LangChain 0.3+ patterns
  • references/feature-stores.md
    - Feast setup and alternatives
  • references/evaluation-metrics.md
    - RAGAS implementation guide
Working Examples:
  • examples/langchain-rag/basic_rag.py
    - Simple RAG chain
  • examples/langchain-rag/streaming_rag.py
    - Streaming responses
  • examples/langchain-rag/hybrid_search.py
    - Vector + BM25
  • examples/llamaindex-agents/query_engine.py
    - LlamaIndex alternative
  • examples/feast-features/
    - Complete feature store setup
  • examples/dagster-pipelines/embedding_pipeline.py
    - Production pipeline
Executable Scripts (TOKEN-FREE):
  • scripts/evaluate_rag.py
    - RAGAS evaluation runner
  • scripts/chunk_documents.py
    - Document chunking utility
  • scripts/benchmark_retrieval.py
    - Retrieval quality benchmark
  • scripts/setup_qdrant.py
    - Qdrant collection setup