rag-skills

RAG Skills for LlamaFarm

Framework-specific patterns and code review checklists for the RAG component.

Extends: python-skills - All Python best practices apply here.

Component Overview

Aspect	Technology	Version
Python	Python	3.11+
Document Processing	LlamaIndex	0.13+
Vector Storage	ChromaDB	1.0+
Task Queue	Celery	5.5+
Embeddings	Universal/Ollama/OpenAI	Multiple

Directory Structure

rag/
├── api.py                 # Search and database APIs
├── celery_app.py          # Celery configuration
├── main.py                # Entry point
├── core/
│   ├── base.py            # Document, Component, Pipeline ABCs
│   ├── factories.py       # Component factories
│   ├── ingest_handler.py  # File ingestion with safety checks
│   ├── blob_processor.py  # Binary file processing
│   ├── settings.py        # Pydantic settings
│   └── logging.py         # RAGStructLogger
├── components/
│   ├── embedders/         # Embedding providers
│   ├── extractors/        # Metadata extractors
│   ├── parsers/           # Document parsers (LlamaIndex)
│   ├── retrievers/        # Retrieval strategies
│   └── stores/            # Vector stores (ChromaDB, FAISS)
├── tasks/                 # Celery tasks
│   ├── ingest_tasks.py    # File ingestion
│   ├── search_tasks.py    # Database search
│   ├── query_tasks.py     # Complex queries
│   ├── health_tasks.py    # Health checks
│   └── stats_tasks.py     # Statistics
└── utils/
    └── embedding_safety.py  # Circuit breaker, validation

Quick Reference

Topic	File	Key Points
LlamaIndex	llamaindex.md	Document parsing, chunking, node conversion
ChromaDB	chromadb.md	Collections, embeddings, distance metrics
Celery	celery.md	Task routing, error handling, worker config
Performance	performance.md	Batching, caching, deduplication

Core Patterns

Document Dataclass

python

import uuid
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Document:
    content: str
    # default_factory gives every instance its own dict and its own UUID
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    source: str | None = None
    embeddings: list[float] | None = None

Component Abstract Base Class

python

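from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any

# Document, ProcessingResult, and RAGStructLogger are project types
# defined elsewhere in the rag package (RAGStructLogger is in core/logging.py).

class Component(ABC):
    def __init__(
        self,
        name: str | None = None,
        config: dict[str, Any] | None = None,
        project_dir: Path | None = None,
    ):
        # Fall back to the class name and an empty config when none is given
        self.name = name or self.__class__.__name__
        self.config = config or {}
        self.logger = RAGStructLogger(__name__).bind(name=self.name)
        self.project_dir = project_dir

    @abstractmethod
    def process(self, documents: list[Document]) -> ProcessingResult:
        pass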

Retrieval Strategy Pattern

python

class RetrievalStrategy(Component, ABC):
    @abstractmethod
    def retrieve(
        self,
        query_embedding: list[float],
        vector_store,
        top_k: int = 5,
        **kwargs
    ) -> RetrievalResult:
        pass

    @abstractmethod
    def supports_vector_store(self, vector_store_type: str) -> bool:
        pass
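
The sketch below shows one way a concrete strategy could satisfy this interface. It is not taken from the project: `InMemoryStore`, its `store_type` attribute, the fields of `RetrievalResult`, and `BruteForceCosineStrategy` are all hypothetical, and the `Component` base is dropped to keep the example self-contained.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RetrievalResult:  # hypothetical shape; the real class may differ
    ids: list[str]
    scores: list[float]


class InMemoryStore:  # hypothetical stand-in for a vector store
    store_type = "memory"

    def __init__(self, vectors: dict[str, list[float]]):
        self.vectors = vectors


class RetrievalStrategy(ABC):  # Component base omitted for brevity
    @abstractmethod
    def retrieve(self, query_embedding, vector_store, top_k=5, **kwargs):
        pass

    @abstractmethod
    def supports_vector_store(self, vector_store_type: str) -> bool:
        pass


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BruteForceCosineStrategy(RetrievalStrategy):
    def retrieve(self, query_embedding, vector_store, top_k=5, **kwargs):
        if not self.supports_vector_store(vector_store.store_type):
            raise ValueError(f"unsupported store: {vector_store.store_type}")
        # Score every stored vector, then keep the top_k highest scores.
        scored = sorted(
            ((cosine(query_embedding, vec), key)
             for key, vec in vector_store.vectors.items()),
            reverse=True,
        )[:top_k]
        return RetrievalResult(
            ids=[key for _, key in scored],
            scores=[score for score, _ in scored],
        )

    def supports_vector_store(self, vector_store_type: str) -> bool:
        return vector_store_type == "memory"


store = InMemoryStore({"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]})
result = BruteForceCosineStrategy().retrieve([1.0, 0.1], store, top_k=2)
```

Checking `supports_vector_store` before retrieving lets a caller pair strategies with stores and fail early on a mismatch, which appears to be the purpose of the second abstract method.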

Embedder with Circuit Breaker

python

class Embedder(Component):
    DEFAULT_FAILURE_THRESHOLD = 5
    DEFAULT_RESET_TIMEOUT = 60.0

    def __init__(self, ...):  # arguments elided in the source
        super().__init__(...)
        # Read from self.config, which Component.__init__ guarantees is a dict
        self._circuit_breaker = CircuitBreaker(
            failure_threshold=self.config.get(
                "failure_threshold", self.DEFAULT_FAILURE_THRESHOLD
            ),
            reset_timeout=self.config.get(
                "reset_timeout", self.DEFAULT_RESET_TIMEOUT
            ),
        )
        self._fail_fast = self.config.get("fail_fast", True)

    def embed_text(self, text: str) -> list[float]:
        # Refuse to call the provider while the circuit is open
        self.check_circuit_breaker()
        try:
            embedding = self._call_embedding_api(text)
            self.record_success()
            return embedding
        except Exception as e:
            self.record_failure(e)
            if self._fail_fast:
                raise EmbedderUnavailableError(str(e)) from e
            # Degraded mode: return a zero vector of the expected dimension
            return [0.0] * self.get_embedding_dimension()
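
The `CircuitBreaker` class itself is not shown here. As a rough sketch of what a breaker with these two parameters typically does, the example below opens after `failure_threshold` consecutive failures and allows a trial call once `reset_timeout` seconds have passed. The method names, the `CircuitOpenError` exception, and the injectable `clock` argument are assumptions for illustration, not the API in `utils/embedding_safety.py`.

```python
import time


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the example can fake time
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After reset_timeout the breaker lets one trial call through.
        return self._clock() - self._opened_at < self.reset_timeout

    def check(self) -> None:
        if self.is_open:
            raise CircuitOpenError("embedding provider temporarily disabled")

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # (Re)start the timeout window from the latest failure.
            self._opened_at = self._clock()


now = [0.0]  # fake clock value, advanced by hand
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()  # threshold reached, circuit opens
```

With `fail_fast` enabled, an open circuit turns a slow, failing provider into an immediate error, so ingestion workers do not pile up waiting on timeouts.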

Review Checklist Summary

When reviewing RAG code:

LlamaIndex (Medium priority)
- Proper chunking configuration
- Metadata preservation during parsing
- Error handling for unsupported formats
ChromaDB (High priority)
- Thread-safe client access
- Proper distance metric selection
- Metadata type compatibility
Celery (High priority)
- Task routing to correct queue
- Error logging with context
- Proper serialization
Performance (Medium priority)
- Batch processing for embeddings
- Deduplication enabled
- Appropriate caching

See individual topic files for detailed checklists with grep patterns.
