# Embedding Optimization
Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.
## When to Use This Skill
Trigger this skill when:
- Building RAG (Retrieval Augmented Generation) systems
- Implementing semantic search or similarity detection
- Optimizing embedding API costs (reducing by 70-90%)
- Improving document retrieval quality through better chunking
- Processing large document corpora (thousands to millions of documents)
- Selecting between API-based vs. local embedding models
## Model Selection Framework
Choose the optimal embedding model based on requirements:

**Quick Recommendations:**
- Startup/MVP: `all-MiniLM-L6-v2` (local, 384 dims, zero API costs)
- Production: `text-embedding-3-small` (API, 1,536 dims, balanced quality/cost)
- High Quality: `text-embedding-3-large` (API, 3,072 dims, premium)
- Multilingual: `multilingual-e5-base` (local, 768 dims) or Cohere `embed-multilingual-v3.0`

For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see `references/model-selection-guide.md`.

**Model Comparison Summary:**
| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
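To make the cost column concrete, here is a back-of-the-envelope estimate of monthly API spend. Prices come from the table above; the document counts and token lengths are placeholder assumptions you would replace with your own numbers:

```python
# Price per 1M tokens, taken from the comparison table above
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-multilingual-v3.0": 0.10,
}

def monthly_embedding_cost(model: str, docs_per_month: int, avg_tokens_per_doc: int) -> float:
    """Estimated USD spend to embed docs_per_month documents with an API model."""
    total_tokens = docs_per_month * avg_tokens_per_doc
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

# e.g. 100K docs/month at ~500 tokens each with text-embedding-3-small:
# 50M tokens * $0.02/1M = $1.00/month
```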
## Chunking Strategies
Select chunking strategy based on content type and use case:
**Content Type → Strategy Mapping:**
- Documentation: Recursive (heading-aware), 800 chars, 100 overlap
- Code: Recursive (function-level), 1,000 chars, 100 overlap
- Q&A/FAQ: Fixed-size, 500 chars, 50 overlap (precise retrieval)
- Legal/Technical: Semantic (large), 1,500 chars, 200 overlap (context preservation)
- Blog Posts: Semantic (paragraph), 1,000 chars, 100 overlap
- Academic Papers: Recursive (section-aware), 1,200 chars, 150 overlap
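To illustrate the size/overlap mechanics in the mapping above, here is a minimal greedy chunker with overlap. This is a sketch only, not the repo's `scripts/chunk_document.py`: it cuts near the size limit at the coarsest separator available and carries a fixed overlap into the next chunk.

```python
def chunk_text(text, chunk_size=800, overlap=100,
               separators=("\n\n", "\n", ". ", " ")):
    """Greedy chunker: cut near chunk_size at the coarsest separator found,
    carrying `overlap` characters into the next chunk to preserve context."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to break on a paragraph, line, sentence, then word boundary
            for sep in separators:
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back for overlap, always advance
    return chunks
```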
For detailed chunking patterns, decision trees, and implementation guidance, see `references/chunking-strategies.md`.

**Quick Start with CLI:**

```bash
python scripts/chunk_document.py \
    --input document.txt \
    --content-type markdown \
    --chunk-size 800 \
    --overlap 100 \
    --output chunks.jsonl
```
## Caching Implementation
Achieve 80-90% cost reduction through content-addressable caching.

**Caching Architecture by Query Volume:**
- <10K queries/month: in-memory cache (Python `lru_cache`)
- 10K-100K queries/month: Redis (fast, TTL-based expiration)
- 100K-1M queries/month: Redis (hot) + PostgreSQL (warm)
- >1M queries/month: multi-tier (Redis + PostgreSQL + S3)

**Production Caching with Redis:**

```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
    --model text-embedding-3-small \
    --input documents.jsonl \
    --output embeddings.npy \
    --cache-backend redis \
    --cache-ttl 2592000  # 30 days
```
**Caching ROI Example:**
- 50,000 document chunks
- 20% duplicate content
- Without caching: $0.50 API cost
- With caching (60% hit rate): $0.20 API cost
- **Savings: 60% ($0.30)**
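The content-addressable idea behind the cached embedder can be sketched with an in-memory dict (a stand-in for Redis in production). `embed_fn` is any batch embedding callable you supply; nothing here is the repo's actual `cached_embedder.py`:

```python
import hashlib

def content_key(text: str, model: str) -> str:
    # Hash content + model name so upgrading models never serves stale vectors
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"

class CachedEmbedder:
    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn  # callable: list[str] -> list[list[float]]
        self.model = model
        self.cache = {}           # stand-in for Redis/PostgreSQL
        self.hits = 0
        self.misses = 0

    def embed_batch(self, texts):
        keys = [content_key(t, self.model) for t in texts]
        # Deduplicate within the batch and against the cache
        missing = {k: t for k, t in zip(keys, texts) if k not in self.cache}
        self.hits += len(keys) - len(missing)
        if missing:
            self.misses += len(missing)
            for k, vec in zip(missing, self.embed_fn(list(missing.values()))):
                self.cache[k] = vec
        return [self.cache[k] for k in keys]
```

Keying on `hash(content + model)` rather than a document ID is what makes the cache content-addressable: identical chunks hit the cache regardless of which document they came from.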
## Dimensionality Trade-offs
Balance storage, search speed, and quality:
| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
|---|---|---|---|---|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |
Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
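The storage column follows directly from float32 vectors at 4 bytes per dimension; a quick check:

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for dense float32 vectors only; real indexes (HNSW graphs,
    metadata, etc.) typically add significant overhead on top of this."""
    return num_vectors * dims * bytes_per_value / 1e9

# 1M vectors: 384d ≈ 1.5 GB, 768d ≈ 3.1 GB, 1536d ≈ 6.1 GB, 3072d ≈ 12.3 GB
```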
## Batch Processing Optimization
Maximize throughput for large-scale ingestion:

**OpenAI API:**
- Batch up to 2,048 inputs per request
- Implement rate limiting (tier-dependent: 500-5,000 RPM)
- Use parallel requests with backoff on rate limits

**Local Models (sentence-transformers):**
- GPU acceleration (CUDA, MPS for Apple Silicon)
- Batch size tuning (32-128 based on GPU memory)
- Multi-GPU support for maximum throughput

**Expected Throughput:**
- OpenAI API: 1,000-5,000 texts/minute (rate-limit dependent)
- Local GPU (RTX 3090): 5,000-10,000 texts/minute
- Local CPU: 100-500 texts/minute
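The batching and backoff advice above can be sketched as follows. `RateLimitError` is a stand-in for your SDK's rate-limit exception (e.g. `openai.RateLimitError`), and `embed_fn` is whatever batch embedding call you use:

```python
import itertools
import random
import time

MAX_BATCH = 2048  # OpenAI embeddings API accepts up to 2,048 inputs per request

class RateLimitError(Exception):
    """Stand-in for your SDK's rate-limit error."""

def batched(items, n):
    """Yield successive lists of at most n items."""
    it = iter(items)
    while chunk := list(itertools.islice(it, n)):
        yield chunk

def embed_all(embed_fn, texts, max_retries=5):
    """Embed texts in API-sized batches, retrying with exponential backoff
    plus jitter whenever the provider signals a rate limit."""
    vectors = []
    for batch in batched(texts, MAX_BATCH):
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except RateLimitError:
                time.sleep(2 ** attempt + random.random())  # 1-2s, 2-3s, 4-5s, ...
        else:
            raise RuntimeError("rate-limit retries exhausted")
    return vectors
```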
## Performance Monitoring
Track key metrics for optimization:
**Critical Metrics:**
- Latency: Embedding generation time (p50, p95, p99)
- Throughput: Embeddings per second/minute
- Cost: API usage tracking (USD per 1K/1M tokens)
- Cache Efficiency: Hit rate percentage
For detailed monitoring setup, metric collection patterns, and dashboarding, see `references/performance-monitoring.md`.

**Monitor with Wrapper:**

```python
from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002,  # OpenAI text-embedding-3-small pricing
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```
## Working Examples
See the `examples/` directory for complete implementations:

**Python Examples:**
- `examples/openai_cached.py` - OpenAI embeddings with Redis caching
- `examples/local_embedder.py` - sentence-transformers local embedding
- `examples/smart_chunker.py` - Content-aware recursive chunking
- `examples/performance_monitor.py` - Pipeline performance tracking
- `examples/batch_processor.py` - Large-scale document processing
All examples include:
- Complete, runnable code
- Dependency installation instructions
- Error handling and retry logic
- Configuration options
## Integration Points
**Upstream (this skill provides output to):**
- Vector Databases: embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
- RAG Systems: optimized embeddings for retrieval pipelines
- Semantic Search: query and document embeddings for similarity search

**Downstream (this skill consumes input from):**
- Document Processing: chunk documents before embedding
- Data Ingestion: process documents from various sources
**Related Skills:**
- For RAG architecture, see the `building-ai-chat` skill
- For vector database operations, see the `databases-vector` skill
- For data ingestion pipelines, see the `ingesting-data` skill
## Common Patterns
**Pattern 1: RAG Pipeline**

Document → Chunk → Embed → Store (vector DB) → Retrieve

**Pattern 2: Semantic Search**

Query → Embed → Search (vector DB) → Rank → Display

**Pattern 3: Multi-Stage Retrieval (Cost Optimization)**

Query → Cheap Embedding (384d) → Initial Search →
Expensive Embedding (1,536d) → Rerank Top-K → Return

Cost Savings: 70% reduction vs. single-stage with expensive embeddings
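Pattern 3 can be sketched end to end. This toy version scans a plain list instead of a real vector index, and `cheap_embed`/`rich_embed` are whatever model callables you pair (e.g. a 384-d local model and `text-embedding-3-small`):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def multi_stage_search(query, docs, cheap_embed, rich_embed, top_n=100, top_k=5):
    """Stage 1: rank every document with the cheap embedding.
    Stage 2: re-embed only the top_n survivors with the expensive model and rerank.
    Only top_n docs ever touch the expensive model, which is where the savings come from."""
    q_cheap = cheap_embed(query)
    coarse = sorted(docs, key=lambda d: cosine(cheap_embed(d), q_cheap),
                    reverse=True)[:top_n]
    q_rich = rich_embed(query)
    return sorted(coarse, key=lambda d: cosine(rich_embed(d), q_rich),
                  reverse=True)[:top_k]
```

In production, stage 1 would query a prebuilt ANN index of cheap embeddings rather than embedding documents on the fly; the structure of the two passes is the same.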
## Quick Reference Checklist
**Model Selection:**
- Identified data privacy requirements (local vs. API)
- Calculated expected query volume
- Determined quality requirements (good/high/highest)
- Checked multilingual support needs

**Chunking:**
- Analyzed content type (code, docs, legal, etc.)
- Selected appropriate chunk size (500-1,500 chars)
- Set overlap to prevent context loss (50-200 chars)
- Validated chunks preserve semantic boundaries

**Caching:**
- Implemented content-addressable hashing
- Selected cache backend (Redis, PostgreSQL)
- Set TTL based on content volatility
- Monitored cache hit rate (target: >60%)

**Performance:**
- Tracked latency (embedding generation time)
- Measured throughput (embeddings/sec)
- Monitored costs (USD spent on API calls)
- Optimized batch sizes for maximum efficiency