sentence-transformers
Sentence Transformers - State-of-the-Art Embeddings
Python framework for sentence and text embeddings using transformers.
When to use Sentence Transformers
Use when:
- Need high-quality embeddings for RAG
- Semantic similarity and search
- Text clustering and classification
- Multilingual embeddings (100+ languages)
- Running embeddings locally (no API)
- Cost-effective alternative to OpenAI embeddings
Metrics:
- 15,700+ GitHub stars
- 5000+ pre-trained models
- 100+ languages supported
- Based on PyTorch/Transformers
Use alternatives instead:
- OpenAI Embeddings: Need API-based, highest quality
- Instructor: Task-specific instructions
- Cohere Embed: Managed service
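The local-vs-API tradeoff above can be sketched as a small router that tries a free local encoder first and only falls back to a paid API on failure. This is an illustrative pattern, not library code: `encode_local` and `encode_api` are hypothetical stand-ins for `SentenceTransformer.encode` and an OpenAI-style embeddings call.

```python
# Sketch: prefer a free local embedder, fall back to a paid API.
# encode_local / encode_api are hypothetical placeholders.
from typing import Callable, List


def embed_with_fallback(
    texts: List[str],
    encode_local: Callable[[List[str]], list],
    encode_api: Callable[[List[str]], list],
) -> list:
    """Try the local model first; only pay for the API if it fails."""
    try:
        return encode_local(texts)
    except Exception:
        return encode_api(texts)


# Example with stub encoders in place of real models:
local = lambda ts: [[0.0] * 4 for _ in ts]
api = lambda ts: [[1.0] * 4 for _ in ts]
print(len(embed_with_fallback(["a", "b"], local, api)))  # 2
```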
Quick start
Installation
```bash
pip install sentence-transformers
```

Basic usage
```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

# Cosine similarity
from sentence_transformers.util import cos_sim

similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
Popular models
General purpose
```python
# Fast, good quality (384 dim)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Better quality (768 dim)
model = SentenceTransformer('all-mpnet-base-v2')

# Best quality (1024 dim, slower)
model = SentenceTransformer('all-roberta-large-v1')
```
Multilingual
```python
# 50+ languages
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# 100+ languages
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
```
Domain-specific
```python
# Legal domain
model = SentenceTransformer('nlpaueb/legal-bert-base-uncased')

# Scientific papers
model = SentenceTransformer('allenai/specter')

# Code
model = SentenceTransformer('microsoft/codebert-base')
```
Semantic search
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus
corpus = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Neural networks are powerful"
]

# Encode corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query
query = "What is Python?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
print(hits)
```
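`util.semantic_search` is essentially a brute-force nearest-neighbour scan over similarity scores. As a mental model, a dependency-free pure-Python equivalent (lists instead of tensors, toy 2-d vectors) might look like this; the helper names are illustrative, not library APIs:

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def brute_force_search(query, corpus_embeddings, top_k=3):
    """Score every corpus vector against the query, return the best top_k."""
    scores = [
        {"corpus_id": i, "score": cosine(query, emb)}
        for i, emb in enumerate(corpus_embeddings)
    ]
    return sorted(scores, key=lambda h: h["score"], reverse=True)[:top_k]


# Toy vectors: the first corpus entry points the same way as the query.
corpus_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = brute_force_search([2.0, 0.0], corpus_embeddings, top_k=2)
print(hits[0]["corpus_id"])  # 0
```

For corpora beyond a few hundred thousand vectors, this linear scan is what you would replace with an approximate-nearest-neighbour index.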
Similarity computation
```python
# Cosine similarity
similarity = util.cos_sim(embedding1, embedding2)

# Dot product
similarity = util.dot_score(embedding1, embedding2)

# Pairwise cosine similarity
similarities = util.cos_sim(embeddings, embeddings)
```
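For unit-length vectors, cosine similarity and dot product coincide, which is why `dot_score` on normalized embeddings is a cheap substitute for `cos_sim`. A dependency-free check of that identity, with small illustrative helpers:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
na, nb = l2_normalize(a), l2_normalize(b)
# After normalization the two scores agree to floating-point precision.
print(abs(cosine(a, b) - dot(na, nb)) < 1e-12)  # True
```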
Batch encoding
```python
# Efficient batch processing
sentences = ["sentence 1", "sentence 2", ...] * 1000
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=False  # or True for PyTorch tensors
)
```
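Since re-encoding a large corpus is usually the dominant cost, a small on-disk cache keyed by a hash of each text pays off quickly. This is a generic sketch, not a library feature: `encode_fn` stands in for `model.encode`, and the cache path is an arbitrary choice.

```python
import hashlib
import pickle
from pathlib import Path


def cached_encode(texts, encode_fn, cache_path="embeddings.pkl"):
    """Encode only texts not seen before; persist results between runs."""
    path = Path(cache_path)
    cache = pickle.loads(path.read_bytes()) if path.exists() else {}
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [t for t, k in zip(texts, keys) if k not in cache]
    if missing:
        # Pay the encoding cost only for unseen texts.
        for t, vec in zip(missing, encode_fn(missing)):
            cache[hashlib.sha256(t.encode("utf-8")).hexdigest()] = vec
        path.write_bytes(pickle.dumps(cache))
    return [cache[k] for k in keys]
```

A second call with the same texts returns straight from the pickle file without invoking `encode_fn` at all.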
Fine-tuning
```python
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# model: a SentenceTransformer loaded as shown earlier

# Training data: sentence pairs with similarity labels
train_examples = [
    InputExample(texts=['sentence 1', 'sentence 2'], label=0.8),
    InputExample(texts=['sentence 3', 'sentence 4'], label=0.3),
]
train_dataloader = DataLoader(train_examples, batch_size=16)

# Loss function
train_loss = losses.CosineSimilarityLoss(model)

# Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100
)

# Save
model.save('my-finetuned-model')
```
LangChain integration
```python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Use with vector stores
from langchain_chroma import Chroma

# docs: a list of LangChain Document objects
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)
```
LlamaIndex integration
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, VectorStoreIndex

embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
Settings.embed_model = embed_model

# Use in index
index = VectorStoreIndex.from_documents(documents)
```
Model selection guide
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General, prototyping |
| all-mpnet-base-v2 | 768 | Medium | Better | Production RAG |
| all-roberta-large-v1 | 1024 | Slow | Best | High accuracy needed |
| paraphrase-multilingual | 768 | Medium | Good | Multilingual |
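The table above can be folded into a tiny selection helper. The model names mirror the table rows, but the function itself is just an illustrative convenience, not part of the library:

```python
def pick_model(multilingual=False, need_best_quality=False, prototyping=True):
    """Map the rules of thumb from the model selection table to a model name."""
    if multilingual:
        return "paraphrase-multilingual-mpnet-base-v2"
    if need_best_quality:
        return "all-roberta-large-v1"
    if prototyping:
        return "all-MiniLM-L6-v2"
    # Default for production RAG: balanced speed and quality
    return "all-mpnet-base-v2"


print(pick_model(prototyping=False))  # all-mpnet-base-v2
```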
Best practices
- Start with all-MiniLM-L6-v2 - Good baseline
- Normalize embeddings - Better for cosine similarity
- Use GPU if available - 10× faster encoding
- Batch encoding - More efficient
- Cache embeddings - Expensive to recompute
- Fine-tune for domain - Improves quality
- Test different models - Quality varies by task
- Monitor memory - Large models need more RAM
Performance
| Model | Speed (sentences/sec) | Memory | Dimension |
|---|---|---|---|
| MiniLM | ~2000 | 120MB | 384 |
| MPNet | ~600 | 420MB | 768 |
| RoBERTa | ~300 | 1.3GB | 1024 |
Resources
- GitHub: https://github.com/UKPLab/sentence-transformers ⭐ 15,700+
- Models: https://huggingface.co/sentence-transformers
- Docs: https://www.sbert.net
- License: Apache 2.0