sentence-transformers


Sentence Transformers - State-of-the-Art Embeddings


Python framework for sentence and text embeddings using transformers.

When to use Sentence Transformers


Use when:
  • Need high-quality embeddings for RAG
  • Semantic similarity and search
  • Text clustering and classification
  • Multilingual embeddings (100+ languages)
  • Running embeddings locally (no API)
  • Cost-effective alternative to OpenAI embeddings
Metrics:
  • 15,700+ GitHub stars
  • 5000+ pre-trained models
  • 100+ languages supported
  • Based on PyTorch/Transformers
Use alternatives instead:
  • OpenAI Embeddings: when you need an API-based service with the highest quality
  • Instructor: when you need task-specific embedding instructions
  • Cohere Embed: when you want a managed service

Quick start


Installation


bash
pip install sentence-transformers

Basic usage


python
from sentence_transformers import SentenceTransformer

Load model


model = SentenceTransformer('all-MiniLM-L6-v2')

Generate embeddings


sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

Cosine similarity


from sentence_transformers.util import cos_sim

similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
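For intuition, cos_sim is the dot product of the two vectors divided by the product of their lengths. A dependency-free sketch of the same computation (a toy stand-in, not the library's implementation):

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the L2 norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```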

Popular models


General purpose


python

Fast, good quality (384 dim)


model = SentenceTransformer('all-MiniLM-L6-v2')

Better quality (768 dim)


model = SentenceTransformer('all-mpnet-base-v2')

Best quality (1024 dim, slower)


model = SentenceTransformer('all-roberta-large-v1')

Multilingual


python

50+ languages


model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

100+ languages


model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

Domain-specific


python

Legal domain


model = SentenceTransformer('nlpaueb/legal-bert-base-uncased')

Scientific papers


model = SentenceTransformer('allenai/specter')

Code


model = SentenceTransformer('microsoft/codebert-base')

Semantic search


python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

Corpus


corpus = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Neural networks are powerful"
]

Encode corpus


corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

Query


query = "What is Python?"
query_embedding = model.encode(query, convert_to_tensor=True)

Find most similar


hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
print(hits)
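Conceptually, util.semantic_search is a top-k scan over cosine scores (the real implementation is vectorized and chunked). A pure-Python sketch of the idea, using hypothetical toy vectors in place of real embeddings:

```python
import math

def top_k_search(query_vec, corpus_vecs, top_k=3):
    """Rank corpus vectors by cosine similarity to the query (brute force)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    scored = [{"corpus_id": i, "score": cos(query_vec, v)}
              for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda h: h["score"], reverse=True)[:top_k]

corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-d "embeddings"
hits = top_k_search([1.0, 0.0], corpus_vecs, top_k=2)
print([h["corpus_id"] for h in hits])  # [0, 1]
```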

Similarity computation


python

Cosine similarity


similarity = util.cos_sim(embedding1, embedding2)

Dot product


similarity = util.dot_score(embedding1, embedding2)

Pairwise cosine similarity


similarities = util.cos_sim(embeddings, embeddings)

Batch encoding


python

Efficient batch processing


sentences = ["sentence 1", "sentence 2", ...] * 1000
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=False  # or True for PyTorch tensors
)
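batch_size controls how many sentences go through the model per forward pass; encode simply chunks the input list. A minimal sketch of that chunking (illustrative, not the library's code):

```python
def batched(items, batch_size):
    # yield successive chunks of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"sentence {i}" for i in range(100)]
sizes = [len(chunk) for chunk in batched(sentences, 32)]
print(sizes)  # [32, 32, 32, 4]
```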

Fine-tuning


python
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

Training data


train_examples = [
    InputExample(texts=['sentence 1', 'sentence 2'], label=0.8),
    InputExample(texts=['sentence 3', 'sentence 4'], label=0.3),
]
train_dataloader = DataLoader(train_examples, batch_size=16)

Loss function


train_loss = losses.CosineSimilarityLoss(model)
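CosineSimilarityLoss computes the cosine similarity of each pair's embeddings and regresses it against the gold label with mean-squared error. A scalar sketch of the per-pair objective (illustrative only, not the library's implementation):

```python
import math

def cosine_similarity_loss(emb_a, emb_b, label):
    # squared error between the predicted cosine similarity and the gold label
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(x * x for x in emb_b))
    cosine = dot / (norm_a * norm_b)
    return (cosine - label) ** 2

# identical embeddings with label 1.0 incur zero loss
print(cosine_similarity_loss([1.0, 0.0], [1.0, 0.0], 1.0))  # 0.0
```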

Train


model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100
)

Save


model.save('my-finetuned-model')

LangChain integration


python
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

Use with vector stores


from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

LlamaIndex integration


python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

from llama_index.core import Settings
Settings.embed_model = embed_model

Use in index


from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

Model selection guide


| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General, prototyping |
| all-mpnet-base-v2 | 768 | Medium | Better | Production RAG |
| all-roberta-large-v1 | 1024 | Slow | Best | High accuracy needed |
| paraphrase-multilingual | 768 | Medium | Good | Multilingual |

Best practices


  1. Start with all-MiniLM-L6-v2 - Good baseline
  2. Normalize embeddings - Better for cosine similarity
  3. Use GPU if available - 10× faster encoding
  4. Batch encoding - More efficient
  5. Cache embeddings - Expensive to recompute
  6. Fine-tune for domain - Improves quality
  7. Test different models - Quality varies by task
  8. Monitor memory - Large models need more RAM
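Practice 2 works because, for unit-length vectors, the dot product equals the cosine similarity, so normalized embeddings make the cheaper dot_score interchangeable with cos_sim (encode also accepts normalize_embeddings=True). A small dependency-free sketch:

```python
import math

def normalize(vec):
    # scale a vector to unit L2 norm
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
dot = sum(x * y for x, y in zip(a, b))
print(round(dot, 4))  # 0.96 -- equals the cosine similarity of the raw vectors
```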

Performance


| Model | Speed (sentences/sec) | Memory | Dimensions |
|---|---|---|---|
| MiniLM | ~2000 | 120 MB | 384 |
| MPNet | ~600 | 420 MB | 768 |
| RoBERTa | ~300 | 1.3 GB | 1024 |
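Throughput figures like these depend on hardware and sequence length, but are easy to reproduce with a timing loop. A sketch, where fake_encode is a hypothetical stand-in; in practice pass a real model.encode:

```python
import time

def throughput(encode_fn, sentences, warmup_runs=1):
    # sentences per second for one timed pass, after warm-up
    for _ in range(warmup_runs):
        encode_fn(sentences)  # warm-up (model load, kernel caches)
    start = time.perf_counter()
    encode_fn(sentences)
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed

fake_encode = lambda batch: [[0.0] * 384 for _ in batch]  # hypothetical stand-in
rate = throughput(fake_encode, ["a sentence"] * 1000)
print(f"{rate:.0f} sentences/sec")
```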

Resources
