When this skill is activated, always start your first response with the 🧢 emoji.

当激活本技能时，你的第一条回复请以🧢表情开头。

NLP Engineering

NLP工程

A practical framework for building production NLP systems. This skill covers the full stack of natural language processing - from raw text ingestion through tokenization, embedding, retrieval, classification, and generation - with an emphasis on making the right architectural choices at each layer. Designed for engineers who know Python and ML basics and need opinionated guidance on building reliable, scalable text processing pipelines.

一个用于构建生产级NLP系统的实用框架。本技能涵盖了自然语言处理的全流程——从原始文本摄入，到分词、嵌入、检索、分类和生成——重点在于在每个环节做出正确的架构选择。专为掌握Python和机器学习基础知识，需要构建可靠、可扩展文本处理流水线的工程师设计。

When to use this skill

何时使用本技能

Trigger this skill when the user:

Builds a text preprocessing or cleaning pipeline
Generates or stores embeddings for documents or queries
Implements semantic search or similarity-based retrieval
Classifies text into categories (sentiment, intent, topic, etc.)
Extracts named entities, relationships, or structured data from text
Summarizes long documents (extractive or abstractive)
Chunks documents for RAG (Retrieval-Augmented Generation) pipelines
Tunes tokenization strategies (BPE, wordpiece, whitespace)

Do NOT trigger this skill for:

Pure LLM prompt engineering or chain-of-thought with no text processing pipeline
Speech-to-text or image captioning (separate modalities with different toolchains)

当用户有以下需求时触发本技能：

构建文本预处理或清洗流水线
为文档或查询生成或存储嵌入向量
实现语义搜索或基于相似度的检索
将文本分类为不同类别（情感、意图、主题等）
从文本中提取命名实体、关系或结构化数据
对长文档进行摘要（抽取式或生成式）
为RAG（Retrieval-Augmented Generation，检索增强生成）流水线拆分文档块
调整分词策略（BPE、WordPiece、按空白字符分词）

以下场景请勿触发本技能：

仅进行LLM提示词工程或思维链推理，无文本处理流水线
语音转文字或图像字幕生成（属于不同模态，使用不同工具链）

Key principles

核心原则

Preprocessing is load-bearing - Garbage in, garbage out. Inconsistent casing, stray HTML, and unicode noise degrade every downstream component. Invest in a reproducible cleaning pipeline before touching a model.
Match the model to the task - A 66M-parameter sentence-transformer is often better than GPT-4 embeddings for a narrow domain retrieval task, and 100x cheaper. Pick the smallest model that hits your quality bar.
Embed offline, search online - Pre-compute embeddings at index time. Doing embedding + vector search in the request path is an avoidable latency sink. Only re-embed at write time (new docs) or on model upgrade.
Chunk with overlap, not just length - Fixed-length chunking without overlap splits sentences at boundaries and degrades retrieval recall. Always use a sliding window with 10-20% overlap and respect sentence boundaries.
Evaluate before you ship - Define offline metrics (precision@k, NDCG, ROUGE, F1) before building. An NLP system without evals is a system you cannot improve or regress-test.

预处理是基础 - 垃圾进，垃圾出。大小写不一致、残留HTML标签和Unicode噪声会降低所有下游组件的性能。在接触模型之前，先投入精力构建可复现的清洗流水线。
为任务匹配合适的模型 - 对于特定领域的检索任务，66M参数的sentence-transformer往往比GPT-4的嵌入效果更好，且成本低100倍。选择能满足质量要求的最小模型。
离线生成嵌入，在线搜索 - 在索引阶段预先计算嵌入向量。在请求路径中同时进行嵌入生成和向量搜索会造成不必要的延迟。仅在写入新文档或升级模型时重新生成嵌入。
带重叠拆分文档块，而非仅按长度 - 无重叠的固定长度拆分可能会在句子边界处分割文本，降低检索召回率。始终使用带有10-20%重叠的滑动窗口，并尊重句子边界。
上线前先评估 - 在构建系统前定义离线指标（precision@k、NDCG、ROUGE、F1）。没有评估指标的NLP系统无法进行优化或回归测试。

Core concepts

核心概念

Tokenization

分词

Tokenization converts raw text into a sequence of tokens a model can process. Modern models use subword tokenizers (BPE, WordPiece, SentencePiece) rather than whitespace splitting, allowing them to handle out-of-vocabulary words gracefully by decomposing them into known subword units.

Key considerations: token budget (LLMs have context windows), language coverage (multilingual text needs a multilingual tokenizer), and domain vocabulary (medical/legal/code text may have poor tokenization with general-purpose tokenizers).

分词将原始文本转换为模型可处理的token序列。现代模型使用子词分词器（BPE、WordPiece、SentencePiece）而非按空白字符分词，通过将未登录词分解为已知子词单元，优雅地处理未登录词。

关键考虑因素：token预算（LLM有上下文窗口限制）、语言覆盖范围（多语言文本需要多语言分词器）、领域词汇（医疗/法律/代码文本使用通用分词器可能效果不佳）。

Embeddings

嵌入向量

An embedding is a dense vector representation of text that encodes semantic meaning. Similar texts produce vectors with high cosine similarity. Embeddings are the foundation of semantic search, clustering, and classification.

Two categories: encoding models (sentence-transformers, E5, BGE) are fast, cheap, and purpose-built for retrieval. LLM embeddings (OpenAI

text-embedding-3

, Cohere Embed) are convenient API calls but cost money per token and introduce external latency.

嵌入向量是文本的密集向量表示，编码了语义信息。相似的文本会生成余弦相似度高的向量。嵌入向量是语义搜索、聚类和分类的基础。

分为两类：编码模型（sentence-transformers、E5、BGE）速度快、成本低，专为检索任务设计。LLM嵌入（OpenAI

text-embedding-3

、Cohere Embed）是便捷的API调用，但按token收费且会引入外部延迟。

Attention and transformers

注意力机制与Transformer

Transformers process the full token sequence in parallel using self-attention, letting every token attend to every other token. This gives transformer-based models long-range context understanding that recurrent models lacked. For NLP tasks, you almost never need to implement attention from scratch - use HuggingFace

transformers

and fine-tune a pretrained checkpoint.

Transformer使用自注意力机制并行处理整个token序列，让每个token都能关注到其他所有token。这使得基于Transformer的模型具备循环模型所没有的长上下文理解能力。对于NLP任务，你几乎不需要从零实现注意力机制——使用HuggingFace

transformers

库并微调预训练模型即可。

Vector similarity

向量相似度

Three distance metrics dominate:

Metric	Formula (conceptual)	Best for
Cosine similarity	angle between vectors	Normalized embeddings, most retrieval
Dot product	magnitude + angle	When vector magnitude carries information
Euclidean distance	straight-line distance	Rare; prefer cosine for NLP

Most vector stores (Pinecone, Weaviate, pgvector, FAISS) default to cosine or dot product. Normalize your embeddings before storing them to make cosine and dot product equivalent.

三种主要的距离指标：

指标	概念公式	适用场景
余弦相似度	向量之间的夹角	归一化嵌入向量，大多数检索场景
点积	向量幅度+夹角	当向量幅度携带信息时
欧氏距离	直线距离	罕见；NLP场景优先选择余弦相似度

大多数向量存储（Pinecone、Weaviate、pgvector、FAISS）默认使用余弦相似度或点积。存储前先归一化嵌入向量，可使余弦相似度和点积结果等价。

Common tasks

常见任务

Text preprocessing pipeline

文本预处理流水线

Build a reproducible cleaning pipeline before any modeling step. Apply in this order: decode -> strip HTML -> normalize unicode -> lowercase -> remove noise -> normalize whitespace.

python

import re
import unicodedata
from bs4 import BeautifulSoup

def preprocess(text: str, lowercase: bool = True) -> str:
    # 1. Decode HTML entities and strip tags
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")

    # 2. Normalize unicode (NFD -> NFC, remove combining chars if needed)
    text = unicodedata.normalize("NFC", text)

    # 3. Lowercase
    if lowercase:
        text = text.lower()

    # 4. Remove URLs, emails, special tokens
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)

    # 5. Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

在进行任何建模步骤之前，先构建一个可复现的清洗流水线。按以下顺序执行：解码 -> 去除HTML标签 -> 归一化Unicode -> 转为小写 -> 移除噪声 -> 归一化空白字符。

python

import re
import unicodedata
from bs4 import BeautifulSoup

def preprocess(text: str, lowercase: bool = True) -> str:
    # 1. Decode HTML entities and strip tags
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")

    # 2. Normalize unicode (NFD -> NFC, remove combining chars if needed)
    text = unicodedata.normalize("NFC", text)

    # 3. Lowercase
    if lowercase:
        text = text.lower()

    # 4. Remove URLs, emails, special tokens
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)

    # 5. Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

Usage

clean = preprocess("<p>Visit https://example.com for more info.</p>")

-> "visit for more info."


> Persist the preprocessing config (lowercase flag, regex patterns) alongside
> your model so training and inference use identical transformations.


> 请将预处理配置（小写转换开关、正则表达式模式）与模型一起保存，确保训练和推理使用完全相同的转换逻辑。

Generate embeddings

生成嵌入向量

Use

sentence-transformers

for local, cost-free embeddings or the OpenAI API for convenience. Always batch your calls.

python

undefined

使用

sentence-transformers

进行本地免费嵌入，或使用OpenAI API以获取便捷性。请始终批量调用接口。

python

undefined

Option A: sentence-transformers (local, free, fast on GPU)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = ["The quick brown fox", "Machine learning is fun", "NLP rocks"]

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = ["The quick brown fox", "Machine learning is fun", "NLP rocks"]

encode() handles batching internally; show_progress_bar for large corpora

embeddings = model.encode(documents, normalize_embeddings=True, show_progress_bar=True)

-> numpy array, shape (3, 384)

Option B: OpenAI embeddings API

from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]: # Strip newlines - they degrade embedding quality per OpenAI docs texts = [t.replace("\n", " ") for t in texts] response = client.embeddings.create(input=texts, model=model) return [item.embedding for item in response.data]

undefined

from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]: # Strip newlines - they degrade embedding quality per OpenAI docs texts = [t.replace("\n", " ") for t in texts] response = client.embeddings.create(input=texts, model=model) return [item.embedding for item in response.data]

undefined

Build semantic search

构建语义搜索

Index embeddings into a vector store and retrieve by cosine similarity at query time. This example uses FAISS for local search and pgvector for PostgreSQL.

python

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

将嵌入向量索引到向量存储中，查询时通过余弦相似度进行检索。本示例使用FAISS进行本地搜索，使用pgvector进行PostgreSQL检索。

python

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

--- Indexing ---

docs = ["Python is a programming language.", "The Eiffel Tower is in Paris.", ...] doc_embeddings = model.encode(docs, normalize_embeddings=True).astype("float32")

Inner product on normalized vectors = cosine similarity

index = faiss.IndexFlatIP(doc_embeddings.shape[1]) index.add(doc_embeddings)

--- Retrieval ---

def search(query: str, top_k: int = 5) -> list[tuple[str, float]]: q_emb = model.encode([query], normalize_embeddings=True).astype("float32") scores, indices = index.search(q_emb, top_k) return [(docs[i], float(scores[0][j])) for j, i in enumerate(indices[0])]

results = search("programming languages for data science")

def search(query: str, top_k: int = 5) -> list[tuple[str, float]]: q_emb = model.encode([query], normalize_embeddings=True).astype("float32") scores, indices = index.search(q_emb, top_k) return [(docs[i], float(scores[0][j])) for j, i in enumerate(indices[0])]

results = search("programming languages for data science")

-> [("Python is a programming language.", 0.87), ...]


> For production, use `faiss.IndexIVFFlat` (approximate, faster) or a managed
> vector store (pgvector, Pinecone, Weaviate) rather than exact `IndexFlatIP`.


> 生产环境中，请使用`faiss.IndexIVFFlat`（近似检索，速度更快）或托管向量存储（pgvector、Pinecone、Weaviate），而非精确检索的`IndexFlatIP`。

Text classification with transformers

基于Transformer的文本分类

Fine-tune a pretrained encoder for sequence classification. HuggingFace

transformers

+

datasets

is the standard stack.

python

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import torch

MODEL_ID = "distilbert-base-uncased"
LABELS = ["negative", "neutral", "positive"]
id2label = {i: l for i, l in enumerate(LABELS)}
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

微调预训练编码器进行序列分类。HuggingFace

transformers

+

datasets

是标准工具栈。

python

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import torch

MODEL_ID = "distilbert-base-uncased"
LABELS = ["negative", "neutral", "positive"]
id2label = {i: l for i, l in enumerate(LABELS)}
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data: list of {"text": str, "label": int}

train_ds = Dataset.from_list(train_data).map(tokenize, batched=True)

args = TrainingArguments( output_dir="./sentiment-model", num_train_epochs=3, per_device_train_batch_size=32, evaluation_strategy="epoch", save_strategy="best", load_best_model_at_end=True, )

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds) trainer.train()


> Use `distilbert` or `roberta-base` for most classification tasks. Only
> escalate to larger models if the smaller ones underperform after fine-tuning.

train_ds = Dataset.from_list(train_data).map(tokenize, batched=True)

args = TrainingArguments( output_dir="./sentiment-model", num_train_epochs=3, per_device_train_batch_size=32, evaluation_strategy="epoch", save_strategy="best", load_best_model_at_end=True, )

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds) trainer.train()


> 大多数分类任务使用`distilbert`或`roberta-base`即可。只有在小模型微调后效果仍不达标时，再考虑使用更大的模型。

NER pipeline

NER流水线

Use spaCy for fast rule-augmented NER or a HuggingFace token classification model for custom entity types.

python

import spacy
from transformers import pipeline

使用spaCy进行快速的规则增强型NER，或使用HuggingFace的token分类模型处理自定义实体类型。

python

import spacy
from transformers import pipeline

Option A: spaCy (fast, battle-tested for standard entities)

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[dict]: doc = nlp(text) return [ {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char} for ent in doc.ents ]

entities = extract_entities("Apple Inc. was founded by Steve Jobs in Cupertino.")

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[dict]: doc = nlp(text) return [ {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char} for ent in doc.ents ]

entities = extract_entities("Apple Inc. was founded by Steve Jobs in Cupertino.")

-> [{"text": "Apple Inc.", "label": "ORG", ...}, {"text": "Steve Jobs", "label": "PERSON", ...}]

Option B: HuggingFace token classification (custom entities, higher accuracy)

ner = pipeline( "token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple", # merges B-/I- tokens into spans ) results = ner("OpenAI released GPT-4 in San Francisco.")

undefined

ner = pipeline( "token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple", # merges B-/I- tokens into spans ) results = ner("OpenAI released GPT-4 in San Francisco.")

undefined

Extractive and abstractive summarization

抽取式与生成式摘要

Choose extractive for faithfulness (no hallucination risk) and abstractive for fluency.

python

undefined

追求忠实度（无幻觉风险）选择抽取式摘要，追求流畅度选择生成式摘要。

python

undefined

--- Extractive: rank sentences by TF-IDF centrality ---

from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics.pairwise import cosine_similarity import numpy as np

def extractive_summary(text: str, n_sentences: int = 3) -> str: sentences = [s.strip() for s in text.split(".") if s.strip()] tfidf = TfidfVectorizer().fit_transform(sentences) sim_matrix = cosine_similarity(tfidf) scores = sim_matrix.sum(axis=1) top_indices = np.argsort(scores)[-n_sentences:][::-1] return ". ".join(sentences[i] for i in sorted(top_indices)) + "."

from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics.pairwise import cosine_similarity import numpy as np

def extractive_summary(text: str, n_sentences: int = 3) -> str: sentences = [s.strip() for s in text.split(".") if s.strip()] tfidf = TfidfVectorizer().fit_transform(sentences) sim_matrix = cosine_similarity(tfidf) scores = sim_matrix.sum(axis=1) top_indices = np.argsort(scores)[-n_sentences:][::-1] return ". ".join(sentences[i] for i in sorted(top_indices)) + "."

--- Abstractive: seq2seq model ---

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def abstractive_summary(text: str, max_length: int = 130) -> str: # BART has a 1024-token context window - chunk long documents first result = summarizer(text, max_length=max_length, min_length=30, do_sample=False) return result[0]["summary_text"]

undefined

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def abstractive_summary(text: str, max_length: int = 130) -> str: # BART has a 1024-token context window - chunk long documents first result = summarizer(text, max_length=max_length, min_length=30, do_sample=False) return result[0]["summary_text"]

undefined

Chunking strategies for long documents

长文档拆分策略

Chunking is critical for RAG quality. Poor chunking is the single most common cause of poor retrieval recall.

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[dict]:
    """
    Recursive splitter tries paragraph -> sentence -> word boundaries in order.
    chunk_overlap ensures context continuity across chunk boundaries.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_text(text)
    return [{"text": chunk, "chunk_index": i, "total_chunks": len(chunks)} for i, chunk in enumerate(chunks)]

文档拆分对RAG的质量至关重要。拆分不当是检索召回率低的最常见原因。

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[dict]:
    """
    Recursive splitter tries paragraph -> sentence -> word boundaries in order.
    chunk_overlap ensures context continuity across chunk boundaries.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_text(text)
    return [{"text": chunk, "chunk_index": i, "total_chunks": len(chunks)} for i, chunk in enumerate(chunks)]

Semantic chunking (group sentences by embedding similarity instead of length)

from langchain_experimental.text_splitter import SemanticChunker from langchain_openai.embeddings import OpenAIEmbeddings

semantic_splitter = SemanticChunker( OpenAIEmbeddings(), breakpoint_threshold_type="percentile", # split where similarity drops sharply breakpoint_threshold_amount=95, ) semantic_chunks = semantic_splitter.create_documents([text])


> Rule of thumb: chunk_size 256-512 tokens for precise retrieval, 512-1024 for
> richer context. Always store chunk metadata (source doc ID, page, position)
> alongside the embedding.

---

from langchain_experimental.text_splitter import SemanticChunker from langchain_openai.embeddings import OpenAIEmbeddings

semantic_splitter = SemanticChunker( OpenAIEmbeddings(), breakpoint_threshold_type="percentile", # split where similarity drops sharply breakpoint_threshold_amount=95, ) semantic_chunks = semantic_splitter.create_documents([text])


> 经验法则：精确检索使用256-512 token的块大小，需要更丰富上下文时使用512-1024 token。请始终将块元数据（源文档ID、页码、位置）与嵌入向量一起存储。

---

Anti-patterns / common mistakes

反模式/常见错误

Mistake	Why it's wrong	What to do instead
Embedding raw HTML or markdown	Markup tokens poison the semantic space	Strip all markup in preprocessing before embedding
Fixed-size chunks with no overlap	Splits sentences at boundaries, breaks coherence	Use recursive splitter with 10-20% overlap
Re-embedding at query time if corpus is static	Unnecessary latency on every request	Pre-compute all embeddings offline; embed only on writes
Using Euclidean distance for text similarity	Less meaningful than cosine for high-dimensional sparse-ish vectors	Normalize embeddings and use cosine/dot product
Fine-tuning a large model before trying a small pretrained one	Expensive, slow, often unnecessary	Benchmark a frozen small model first; fine-tune only if quality gap exists
Ignoring tokenizer mismatch between training and inference	Token boundaries differ, degrading model accuracy	Use the same tokenizer class and vocab for train and serve

错误做法	问题所在	正确做法
直接嵌入原始HTML或Markdown	标记token会污染语义空间	嵌入前先在预处理阶段移除所有标记
无重叠的固定大小拆分	在句子边界分割文本，破坏连贯性	使用带10-20%重叠的递归拆分器
静态语料库仍在查询时生成嵌入	每个请求都产生不必要的延迟	离线预计算所有嵌入；仅在写入新文档时生成嵌入
文本相似度使用欧氏距离	对于高维稀疏向量，欧氏距离的意义不如余弦相似度	归一化嵌入向量并使用余弦相似度/点积
未尝试小预训练模型就直接微调大模型	成本高、速度慢，通常无必要	先测试冻结的小模型；仅当存在明显质量差距时再微调
忽略训练与推理时的分词器不匹配	Token边界不同，降低模型精度	训练和部署时使用相同的分词器类和词汇表

References

参考资料

For detailed comparison tables and implementation guidance on specific topics, read the relevant file from the

references/

folder:

```
references/embedding-models.md
```
- comparison of OpenAI, Cohere, sentence-transformers, E5, BGE with dimensions, benchmarks, and cost

Only load a references file if the current task requires it - they are long and will consume context.

如需特定主题的详细对比表和实现指南，请阅读

references/

文件夹中的相关文件：

```
references/embedding-models.md
```
- 对比OpenAI、Cohere、sentence-transformers、E5、BGE的维度、基准测试结果和成本

仅在当前任务需要时加载参考文件——这些文件篇幅较长，会占用上下文空间。

nlp-engineering

Original

Translation

NLP Engineering

NLP工程

When to use this skill

何时使用本技能

Key principles

核心原则

Core concepts

核心概念

Tokenization

分词

Embeddings

嵌入向量

Attention and transformers

注意力机制与Transformer

Vector similarity

向量相似度

Common tasks

常见任务

Text preprocessing pipeline

文本预处理流水线

Usage

Usage

-> "visit for more info."

-> "visit for more info."

Generate embeddings

生成嵌入向量

Option A: sentence-transformers (local, free, fast on GPU)

Option A: sentence-transformers (local, free, fast on GPU)

encode() handles batching internally; show_progress_bar for large corpora

encode() handles batching internally; show_progress_bar for large corpora

-> numpy array, shape (3, 384)

-> numpy array, shape (3, 384)

Option B: OpenAI embeddings API

Option B: OpenAI embeddings API

Build semantic search

构建语义搜索

--- Indexing ---

--- Indexing ---

Inner product on normalized vectors = cosine similarity

Inner product on normalized vectors = cosine similarity

--- Retrieval ---

--- Retrieval ---

-> [("Python is a programming language.", 0.87), ...]

-> [("Python is a programming language.", 0.87), ...]

Text classification with transformers

基于Transformer的文本分类

train_data: list of {"text": str, "label": int}

train_data: list of {"text": str, "label": int}

NER pipeline

NER流水线

Option A: spaCy (fast, battle-tested for standard entities)

Option A: spaCy (fast, battle-tested for standard entities)

-> [{"text": "Apple Inc.", "label": "ORG", ...}, {"text": "Steve Jobs", "label": "PERSON", ...}]

-> [{"text": "Apple Inc.", "label": "ORG", ...}, {"text": "Steve Jobs", "label": "PERSON", ...}]

Option B: HuggingFace token classification (custom entities, higher accuracy)

Option B: HuggingFace token classification (custom entities, higher accuracy)

Extractive and abstractive summarization

抽取式与生成式摘要

--- Extractive: rank sentences by TF-IDF centrality ---

--- Extractive: rank sentences by TF-IDF centrality ---

--- Abstractive: seq2seq model ---

--- Abstractive: seq2seq model ---

Chunking strategies for long documents

长文档拆分策略

Semantic chunking (group sentences by embedding similarity instead of length)

Semantic chunking (group sentences by embedding similarity instead of length)

Anti-patterns / common mistakes

反模式/常见错误

References

参考资料

Related skills

相关技能