algo-nlp-ner
Implement Named Entity Recognition to identify and classify entities in text. Use this skill when the user needs to extract people, organizations, locations, dates, or custom entities from documents — even if they say 'extract names from text', 'find companies mentioned', or 'entity extraction'.
NPX Install
npx skill4agent add asgard-ai-platform/skills algo-nlp-ner
Named Entity Recognition
Overview
NER identifies and classifies named entities in text into predefined categories (Person, Organization, Location, Date, Money, etc.). Approaches: rule-based (regex, gazetteers), statistical (CRF), and neural (BiLSTM-CRF, transformer-based). Modern NER typically uses spaCy or Hugging Face models, which reach F1 scores of roughly 85-95% on in-domain benchmark text.
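To make the rule-based approach concrete, here is a minimal sketch that combines a gazetteer lookup with one regex. The `GAZETTEER` contents, the `DATE_PATTERN` regex, and the `rule_based_ner` name are illustrative assumptions, not part of this skill; a real system would need far larger lists and more patterns.

```python
import re

# Hypothetical gazetteer: known entity names mapped to entity types.
GAZETTEER = {
    "Apple": "ORG",
    "Tim Cook": "PER",
    "Taipei": "LOC",
}

# Illustrative pattern for dates written as month name plus day ("March 15").
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2}\b"
)


def rule_based_ner(text):
    """Return (start, end, text, type) tuples found by gazetteer and regex."""
    found = []
    for name, label in GAZETTEER.items():
        # Word boundaries avoid matching "Apple" inside "Applesauce".
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.end(), m.group(), label))
    for m in DATE_PATTERN.finditer(text):
        found.append((m.start(), m.end(), m.group(), "DATE"))
    return sorted(found)  # ordered by start offset


sentence = "Tim Cook announced that Apple will open a new store in Taipei on March 15."
entities = rule_based_ner(sentence)
for start, end, surface, label in entities:
    print(f"{surface}/{label} [{start}:{end}]")
```

Rules like these are precise on the names they list but miss anything unlisted, which is one reason statistical and neural models are usually preferred for open-ended text.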
When to Use
Trigger conditions:
- Extracting structured entities from unstructured text
- Building knowledge graphs from documents
- Preprocessing for information retrieval or question answering
When NOT to use:
- For text classification (categorizing whole documents, not extracting entities)
- For relation extraction between entities (need additional RE model)
Algorithm
IRON LAW: NER Performance Depends on DOMAIN Match
A model trained on news text (OntoNotes) performs poorly on medical
records or legal documents. Domain-specific entities (drug names,
legal citations, product SKUs) require domain-specific training data
or fine-tuning. Always evaluate on YOUR domain's data.
Phase 1: Input Validation
Determine: target entity types (standard: PER, ORG, LOC, DATE, MONEY or custom), input language, domain. Select appropriate pre-trained model or prepare training data.
Gate: Entity types defined, model or training data available.
Phase 2: Core Algorithm
Pre-trained model approach:
- Load model (spaCy, Hugging Face NER pipeline)
- Process text through the pipeline
- Extract entity spans with type labels and confidence scores
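The pre-trained steps above can be sketched with spaCy as follows. The function names `extract_entities` and `load_spacy_pipeline` are hypothetical, and the sketch assumes spaCy and the named model are installed. Note that spaCy's default NER component does not expose a per-entity score, so `confidence` is left as `None` here; a Hugging Face token-classification pipeline would be one way to obtain scores.

```python
def extract_entities(nlp, text):
    """Run a spaCy-style pipeline and return one dict per entity.

    `nlp` is any callable whose result has an `ents` attribute with items
    exposing `text`, `label_`, `start_char`, and `end_char`, as spaCy's
    `Doc` and `Span` objects do.
    """
    doc = nlp(text)
    return [
        {
            "text": ent.text,
            "type": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            # Assumption: no score available from the default spaCy NER.
            "confidence": None,
        }
        for ent in doc.ents
    ]


def load_spacy_pipeline(model_name="en_core_web_trf"):
    """Load a spaCy pipeline; requires spaCy and the model to be installed."""
    import spacy  # imported lazily so this module loads without spaCy

    return spacy.load(model_name)
```

A typical call would be `extract_entities(load_spacy_pipeline(), text)`. spaCy's English models use OntoNotes-style labels such as `PERSON` and `GPE`, so a small mapping step may be needed if the target scheme is PER/ORG/LOC.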
Fine-tuning approach:
- Annotate 200+ domain-specific examples in BIO format
- Fine-tune transformer model (BERT, RoBERTa) on annotated data
- Evaluate on held-out test set
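As an illustration of the BIO format used for annotation, the sketch below converts token-level entity spans into one tag per token. The helper name `spans_to_bio` and the example sentence are assumptions made for this example.

```python
def spans_to_bio(tokens, spans):
    """Convert token-level entity spans to BIO tags.

    tokens: list of token strings.
    spans: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)  # O = outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # B = first token of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # I = continuation token
    return tags


tokens = ["Tim", "Cook", "visited", "New", "York", "City", "."]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain per-token type label could not do.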
Phase 3: Verification
Evaluate: precision, recall, F1 per entity type. Check: boundary detection (exact span match) and type classification accuracy.
Gate: F1 > 0.80 per entity type on domain-relevant test data.
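A minimal sketch of this verification step, assuming entities are represented as `(start, end, type)` tuples and that a prediction counts only when both span and type match exactly. The names `per_type_scores` and `passes_gate` are hypothetical.

```python
from collections import defaultdict


def per_type_scores(gold, predicted):
    """Exact-match precision, recall, and F1 for each entity type."""
    gold_set, pred_set = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ent in pred_set:
        counts[ent[2]]["tp" if ent in gold_set else "fp"] += 1
    for ent in gold_set - pred_set:
        counts[ent[2]]["fn"] += 1

    scores = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores


def passes_gate(scores, threshold=0.80):
    """Apply the gate: every entity type must have F1 above the threshold."""
    return all(s["f1"] > threshold for s in scores.values())


gold = [(0, 8, "PER"), (24, 29, "ORG"), (55, 61, "LOC")]
predicted = [(0, 8, "PER"), (24, 29, "ORG"), (55, 58, "LOC")]  # LOC span cut short
scores = per_type_scores(gold, predicted)
print(scores, passes_gate(scores))
```

Under exact matching, the truncated LOC span counts as both a false positive and a false negative, so a single boundary error is enough to fail the gate for that type.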
Phase 4: Output
Return extracted entities with types, positions, and confidence.
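One way to assemble that return value is sketched below, following the entities-plus-metadata layout this skill uses for its output. The `build_output` helper is hypothetical and the confidence values are made up for illustration.

```python
import json
from collections import Counter


def build_output(entities, model_name):
    """Assemble entity dicts and summary metadata into a JSON-ready dict."""
    type_counts = Counter(e["type"] for e in entities)
    return {
        "entities": entities,
        "metadata": {
            "model": model_name,
            "entities_found": len(entities),
            "types": dict(type_counts),
        },
    }


# Illustrative entities; offsets are character positions in the source text.
entities = [
    {"text": "Tim Cook", "type": "PER", "start": 0, "end": 8, "confidence": 0.98},
    {"text": "Apple", "type": "ORG", "start": 24, "end": 29, "confidence": 0.95},
]
output = build_output(entities, "en_core_web_trf")
print(json.dumps(output, indent=2))
```

Deriving `entities_found` and `types` from the entity list, rather than tracking them separately, keeps the metadata consistent with the entities actually returned.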
Output Format
```json
{
  "entities": [{"text": "Apple Inc.", "type": "ORG", "start": 0, "end": 10, "confidence": 0.95}],
  "metadata": {"model": "en_core_web_trf", "entities_found": 15, "types": {"PER": 5, "ORG": 6, "LOC": 4}}
}
```
Examples
Sample I/O
Input: "Tim Cook announced that Apple will open a new store in Taipei on March 15."
Expected: [Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE]
Edge Cases
| Input | Expected | Why |
|---|---|---|
| "Apple" (no context) | Ambiguous (fruit or company) | Context-dependent entity typing |
| Nested entities | Depends on scheme | "Bank of America" = ORG, "America" = LOC within |
| Misspelled entity | May miss | "Appel" not in training data |
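For the nested-entity row, one possible flat-scheme policy is to keep only the longest span when candidates overlap. This is a sketch of that single policy, not the only valid choice; the `keep_longest_spans` name is an assumption, and a nested annotation scheme would keep both spans instead.

```python
def keep_longest_spans(spans):
    """Drop any span that overlaps a longer span already kept.

    spans: list of (start, end, type) tuples with character offsets.
    """
    kept = []
    # Consider longest spans first; ties go to the earlier start.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end = span[0], span[1]
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# "Bank of America" (ORG) contains "America" (LOC) at offsets 8-15.
nested = [(0, 15, "ORG"), (8, 15, "LOC")]
flat = keep_longest_spans(nested)
print(flat)
```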
Gotchas
- Boundary errors: NER often gets the entity type right but the span wrong ("New" vs "New York City"). Evaluate with both exact and partial match metrics.
- Ambiguity: "Jordan" can be a person, country, or brand. Context-dependent disambiguation is hard; some models output the most likely type.
- Chinese/Japanese NER: No whitespace tokenization makes boundary detection harder. Use language-specific tokenizers (jieba for Chinese).
- Annotation consistency: Training data quality is critical. Inconsistent annotations (sometimes labeling "Dr." as part of name, sometimes not) degrade model performance.
- Entity linking: NER identifies mentions; entity linking resolves them to knowledge base entries. "Apple" → Apple Inc. (Q312) or apple (fruit). These are separate tasks.
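To illustrate the boundary-error gotcha, the sketch below counts exact and partial matches side by side. Here a partial match is assumed to mean same type plus any character overlap, which is only one of several definitions in use; `match_counts` is a hypothetical helper.

```python
def match_counts(gold, predicted):
    """Count exact and partial matches between gold and predicted spans.

    Spans are (start, end, type) tuples. Exact matches also count as
    partial, since identical spans overlap.
    """
    exact = sum(1 for p in predicted if p in gold)
    partial = sum(
        1
        for p in predicted
        if any(p[2] == g[2] and p[0] < g[1] and g[0] < p[1] for g in gold)
    )
    return {"exact": exact, "partial": partial, "predicted": len(predicted)}


# The gold span covers "New York City"; the prediction stops after "New".
gold = [(10, 23, "LOC")]
predicted = [(10, 13, "LOC")]
counts = match_counts(gold, predicted)
print(counts)
```

A large gap between the partial and exact counts suggests the model finds the right entities but draws their boundaries wrong, which points to tokenization or annotation-guideline issues rather than type confusion.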
References
- For BIO annotation format and guidelines, see references/bio-annotation.md
- For fine-tuning NER with transformers, see references/transformer-ner.md