# LangExtract - Structured Information Extraction

Expert assistance for extracting structured, source-grounded information from unstructured text using large language models.

## When to Use This Skill

Use this skill when you need to:

- Extract structured entities from unstructured text (medical notes, reports, documents)
- Maintain precise source grounding (map extracted data to original text locations)
- Process long documents beyond LLM token limits
- Visualize extraction results with interactive HTML highlighting
- Extract clinical information from medical records
- Structure radiology or pathology reports
- Extract medications, diagnoses, or symptoms from clinical notes
- Analyze literary texts for characters, emotions, and relationships
- Build domain-specific extraction pipelines
- Work with Gemini, OpenAI, or local models (Ollama)
- Generate schema-compliant outputs without fine-tuning

## Overview

LangExtract is a Python library by Google for extracting structured information from unstructured text using large language models. It emphasizes:

- **Source Grounding**: Every extraction maps to its exact location in the source text
- **Structured Outputs**: Schema-compliant results with controlled generation
- **Long Document Processing**: Intelligent chunking and multi-pass extraction
- **Interactive Visualization**: Self-contained HTML for reviewing extractions in context
- **Flexible LLM Support**: Works with Gemini, OpenAI, and local models
- **Few-Shot Learning**: Requires only quality examples, no expensive fine-tuning

**Key Resources:**

## Installation

### Prerequisites

- Python 3.8 or higher
- An API key for Gemini (AI Studio) or OpenAI, or a local Ollama setup

### Basic Installation

```bash
# Install from PyPI (recommended)
pip install langextract

# Install with OpenAI support
pip install langextract[openai]

# Install with development tools
pip install langextract[dev]
```

### Install from Source

```bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e .

# For development with testing
pip install -e ".[test]"
```

### Docker Installation

```bash
# Build the Docker image
docker build -t langextract .

# Run with an API key
docker run --rm \
  -e LANGEXTRACT_API_KEY="your-api-key" \
  langextract python your_script.py
```

### API Key Setup

**Gemini (Google AI Studio):**

```bash
export LANGEXTRACT_API_KEY="your-gemini-api-key"
```

Get keys from: https://ai.google.dev/

**OpenAI:**

```bash
export OPENAI_API_KEY="your-openai-api-key"
```

**Vertex AI (Enterprise):**

- Use service account authentication
- Set the project in `language_model_params`

**.env File (Development):**

```bash
# Create a .env file
echo "LANGEXTRACT_API_KEY=your-key-here" > .env
```
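The `.env` file can then be loaded at process startup. The `python-dotenv` package is the usual choice for this; purely as an illustration, a minimal dependency-free loader (`load_env` is a hypothetical helper written here, not a langextract API) might look like:

```python
import os

def load_env(path=".env"):
    """Read KEY=VALUE lines into os.environ, skipping blanks and comments.

    Hypothetical helper for illustration; python-dotenv's load_dotenv()
    provides the same behavior with more edge-case handling.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: variables already exported in the shell win
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env()`, langextract picks up `LANGEXTRACT_API_KEY` from the environment as usual.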

## Quick Start

### Basic Extraction Example

```python
import langextract as lx
import textwrap

# 1. Define the extraction task
prompt = textwrap.dedent("""
    Extract all medications mentioned in the clinical note.
    Include medication name, dosage, and frequency.
    Use exact text from the document.""")

# 2. Provide examples (few-shot learning)
examples = [
    lx.data.ExampleData(
        text="Patient prescribed Lisinopril 10mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Lisinopril 10mg daily",
                attributes={
                    "name": "Lisinopril",
                    "dosage": "10mg",
                    "frequency": "daily",
                    "indication": "hypertension"
                }
            )
        ]
    )
]

# 3. Input text to extract from
input_text = """
Patient continues on Metformin 500mg twice daily for diabetes management.
Started on Amlodipine 5mg once daily for blood pressure control.
Discontinued Aspirin 81mg due to side effects.
"""

# 4. Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp"
)

# 5. Access the results
for extraction in result.extractions:
    print(f"Medication: {extraction.extraction_text}")
    print(f"  Name: {extraction.attributes.get('name')}")
    print(f"  Dosage: {extraction.attributes.get('dosage')}")
    print(f"  Frequency: {extraction.attributes.get('frequency')}")
    print(f"  Location: {extraction.start_char}-{extraction.end_char}")
    print()

# 6. Save and visualize
lx.io.save_annotated_documents(
    [result],
    output_name="medications.jsonl",
    output_dir="."
)
html_content = lx.visualize("medications.jsonl")
with open("medications.html", "w") as f:
    f.write(html_content)
```

### Literary Text Example

```python
import langextract as lx

prompt = """Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities."""

examples = [
    lx.data.ExampleData(
        text="ROMEO entered the garden, filled with wonder at JULIET's beauty.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="JULIET",
                attributes={}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="ROMEO ... JULIET's beauty",
                attributes={
                    "subject": "ROMEO",
                    "relation": "admires",
                    "object": "JULIET"
                }
            )
        ]
    )
]

text = """Act 2, Scene 2: The Capulet's orchard.
ROMEO appears beneath JULIET's balcony, gazing upward with longing.
JULIET steps onto the balcony, unaware of ROMEO's presence below."""

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp"
)
```

## Core Concepts

### 1. Extraction Classes

Define the categories of entities to extract:

```python
# A single class
extraction_class="medication"

# Multiple classes, defined via examples
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(extraction_class="diagnosis", ...),
            lx.data.Extraction(extraction_class="symptom", ...),
            lx.data.Extraction(extraction_class="medication", ...)
        ]
    )
]
```

### 2. Source Grounding

Every extraction includes its precise location in the text:

```python
extraction = result.extractions[0]
print(f"Text: {extraction.extraction_text}")
print(f"Start: {extraction.start_char}")
print(f"End: {extraction.end_char}")

# Recover the span from the original document
original_text = input_text[extraction.start_char:extraction.end_char]
```

### 3. Attributes

Add structured metadata to extractions:

```python
lx.data.Extraction(
    extraction_class="medication",
    extraction_text="Lisinopril 10mg daily",
    attributes={
        "name": "Lisinopril",
        "dosage": "10mg",
        "frequency": "daily",
        "route": "oral",
        "indication": "hypertension"
    }
)
```

### 4. Few-Shot Learning

Provide 1-5 quality examples instead of fine-tuning:

```python
# Minimal examples (1-2) for simple tasks
examples = [example1]

# More examples (3-5) for complex schemas
examples = [example1, example2, example3, example4, example5]
```

### 5. Long Document Processing

Documents beyond token limits are chunked automatically:

```python
result = lx.extract(
    text_or_documents=long_document,  # any length
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=3,   # multiple passes for better recall
    max_workers=20,        # parallel processing
    max_char_buffer=1000   # chunk overlap for continuity
)
```

## Configuration

### Model Selection

```python
# Gemini models (recommended)
model_id="gemini-2.0-flash-exp"           # fast, cost-effective
model_id="gemini-2.0-flash-thinking-exp"  # complex reasoning
model_id="gemini-1.5-pro"                 # legacy

# OpenAI models
model_id="gpt-4o"       # GPT-4 Optimized
model_id="gpt-4o-mini"  # smaller, faster

# Local models via Ollama
model_id="gemma2:2b"    # local inference
model_url="http://localhost:11434"
```

### Scaling Parameters

```python
result = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,

    # Multi-pass extraction for better recall
    extraction_passes=3,

    # Parallel processing
    max_workers=20,

    # Chunk size tuning
    max_char_buffer=1000,

    # Model configuration
    model_id="gemini-2.0-flash-exp"
)
```

### Backend Configuration

**Vertex AI:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    language_model_params={
        "vertexai": True,
        "project": "your-gcp-project-id",
        "location": "us-central1"
    }
)
```

**Batch Processing:**

```python
language_model_params={
    "batch": {
        "enabled": True
    }
}
```

**OpenAI Configuration:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    fence_output=True,            # required for OpenAI
    use_schema_constraints=False  # disable Gemini-specific features
)
```

**Local Ollama:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",
    model_url="http://localhost:11434",
    use_schema_constraints=False
)
```

### Environment Variables

```bash
# API keys
LANGEXTRACT_API_KEY="gemini-api-key"
OPENAI_API_KEY="openai-api-key"

# Vertex AI
GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Model configuration
LANGEXTRACT_MODEL_ID="gemini-2.0-flash-exp"
LANGEXTRACT_MODEL_URL="http://localhost:11434"
```
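A small helper can centralize how these variables are read in your own pipeline code. This is a sketch under assumptions: `resolve_model_config` is not a langextract function (the library reads `LANGEXTRACT_API_KEY` itself), but explicit resolution is handy when you pass `model_id`, `model_url`, or `api_key` to `lx.extract()` yourself.

```python
import os

def resolve_model_config(env=os.environ):
    """Pick model settings from the environment, with sensible defaults.

    Hypothetical helper for illustration, not a langextract API.
    """
    return {
        # Fall back to the recommended fast model when unset
        "model_id": env.get("LANGEXTRACT_MODEL_ID", "gemini-2.0-flash-exp"),
        # None means "use the provider's default endpoint"
        "model_url": env.get("LANGEXTRACT_MODEL_URL"),
        # Prefer env vars over hardcoding; Gemini key wins over OpenAI here
        "api_key": env.get("LANGEXTRACT_API_KEY") or env.get("OPENAI_API_KEY"),
    }
```

The resulting dict can be splatted into the call, e.g. `lx.extract(..., **resolve_model_config())`.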

## Common Patterns

### Pattern 1: Clinical Note Extraction

```python
import langextract as lx

prompt = """Extract diagnoses, symptoms, and medications from clinical notes.
Include ICD-10 codes when available. Use exact medical terminology."""

examples = [
    lx.data.ExampleData(
        text="Patient presents with Type 2 Diabetes Mellitus (E11.9). Started on Metformin 500mg BID. Reports fatigue and increased thirst.",
        extractions=[
            lx.data.Extraction(
                extraction_class="diagnosis",
                extraction_text="Type 2 Diabetes Mellitus (E11.9)",
                attributes={"condition": "Type 2 Diabetes Mellitus", "icd10": "E11.9"}
            ),
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Metformin 500mg BID",
                attributes={"name": "Metformin", "dosage": "500mg", "frequency": "BID"}
            ),
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="fatigue",
                attributes={"symptom": "fatigue"}
            ),
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="increased thirst",
                attributes={"symptom": "polydipsia"}
            )
        ]
    )
]

# Process multiple clinical notes
clinical_notes = [
    "Note 1: Patient presents with...",
    "Note 2: Follow-up visit for...",
    "Note 3: New onset chest pain..."
]

results = lx.extract(
    text_or_documents=clinical_notes,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=2,
    max_workers=10
)

# Save the structured output
lx.io.save_annotated_documents(
    results,
    output_name="clinical_extractions.jsonl",
    output_dir="./output"
)
```

### Pattern 2: Radiology Report Structuring

```python
prompt = """Extract findings, impressions, and recommendations from radiology reports.
Include anatomical location, abnormality type, and severity."""

examples = [
    lx.data.ExampleData(
        text="FINDINGS: 3.2cm mass in right upper lobe. IMPRESSION: Suspicious for malignancy. RECOMMENDATION: Biopsy recommended.",
        extractions=[
            lx.data.Extraction(
                extraction_class="finding",
                extraction_text="3.2cm mass in right upper lobe",
                attributes={
                    "location": "right upper lobe",
                    "type": "mass",
                    "size": "3.2cm"
                }
            ),
            lx.data.Extraction(
                extraction_class="impression",
                extraction_text="Suspicious for malignancy",
                attributes={"diagnosis": "possible malignancy", "certainty": "suspicious"}
            ),
            lx.data.Extraction(
                extraction_class="recommendation",
                extraction_text="Biopsy recommended",
                attributes={"action": "biopsy"}
            )
        ]
    )
]
```

### Pattern 3: Multi-Document Processing

```python
import langextract as lx
from pathlib import Path

# Load multiple documents
documents = []
for file_path in Path("./documents").glob("*.txt"):
    with open(file_path, "r") as f:
        documents.append(f.read())

# Extract from all documents
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=3,
    max_workers=20
)

# `results` is a list of AnnotatedDocument objects
for i, result in enumerate(results):
    print(f"\nDocument {i+1}: {len(result.extractions)} extractions")
    for extraction in result.extractions:
        print(f"  - {extraction.extraction_class}: {extraction.extraction_text}")
```

### Pattern 4: Interactive Visualization

```python
# Generate interactive HTML
html_content = lx.visualize("extractions.jsonl")

# Save to a file
with open("interactive_results.html", "w") as f:
    f.write(html_content)

# Open in a browser (optional)
import webbrowser
webbrowser.open("interactive_results.html")
```

### Pattern 5: Custom Provider Plugin

See `examples/custom_provider_plugin/` for the full implementation.

```python
from langextract.providers import ProviderPlugin

class CustomProvider(ProviderPlugin):
    def extract(self, text, prompt, examples, **kwargs):
        # Custom extraction logic
        return extractions

    def supports_schema_constraints(self):
        return False

# Register the custom provider
lx.register_provider("custom", CustomProvider())

# Use the custom provider
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="custom",
    provider="custom"
)
```

## API Reference

### Core Functions

#### lx.extract()

The main extraction function.

```python
result = lx.extract(
    text_or_documents,           # str or list of str
    prompt_description,          # str: extraction instructions
    examples,                    # list of ExampleData
    model_id="gemini-2.0-flash-exp",  # str: model identifier
    extraction_passes=1,         # int: number of passes
    max_workers=None,            # int: parallel workers
    max_char_buffer=1000,        # int: chunk overlap
    language_model_params=None,  # dict: model config
    fence_output=False,          # bool: required for OpenAI
    use_schema_constraints=True, # bool: use schema enforcement
    model_url=None,              # str: custom model endpoint
    api_key=None                 # str: API key (prefer env var)
)
```

Returns: `AnnotatedDocument` or `list[AnnotatedDocument]`.
#### lx.visualize()

Generate an interactive HTML visualization.

```python
html_content = lx.visualize(
    jsonl_file_path,            # str: path to JSONL file
    title="Extraction Results", # str: HTML page title
    show_attributes=True        # bool: display attributes
)
```

Returns: `str` (HTML content).

#### lx.io.save_annotated_documents()

Save results in JSONL format.

```python
lx.io.save_annotated_documents(
    annotated_documents,  # list of AnnotatedDocument
    output_name,          # str: filename (e.g., "results.jsonl")
    output_dir="."        # str: output directory
)
```

### Data Classes

#### ExampleData

A few-shot example definition.

```python
example = lx.data.ExampleData(
    text="Example text here",
    extractions=[
        lx.data.Extraction(...)
    ]
)
```

#### Extraction

A single extraction definition.

```python
extraction = lx.data.Extraction(
    extraction_class="medication",   # str: entity type
    extraction_text="Aspirin 81mg",  # str: exact text
    attributes={                     # dict: metadata
        "name": "Aspirin",
        "dosage": "81mg"
    },
    start_char=0,                    # int: start position (auto-set)
    end_char=13                      # int: end position (auto-set)
)
```

#### AnnotatedDocument

The extraction results for a document.

```python
result.text         # str: original text
result.extractions  # list of Extraction
result.metadata     # dict: additional info
```
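For downstream processing it is often convenient to flatten these objects into plain, JSON-serializable rows. A sketch that relies only on the attributes documented above (the helper name is ours, not a library API):

```python
def extractions_to_dicts(doc):
    """Flatten an AnnotatedDocument into JSON-serializable rows.

    Hypothetical helper; uses only the documented fields
    (extraction_class, extraction_text, start_char, end_char, attributes).
    """
    return [
        {
            "class": e.extraction_class,
            "text": e.extraction_text,
            "span": [e.start_char, e.end_char],
            # Merge structured attributes into the row
            **(e.attributes or {}),
        }
        for e in doc.extractions
    ]
```

The rows can then be written with `json.dump` or loaded into a DataFrame for analysis.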

## Best Practices

### Extraction Design

1. **Write Clear Prompts**: Be specific about what to extract and how.

    ```python
    # Good
    prompt = "Extract medications with dosage, frequency, and route of administration. Use exact medical terminology."

    # Avoid
    prompt = "Extract medications."
    ```

2. **Provide Quality Examples**: 1-5 well-crafted examples beat many poor ones.

    ```python
    # Include edge cases in the examples
    examples = [
        normal_case_example,
        edge_case_example,
        complex_case_example
    ]
    ```

3. **Use Exact Text**: Extract verbatim from the source for accurate grounding.

    ```python
    # Good
    extraction_text="Lisinopril 10mg daily"

    # Avoid paraphrasing
    extraction_text="10mg lisinopril taken once per day"
    ```

4. **Define Attributes Clearly**: Structure metadata consistently.

    ```python
    attributes={
        "name": "Lisinopril",   # drug name
        "dosage": "10mg",       # amount
        "frequency": "daily",   # how often
        "route": "oral"         # how taken
    }
    ```

Performance Optimization

性能优化

  1. Multi-Pass for Long Documents: Improves recall
    python
    extraction_passes=3  # 2-3 passes recommended for thorough extraction
  2. Parallel Processing: Speed up batch operations
    python
    max_workers=20  # Adjust based on API rate limits
  3. Chunk Size Tuning: Balance accuracy and context
    python
    max_char_buffer=1000  # Larger for context, smaller for speed
  4. Model Selection: Choose based on task complexity
    python
    # Simple extraction
    model_id="gemini-2.0-flash-exp"
    
    # Complex reasoning
    model_id="gemini-2.0-flash-thinking-exp"
  1. 长文档使用多轮提取:提升召回率
    python
    extraction_passes=3  # 推荐2-3轮提取以保证全面性
  2. 并行处理:加速批量操作
    python
    max_workers=20  # 根据API速率限制调整
  3. 分块大小调整:平衡准确性与上下文
    python
    max_char_buffer=1000  # 更大的值保留更多上下文,更小的值提升速度
  4. 模型选择:根据任务复杂度选择
    python
    # 简单提取任务
    model_id="gemini-2.0-flash-exp"
    
    # 复杂推理任务
    model_id="gemini-2.0-flash-thinking-exp"

Production Deployment

生产部署

  1. API Key Security: Never hardcode keys
    python
    # Good: Use environment variables
    import os
    api_key = os.getenv("LANGEXTRACT_API_KEY")
    
    # Avoid: Hardcoding
    api_key = "AIza..."  # Never do this
  2. Error Handling: Handle API failures gracefully
    python
    try:
        result = lx.extract(...)
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        # Implement retry logic or fallback
  3. Cost Management: Monitor API usage
    python
    # Use cheaper models for bulk processing
    model_id="gemini-2.0-flash-exp"  # vs "gemini-1.5-pro"
    
    # Batch processing for cost efficiency
    language_model_params={"batch": {"enabled": True}}
  4. Validation: Verify extraction quality
    python
    for extraction in result.extractions:
        # Validate extraction is within document bounds
        assert 0 <= extraction.start_char < len(result.text)
        assert extraction.end_char <= len(result.text)
    
        # Verify text matches
        extracted = result.text[extraction.start_char:extraction.end_char]
        assert extracted == extraction.extraction_text
  1. API密钥安全:切勿硬编码密钥
    python
    # 良好实践:使用环境变量
    import os
    api_key = os.getenv("LANGEXTRACT_API_KEY")
    
    # 避免:硬编码
    api_key = "AIza..."  # 绝对不要这样做
  2. 错误处理:优雅处理API失败
    python
    try:
        result = lx.extract(...)
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        # 实现重试逻辑或备选方案
  3. 成本管理:监控API使用情况
    python
    # 批量处理使用更经济的模型
    model_id="gemini-2.0-flash-exp"  # 对比"gemini-1.5-pro"
    
    # 批量处理提升成本效率
    language_model_params={"batch": {"enabled": True}}
  4. 验证:验证提取质量
    python
    for extraction in result.extractions:
        # 验证提取位置在文档范围内
        assert 0 <= extraction.start_char < len(result.text)
        assert extraction.end_char <= len(result.text)
    
        # 验证文本匹配
        extracted = result.text[extraction.start_char:extraction.end_char]
        assert extracted == extraction.extraction_text
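The retry logic mentioned under error handling can be sketched as a small exponential-backoff wrapper. The names `extract_with_retry` and `flaky` are illustrative, not LangExtract API:

```python
import time

def extract_with_retry(extract_fn, max_attempts=3, base_delay=1.0):
    """Call extract_fn; on failure, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return extract_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate to the caller
            time.sleep(base_delay * 2 ** attempt)

# Demo with a stand-in that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

result = extract_with_retry(flaky, base_delay=0)  # zero delay for the demo
assert result == "ok" and calls["n"] == 3
```

In production, `extract_fn` would wrap the real call (e.g. `lambda: lx.extract(...)`), and you would catch the provider's specific transient exceptions rather than bare `Exception`.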

Common Pitfalls

常见陷阱

  1. Overlapping Extractions
    • Issue: Extractions overlap or duplicate
    • Solution: Specify in prompt "Do not overlap entities"
  2. Paraphrasing Instead of Exact Text
    • Issue: Extracted text doesn't match original
    • Solution: Prompt "Use exact text from document. Do not paraphrase."
  3. Insufficient Examples
    • Issue: Poor extraction quality
    • Solution: Provide 3-5 diverse examples covering edge cases
  4. Model Limitations
    • Issue: Schema constraints not supported on all models
    • Solution: Set use_schema_constraints=False for OpenAI/Ollama
  1. 重叠提取
    • 问题:提取结果重叠或重复
    • 解决方案:在提示词中明确说明“不要重叠实体”
  2. 意译而非使用原文
    • 问题:提取文本与原文不匹配
    • 解决方案:在提示词中要求“使用文档中的原文,不要意译”
  3. 示例不足
    • 问题:提取质量差
    • 解决方案:提供3-5个涵盖边缘情况的多样化示例
  4. 模型限制
    • 问题:部分模型不支持schema约束
    • 解决方案:对OpenAI/Ollama设置 use_schema_constraints=False
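Pitfall 1 can also be caught after the fact. The sketch below (our own helper, not part of LangExtract) flags overlapping character spans; because any overlap among sorted spans always includes a consecutive pair, checking neighbors is enough to detect that an overlap exists:

```python
def find_overlaps(spans):
    """Return overlapping consecutive (start, end) pairs after sorting by start."""
    spans = sorted(spans)
    return [
        (a, b)
        for a, b in zip(spans, spans[1:])
        if b[0] < a[1]  # next span starts before the current one ends
    ]

assert find_overlaps([(0, 5), (5, 10)]) == []                 # touching is fine
assert find_overlaps([(0, 5), (3, 8)]) == [((0, 5), (3, 8))]  # genuine overlap
```

Feed it the `(start_char, end_char)` pairs from a result's extractions; a non-empty return means the prompt should be tightened with an explicit "Do not overlap entities" instruction.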

Troubleshooting

故障排除

Common Issues

常见问题

Issue 1: API Authentication Failed

问题1:API认证失败

Symptoms:
  • AuthenticationError: Invalid API key
  • Permission denied errors
Solution:
bash
症状:
  • AuthenticationError: Invalid API key
  • Permission denied 错误
解决方案:
bash

Verify API key is set

验证API密钥是否已设置

echo $LANGEXTRACT_API_KEY
echo $LANGEXTRACT_API_KEY

Set API key

设置API密钥

export LANGEXTRACT_API_KEY="your-key-here"
export LANGEXTRACT_API_KEY="your-key-here"

For OpenAI

针对OpenAI

export OPENAI_API_KEY="your-openai-key"
export OPENAI_API_KEY="your-openai-key"

Verify key works

验证密钥是否生效

python -c "import os; print(os.getenv('LANGEXTRACT_API_KEY'))"
python -c "import os; print(os.getenv('LANGEXTRACT_API_KEY'))"

Issue 2: Schema Constraints Error

问题2:Schema约束错误

Symptoms:
  • Schema constraints not supported error
  • Malformed output with OpenAI or Ollama
Solution:
python
症状:
  • Schema constraints not supported 错误
  • 使用OpenAI或Ollama时输出格式错误
解决方案:
python

Disable schema constraints for non-Gemini models

对非Gemini模型禁用schema约束

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    use_schema_constraints=False,  # Disable for OpenAI
    fence_output=True              # Enable for OpenAI
)
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    use_schema_constraints=False,  # 对OpenAI禁用
    fence_output=True              # 对OpenAI启用
)

Issue 3: Token Limit Exceeded

问题3:令牌限制超出

Symptoms:
  • Token limit exceeded error
  • Truncated results
Solution:
python
症状:
  • Token limit exceeded 错误
  • 结果被截断
解决方案:
python

Use multi-pass extraction

使用多轮提取

result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    extraction_passes=3,   # Multiple passes
    max_char_buffer=1000,  # Adjust chunk size
    max_workers=10         # Parallel processing
)
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    extraction_passes=3,   # 多轮提取
    max_char_buffer=1000,  # 调整分块大小
    max_workers=10         # 并行处理
)

Issue 4: Poor Extraction Quality

问题4:提取质量差

Symptoms:
  • Missing entities
  • Incorrect extractions
  • Paraphrased text
Solution:
python
症状:
  • 遗漏实体
  • 提取结果错误
  • 文本被意译
解决方案:
python

Improve prompt specificity

提升提示词的具体性

prompt = """Extract medications with exact dosage and frequency. Use exact text from document. Do not paraphrase. Include generic and brand names. Extract discontinued medications as well."""
prompt = """Extract medications with exact dosage and frequency. Use exact text from document. Do not paraphrase. Include generic and brand names. Extract discontinued medications as well."""

Add more diverse examples

添加更多多样化示例

examples = [
    normal_case,
    edge_case_1,
    edge_case_2,
    complex_case
]
examples = [
    normal_case,
    edge_case_1,
    edge_case_2,
    complex_case
]

Increase extraction passes

增加提取轮次

extraction_passes=3
extraction_passes=3

Try more capable model

尝试更强大的模型

model_id="gemini-2.0-flash-thinking-exp"
model_id="gemini-2.0-flash-thinking-exp"

Issue 5: Ollama Connection Failed

问题5:Ollama连接失败

Symptoms:
  • Connection refused to localhost:11434
  • Ollama model not found
Solution:
bash
症状:
  • 连接localhost:11434被拒绝
  • Ollama模型未找到
解决方案:
bash

Start Ollama server

启动Ollama服务器

ollama serve
ollama serve

Pull required model

拉取所需模型

ollama pull gemma2:2b
ollama pull gemma2:2b

Verify Ollama is running

验证Ollama是否运行

curl http://localhost:11434
curl http://localhost:11434

Use in langextract

在langextract中使用

python -c " import langextract as lx result = lx.extract( text_or_documents='test', prompt_description='Extract entities', examples=[], model_id='gemma2:2b', model_url='http://localhost:11434', use_schema_constraints=False ) "
undefined
python -c " import langextract as lx result = lx.extract( text_or_documents='test', prompt_description='Extract entities', examples=[], model_id='gemma2:2b', model_url='http://localhost:11434', use_schema_constraints=False ) "
undefined

Debugging Tips

调试技巧

  1. Enable Verbose Logging
    python
    import logging
    logging.basicConfig(level=logging.DEBUG)
  2. Inspect Intermediate Results
    python
    # Save each pass separately
    for i, result in enumerate(results):
        lx.io.save_annotated_documents(
            [result],
            output_name=f"pass_{i}.jsonl",
            output_dir="./debug"
        )
  3. Validate Examples
    python
    # Check examples match expected format
    for example in examples:
        for extraction in example.extractions:
            # Verify text is in example text
            assert extraction.extraction_text in example.text
            print(f"✓ {extraction.extraction_class}: {extraction.extraction_text}")
  4. Test with Simple Input First
    python
    # Start with minimal test
    test_result = lx.extract(
        text_or_documents="Patient on Aspirin 81mg daily.",
        prompt_description="Extract medications.",
        examples=[simple_example],
        model_id="gemini-2.0-flash-exp"
    )
    print(f"Extractions: {len(test_result.extractions)}")
  1. 启用详细日志
    python
    import logging
    logging.basicConfig(level=logging.DEBUG)
  2. 检查中间结果
    python
    # 单独保存每一轮的结果
    for i, result in enumerate(results):
        lx.io.save_annotated_documents(
            [result],
            output_name=f"pass_{i}.jsonl",
            output_dir="./debug"
        )
  3. 验证示例
    python
    # 检查示例是否符合预期格式
    for example in examples:
        for extraction in example.extractions:
            # 验证文本存在于示例文本中
            assert extraction.extraction_text in example.text
            print(f"✓ {extraction.extraction_class}: {extraction.extraction_text}")
  4. 先使用简单输入测试
    python
    # 从最小测试开始
    test_result = lx.extract(
        text_or_documents="Patient on Aspirin 81mg daily.",
        prompt_description="Extract medications.",
        examples=[simple_example],
        model_id="gemini-2.0-flash-exp"
    )
    print(f"Extractions: {len(test_result.extractions)}")

Advanced Topics

高级主题

Custom Extraction Schemas

自定义提取Schema

Define complex nested structures:
python
examples = [
    lx.data.ExampleData(
        text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
        extractions=[
            lx.data.Extraction(
                extraction_class="clinical_event",
                extraction_text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
                attributes={
                    "symptom": "chest pain",
                    "diagnostic_test": "ECG",
                    "finding": "ST elevation",
                    "diagnosis": "STEMI",
                    "severity": "severe",
                    "timeline": [
                        {"event": "symptom_onset", "description": "chest pain"},
                        {"event": "diagnostic", "description": "ECG shows ST elevation"},
                        {"event": "diagnosis", "description": "STEMI"}
                    ]
                }
            )
        ]
    )
]
定义复杂的嵌套结构:
python
examples = [
    lx.data.ExampleData(
        text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
        extractions=[
            lx.data.Extraction(
                extraction_class="clinical_event",
                extraction_text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
                attributes={
                    "symptom": "chest pain",
                    "diagnostic_test": "ECG",
                    "finding": "ST elevation",
                    "diagnosis": "STEMI",
                    "severity": "severe",
                    "timeline": [
                        {"event": "symptom_onset", "description": "chest pain"},
                        {"event": "diagnostic", "description": "ECG shows ST elevation"},
                        {"event": "diagnosis", "description": "STEMI"}
                    ]
                }
            )
        ]
    )
]

Batch Processing with Progress Tracking

带进度跟踪的批量处理

python
from tqdm import tqdm
import langextract as lx

documents = load_documents()  # List of documents
results = []

for i, doc in enumerate(tqdm(documents)):
    try:
        result = lx.extract(
            text_or_documents=doc,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.0-flash-exp"
        )
        results.append(result)

        # Save incrementally
        if (i + 1) % 100 == 0:
            lx.io.save_annotated_documents(
                results,
                output_name=f"batch_{i+1}.jsonl",
                output_dir="./batches"
            )
            results = []  # Clear for next batch
    except Exception as e:
        print(f"Failed on document {i}: {e}")
        continue
python
from tqdm import tqdm
import langextract as lx

documents = load_documents()  # 文档列表
results = []

for i, doc in enumerate(tqdm(documents)):
    try:
        result = lx.extract(
            text_or_documents=doc,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.0-flash-exp"
        )
        results.append(result)

        # 增量保存
        if (i + 1) % 100 == 0:
            lx.io.save_annotated_documents(
                results,
                output_name=f"batch_{i+1}.jsonl",
                output_dir="./batches"
            )
            results = []  # 清空以准备下一批
    except Exception as e:
        print(f"处理第{i}篇文档失败: {e}")
        continue

Integration with Data Pipelines

与数据流水线集成

python
import langextract as lx
import pandas as pd
python
import langextract as lx
import pandas as pd

Load data

加载数据

df = pd.read_csv("clinical_notes.csv")
df = pd.read_csv("clinical_notes.csv")

Extract from each note

从每篇记录中提取

extractions_data = []
for idx, row in df.iterrows():
    result = lx.extract(
        text_or_documents=row['note_text'],
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )
    for extraction in result.extractions:
        extractions_data.append({
            'patient_id': row['patient_id'],
            'note_date': row['note_date'],
            'extraction_class': extraction.extraction_class,
            'extraction_text': extraction.extraction_text,
            **extraction.attributes
        })
extractions_data = []
for idx, row in df.iterrows():
    result = lx.extract(
        text_or_documents=row['note_text'],
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )
    for extraction in result.extractions:
        extractions_data.append({
            'patient_id': row['patient_id'],
            'note_date': row['note_date'],
            'extraction_class': extraction.extraction_class,
            'extraction_text': extraction.extraction_text,
            **extraction.attributes
        })

Create structured DataFrame

创建结构化DataFrame

extractions_df = pd.DataFrame(extractions_data)
extractions_df.to_csv("structured_extractions.csv", index=False)
extractions_df = pd.DataFrame(extractions_data)
extractions_df.to_csv("structured_extractions.csv", index=False)

Performance Benchmarking

性能基准测试

python
import time
import langextract as lx

def benchmark_extraction(documents, model_id, passes=1):
    start = time.time()

    results = lx.extract(
        text_or_documents=documents,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
        extraction_passes=passes,
        max_workers=20
    )

    elapsed = time.time() - start
    total_extractions = sum(len(r.extractions) for r in results)

    print(f"Model: {model_id}")
    print(f"Passes: {passes}")
    print(f"Documents: {len(documents)}")
    print(f"Total extractions: {total_extractions}")
    print(f"Time: {elapsed:.2f}s")
    print(f"Throughput: {len(documents)/elapsed:.2f} docs/sec")
    print()
python
import time
import langextract as lx

def benchmark_extraction(documents, model_id, passes=1):
    start = time.time()

    results = lx.extract(
        text_or_documents=documents,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
        extraction_passes=passes,
        max_workers=20
    )

    elapsed = time.time() - start
    total_extractions = sum(len(r.extractions) for r in results)

    print(f"模型: {model_id}")
    print(f"提取轮次: {passes}")
    print(f"文档数量: {len(documents)}")
    print(f"总提取数: {total_extractions}")
    print(f"耗时: {elapsed:.2f}秒")
    print(f"吞吐量: {len(documents)/elapsed:.2f} 文档/秒")
    print()

Compare models

对比不同模型

benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=3)
benchmark_extraction(docs, "gpt-4o", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=3)
benchmark_extraction(docs, "gpt-4o", passes=1)

Examples

示例

Example Projects

示例项目

The repository includes several example implementations:
  1. Custom Provider Plugin (examples/custom_provider_plugin/)
    • How to create custom extraction backends
    • Integration with proprietary models
  2. Jupyter Notebooks (examples/notebooks/)
    • Interactive extraction workflows
    • Visualization and analysis
  3. Ollama Integration (examples/ollama/)
    • Local model usage
    • Privacy-preserving extraction
仓库中包含多个示例实现:
  1. 自定义提供方插件(examples/custom_provider_plugin/)
    • 如何创建自定义提取后端
    • 与专有模型集成
  2. Jupyter笔记本(examples/notebooks/)
    • 交互式提取工作流
    • 可视化与分析
  3. Ollama集成(examples/ollama/)
    • 本地模型使用
    • 隐私保护的提取

Medical Use Case

医疗用例

See examples/clinical_extraction.py for a complete medical extraction pipeline.
查看 examples/clinical_extraction.py 获取完整的医疗提取流水线。

Literary Analysis

文学分析

See examples/literary_extraction.py for character and relationship extraction from novels.
查看 examples/literary_extraction.py 获取从小说中提取角色与关系的示例。

Testing

测试

Running Tests

运行测试

bash
bash

Install test dependencies

安装测试依赖

pip install -e ".[test]"
pip install -e ".[test]"

Run all tests

运行所有测试

pytest tests
pytest tests

Run with coverage

运行测试并查看覆盖率

pytest tests --cov=langextract
pytest tests --cov=langextract

Run specific test

运行特定测试

pytest tests/test_extraction.py
pytest tests/test_extraction.py

Run integration tests

运行集成测试

pytest tests/integration/
pytest tests/integration/

Integration Testing with Ollama

与Ollama的集成测试

bash
bash

Install tox

安装tox

pip install tox
pip install tox

Run Ollama integration tests

运行Ollama集成测试

tox -e ollama-integration
tox -e ollama-integration

Writing Tests

编写测试

python
import langextract as lx

def test_basic_extraction():
    prompt = "Extract names."
    examples = [
        lx.data.ExampleData(
            text="John Smith visited the clinic.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="name",
                    extraction_text="John Smith"
                )
            ]
        )
    ]

    result = lx.extract(
        text_or_documents="Mary Johnson was the doctor.",
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )

    assert len(result.extractions) >= 1
    assert result.extractions[0].extraction_class == "name"
python
import langextract as lx

def test_basic_extraction():
    prompt = "Extract names."
    examples = [
        lx.data.ExampleData(
            text="John Smith visited the clinic.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="name",
                    extraction_text="John Smith"
                )
            ]
        )
    ]

    result = lx.extract(
        text_or_documents="Mary Johnson was the doctor.",
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )

    assert len(result.extractions) >= 1
    assert result.extractions[0].extraction_class == "name"

Resources

资源

Official Documentation

官方文档

Model Documentation

模型文档

Related Tools

相关工具

  • Google AI Studio: Web interface for Gemini models
  • Vertex AI Workbench: Enterprise AI development
  • LangChain: LLM application framework
  • Instructor: Structured outputs library
  • Google AI Studio:Gemini模型的Web界面
  • Vertex AI Workbench:企业级AI开发平台
  • LangChain:LLM应用框架
  • Instructor:结构化输出库

Use Case Examples

用例示例

  • Clinical information extraction
  • Legal document analysis
  • Scientific literature mining
  • Customer feedback structuring
  • Contract entity extraction
  • 临床信息提取
  • 法律文档分析
  • 科学文献挖掘
  • 客户反馈结构化
  • 合同实体提取

Contributing

贡献

Contributions welcome! See the official repository for guidelines: https://github.com/google/langextract
欢迎贡献代码!请查看官方仓库的贡献指南: https://github.com/google/langextract

Development Setup

开发环境设置

bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
pre-commit install
bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
pre-commit install

Running CI Locally

本地运行CI

bash
bash

Full test matrix

完整测试矩阵

tox
tox

Specific Python version

特定Python版本

tox -e py310
tox -e py310

Code formatting

代码格式化

black langextract/
isort langextract/
black langextract/
isort langextract/

Linting

代码检查

flake8 langextract/
mypy langextract/
flake8 langextract/
mypy langextract/

Version Information

版本信息

Last Updated: 2025-12-25
Skill Version: 1.0.0
LangExtract Version: Latest (check PyPI)

This skill provides comprehensive guidance for LangExtract based on official documentation and examples. For the latest updates, refer to the GitHub repository.
最后更新:2025-12-25
Skill版本:1.0.0
LangExtract版本:最新版(请查看PyPI)

本Skill基于官方文档与示例提供LangExtract的全面使用指南。如需最新更新,请参考GitHub仓库。