# LangExtract - Structured Information Extraction

Expert assistance for extracting structured, source-grounded information from unstructured text using large language models.
## When to Use This Skill
Use this skill when you need to:
- Extract structured entities from unstructured text (medical notes, reports, documents)
- Maintain precise source grounding (map extracted data to original text locations)
- Process long documents beyond LLM token limits
- Visualize extraction results with interactive HTML highlighting
- Extract clinical information from medical records
- Structure radiology or pathology reports
- Extract medications, diagnoses, or symptoms from clinical notes
- Analyze literary texts for characters, emotions, relationships
- Build domain-specific extraction pipelines
- Work with Gemini, OpenAI, or local models (Ollama)
- Generate schema-compliant outputs without fine-tuning
## Overview

LangExtract is a Python library by Google for extracting structured information from unstructured text using large language models. It emphasizes:

- **Source Grounding**: Every extraction maps to its exact location in the source text
- **Structured Outputs**: Schema-compliant results with controlled generation
- **Long Document Processing**: Intelligent chunking and multi-pass extraction
- **Interactive Visualization**: Self-contained HTML for reviewing extractions in context
- **Flexible LLM Support**: Works with Gemini, OpenAI, and local models
- **Few-Shot Learning**: Requires only quality examples, no expensive fine-tuning
**Key Resources:**

- GitHub repository: https://github.com/google/langextract
## Installation

### Prerequisites
- Python 3.8 or higher
- API key for Gemini (AI Studio), OpenAI, or local Ollama setup
### Basic Installation
```bash
# Install from PyPI (recommended)
pip install langextract

# Install with OpenAI support
pip install langextract[openai]

# Install with development tools
pip install langextract[dev]
```
### Install from Source

```bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e .

# For development with testing
pip install -e ".[test]"
```
### Docker Installation

```bash
# Build Docker image
docker build -t langextract .

# Run with API key
docker run --rm \
  -e LANGEXTRACT_API_KEY="your-api-key" \
  langextract python your_script.py
```
### API Key Setup

**Gemini (Google AI Studio):**

```bash
export LANGEXTRACT_API_KEY="your-gemini-api-key"
```

Get keys from: https://ai.google.dev/

**OpenAI:**

```bash
export OPENAI_API_KEY="your-openai-api-key"
```

**Vertex AI (Enterprise):**

- Use service account authentication
- Set the project in `language_model_params`

**.env File (Development):**

```bash
# Create .env file
echo "LANGEXTRACT_API_KEY=your-key-here" > .env
```
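At runtime, the key can be read back from the environment with the standard library. A minimal sketch, assuming the variable name matches the export above (`get_api_key` is a hypothetical helper, not part of LangExtract):

```python
import os

def get_api_key(env_var: str = "LANGEXTRACT_API_KEY") -> str:
    """Read the API key from the environment, failing fast if it is unset."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or add it to your .env file"
        )
    return key
```

Libraries such as python-dotenv can load the `.env` file into the environment before this check runs.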
## Quick Start

### Basic Extraction Example
```python
import langextract as lx
import textwrap

# 1. Define extraction task
prompt = textwrap.dedent("""
    Extract all medications mentioned in the clinical note.
    Include medication name, dosage, and frequency.
    Use exact text from the document.""")

# 2. Provide examples (few-shot learning)
examples = [
    lx.data.ExampleData(
        text="Patient prescribed Lisinopril 10mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Lisinopril 10mg daily",
                attributes={
                    "name": "Lisinopril",
                    "dosage": "10mg",
                    "frequency": "daily",
                    "indication": "hypertension"
                }
            )
        ]
    )
]

# 3. Input text to extract from
input_text = """
Patient continues on Metformin 500mg twice daily for diabetes management.
Started on Amlodipine 5mg once daily for blood pressure control.
Discontinued Aspirin 81mg due to side effects.
"""

# 4. Run extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp"
)

# 5. Access results
for extraction in result.extractions:
    print(f"Medication: {extraction.extraction_text}")
    print(f"  Name: {extraction.attributes.get('name')}")
    print(f"  Dosage: {extraction.attributes.get('dosage')}")
    print(f"  Frequency: {extraction.attributes.get('frequency')}")
    print(f"  Location: {extraction.start_char}-{extraction.end_char}")
    print()

# 6. Save and visualize
lx.io.save_annotated_documents(
    [result],
    output_name="medications.jsonl",
    output_dir="."
)
html_content = lx.visualize("medications.jsonl")
with open("medications.html", "w") as f:
    f.write(html_content)
```
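The saved JSONL file holds one JSON object per line, so it can be read back with the standard library alone. A minimal sketch (the exact field names inside each record are LangExtract's own; this reader is format-agnostic):

```python
import json

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```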
### Literary Text Example
```python
import langextract as lx

prompt = """Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities."""

examples = [
    lx.data.ExampleData(
        text="ROMEO entered the garden, filled with wonder at JULIET's beauty.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="JULIET",
                attributes={}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="ROMEO ... JULIET's beauty",
                attributes={
                    "subject": "ROMEO",
                    "relation": "admires",
                    "object": "JULIET"
                }
            )
        ]
    )
]

text = """Act 2, Scene 2: The Capulet's orchard.
ROMEO appears beneath JULIET's balcony, gazing upward with longing.
JULIET steps onto the balcony, unaware of ROMEO's presence below."""

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp"
)
```
## Core Concepts
### 1. Extraction Classes
Define categories of entities to extract:

```python
# Single class
extraction_class="medication"

# Multiple classes via examples
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(extraction_class="diagnosis", ...),
            lx.data.Extraction(extraction_class="symptom", ...),
            lx.data.Extraction(extraction_class="medication", ...)
        ]
    )
]
```
### 2. Source Grounding
Every extraction includes its precise text location:

```python
extraction = result.extractions[0]
print(f"Text: {extraction.extraction_text}")
print(f"Start: {extraction.start_char}")
print(f"End: {extraction.end_char}")

# Extract from original document
original_text = input_text[extraction.start_char:extraction.end_char]
```
### 3. Attributes
Add structured metadata to extractions:

```python
lx.data.Extraction(
    extraction_class="medication",
    extraction_text="Lisinopril 10mg daily",
    attributes={
        "name": "Lisinopril",
        "dosage": "10mg",
        "frequency": "daily",
        "route": "oral",
        "indication": "hypertension"
    }
)
```

### 4. Few-Shot Learning
Provide 1-5 quality examples instead of fine-tuning:

```python
# Minimal examples (1-2) for simple tasks
examples = [example1]

# More examples (3-5) for complex schemas
examples = [example1, example2, example3, example4, example5]
```

### 5. Long Document Processing
Automatic chunking handles documents beyond token limits:

```python
result = lx.extract(
    text_or_documents=long_document,  # Any length
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=3,    # Multiple passes for better recall
    max_workers=20,         # Parallel processing
    max_char_buffer=1000    # Max characters per chunk
)
```

## Configuration
### Model Selection
```python
# Gemini models (recommended)
model_id="gemini-2.0-flash-exp"           # Fast, cost-effective
model_id="gemini-2.0-flash-thinking-exp"  # Complex reasoning
model_id="gemini-1.5-pro"                 # Legacy

# OpenAI models
model_id="gpt-4o"       # GPT-4 Optimized
model_id="gpt-4o-mini"  # Smaller, faster

# Local models via Ollama
model_id="gemma2:2b"    # Local inference
model_url="http://localhost:11434"
```
### Scaling Parameters
```python
result = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    # Multi-pass extraction for better recall
    extraction_passes=3,
    # Parallel processing
    max_workers=20,
    # Chunk size tuning
    max_char_buffer=1000,
    # Model configuration
    model_id="gemini-2.0-flash-exp"
)
```

### Backend Configuration
**Vertex AI:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    language_model_params={
        "vertexai": True,
        "project": "your-gcp-project-id",
        "location": "us-central1"
    }
)
```

**Batch Processing:**

```python
language_model_params={
    "batch": {
        "enabled": True
    }
}
```

**OpenAI Configuration:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    fence_output=True,            # Required for OpenAI
    use_schema_constraints=False  # Disable Gemini-specific features
)
```

**Local Ollama:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",
    model_url="http://localhost:11434",
    use_schema_constraints=False
)
```

### Environment Variables
```bash
# API Keys
LANGEXTRACT_API_KEY="gemini-api-key"
OPENAI_API_KEY="openai-api-key"

# Vertex AI
GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Model configuration
LANGEXTRACT_MODEL_ID="gemini-2.0-flash-exp"
LANGEXTRACT_MODEL_URL="http://localhost:11434"
```

## Common Patterns
### Pattern 1: Clinical Note Extraction
```python
import langextract as lx

prompt = """Extract diagnoses, symptoms, and medications from clinical notes.
Include ICD-10 codes when available. Use exact medical terminology."""

examples = [
    lx.data.ExampleData(
        text="Patient presents with Type 2 Diabetes Mellitus (E11.9). Started on Metformin 500mg BID. Reports fatigue and increased thirst.",
        extractions=[
            lx.data.Extraction(
                extraction_class="diagnosis",
                extraction_text="Type 2 Diabetes Mellitus (E11.9)",
                attributes={"condition": "Type 2 Diabetes Mellitus", "icd10": "E11.9"}
            ),
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Metformin 500mg BID",
                attributes={"name": "Metformin", "dosage": "500mg", "frequency": "BID"}
            ),
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="fatigue",
                attributes={"symptom": "fatigue"}
            ),
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="increased thirst",
                attributes={"symptom": "polydipsia"}
            )
        ]
    )
]

# Process multiple clinical notes
clinical_notes = [
    "Note 1: Patient presents with...",
    "Note 2: Follow-up visit for...",
    "Note 3: New onset chest pain..."
]

results = lx.extract(
    text_or_documents=clinical_notes,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=2,
    max_workers=10
)

# Save structured output
lx.io.save_annotated_documents(
    results,
    output_name="clinical_extractions.jsonl",
    output_dir="./output"
)
```
### Pattern 2: Radiology Report Structuring
```python
prompt = """Extract findings, impressions, and recommendations from radiology reports.
Include anatomical location, abnormality type, and severity."""

examples = [
    lx.data.ExampleData(
        text="FINDINGS: 3.2cm mass in right upper lobe. IMPRESSION: Suspicious for malignancy. RECOMMENDATION: Biopsy recommended.",
        extractions=[
            lx.data.Extraction(
                extraction_class="finding",
                extraction_text="3.2cm mass in right upper lobe",
                attributes={
                    "location": "right upper lobe",
                    "type": "mass",
                    "size": "3.2cm"
                }
            ),
            lx.data.Extraction(
                extraction_class="impression",
                extraction_text="Suspicious for malignancy",
                attributes={"diagnosis": "possible malignancy", "certainty": "suspicious"}
            ),
            lx.data.Extraction(
                extraction_class="recommendation",
                extraction_text="Biopsy recommended",
                attributes={"action": "biopsy"}
            )
        ]
    )
]
```

### Pattern 3: Multi-Document Processing
```python
import langextract as lx
from pathlib import Path

# Load multiple documents
documents = []
for file_path in Path("./documents").glob("*.txt"):
    with open(file_path, "r") as f:
        documents.append(f.read())

# Extract from all documents
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=3,
    max_workers=20
)

# Results is a list of AnnotatedDocument objects
for i, result in enumerate(results):
    print(f"\nDocument {i+1}: {len(result.extractions)} extractions")
    for extraction in result.extractions:
        print(f"  - {extraction.extraction_class}: {extraction.extraction_text}")
```
### Pattern 4: Interactive Visualization
```python
# Generate interactive HTML
html_content = lx.visualize("extractions.jsonl")

# Save to file
with open("interactive_results.html", "w") as f:
    f.write(html_content)

# Open in browser (optional)
import webbrowser
webbrowser.open("interactive_results.html")
```
### Pattern 5: Custom Provider Plugin
See `examples/custom_provider_plugin/` for the full implementation.

```python
import langextract as lx
from langextract.providers import ProviderPlugin

class CustomProvider(ProviderPlugin):
    def extract(self, text, prompt, examples, **kwargs):
        # Custom extraction logic
        return extractions

    def supports_schema_constraints(self):
        return False

# Register custom provider
lx.register_provider("custom", CustomProvider())

# Use custom provider
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="custom",
    provider="custom"
)
```
## API Reference

### Core Functions
#### `lx.extract()`

Main extraction function.

```python
result = lx.extract(
    text_or_documents,                # str or list of str
    prompt_description,               # str: extraction instructions
    examples,                         # list of ExampleData
    model_id="gemini-2.0-flash-exp",  # str: model identifier
    extraction_passes=1,              # int: number of passes
    max_workers=None,                 # int: parallel workers
    max_char_buffer=1000,             # int: max characters per chunk
    language_model_params=None,       # dict: model config
    fence_output=False,               # bool: required for OpenAI
    use_schema_constraints=True,      # bool: use schema enforcement
    model_url=None,                   # str: custom model endpoint
    api_key=None                      # str: API key (prefer env var)
)
```

Returns: `AnnotatedDocument` or `list[AnnotatedDocument]`

#### `lx.visualize()`

Generate an interactive HTML visualization.
```python
html_content = lx.visualize(
    jsonl_file_path,             # str: path to JSONL file
    title="Extraction Results",  # str: HTML page title
    show_attributes=True         # bool: display attributes
)
```

Returns: `str` (HTML content)

#### `lx.io.save_annotated_documents()`

Save results to JSONL format.
```python
lx.io.save_annotated_documents(
    annotated_documents,  # list of AnnotatedDocument
    output_name,          # str: filename (e.g., "results.jsonl")
    output_dir="."        # str: output directory
)
```

### Data Classes
#### `ExampleData`

Few-shot example definition.

```python
example = lx.data.ExampleData(
    text="Example text here",
    extractions=[
        lx.data.Extraction(...)
    ]
)
```

#### `Extraction`

Single extraction definition.
```python
extraction = lx.data.Extraction(
    extraction_class="medication",   # str: entity type
    extraction_text="Aspirin 81mg",  # str: exact text
    attributes={                     # dict: metadata
        "name": "Aspirin",
        "dosage": "81mg"
    },
    start_char=0,                    # int: start position (auto-set)
    end_char=13                      # int: end position (auto-set)
)
```

#### `AnnotatedDocument`

Extraction results for a document.
```python
result.text         # str: original text
result.extractions  # list of Extraction
result.metadata     # dict: additional info
```

## Best Practices
### Extraction Design
1. **Write Clear Prompts**: Be specific about what to extract and how.

   ```python
   # Good
   prompt = "Extract medications with dosage, frequency, and route of administration. Use exact medical terminology."

   # Avoid
   prompt = "Extract medications."
   ```

2. **Provide Quality Examples**: 1-5 well-crafted examples beat many poor ones.

   ```python
   # Include edge cases in examples
   examples = [
       normal_case_example,
       edge_case_example,
       complex_case_example
   ]
   ```

3. **Use Exact Text**: Extract verbatim from the source for accurate grounding.

   ```python
   # Good
   extraction_text="Lisinopril 10mg daily"

   # Avoid paraphrasing
   extraction_text="10mg lisinopril taken once per day"
   ```

4. **Define Attributes Clearly**: Structure metadata consistently.

   ```python
   attributes={
       "name": "Lisinopril",  # Drug name
       "dosage": "10mg",      # Amount
       "frequency": "daily",  # How often
       "route": "oral"        # How taken
   }
   ```
### Performance Optimization
1. **Multi-Pass for Long Documents**: Improves recall.

   ```python
   extraction_passes=3  # 2-3 passes recommended for thorough extraction
   ```

2. **Parallel Processing**: Speeds up batch operations.

   ```python
   max_workers=20  # Adjust based on API rate limits
   ```

3. **Chunk Size Tuning**: Balances accuracy and context.

   ```python
   max_char_buffer=1000  # Larger for context, smaller for speed
   ```

4. **Model Selection**: Choose based on task complexity.

   ```python
   # Simple extraction
   model_id="gemini-2.0-flash-exp"

   # Complex reasoning
   model_id="gemini-2.0-flash-thinking-exp"
   ```
### Production Deployment
1. **API Key Security**: Never hardcode keys.

   ```python
   # Good: Use environment variables
   import os
   api_key = os.getenv("LANGEXTRACT_API_KEY")

   # Avoid: Hardcoding
   api_key = "AIza..."  # Never do this
   ```

2. **Error Handling**: Handle API failures gracefully.

   ```python
   try:
       result = lx.extract(...)
   except Exception as e:
       logger.error(f"Extraction failed: {e}")
       # Implement retry logic or fallback
   ```

3. **Cost Management**: Monitor API usage.

   ```python
   # Use cheaper models for bulk processing
   model_id="gemini-2.0-flash-exp"  # vs "gemini-1.5-pro"

   # Batch processing for cost efficiency
   language_model_params={"batch": {"enabled": True}}
   ```

4. **Validation**: Verify extraction quality.

   ```python
   for extraction in result.extractions:
       # Validate extraction is within document bounds
       assert 0 <= extraction.start_char < len(result.text)
       assert extraction.end_char <= len(result.text)

       # Verify text matches
       extracted = result.text[extraction.start_char:extraction.end_char]
       assert extracted == extraction.extraction_text
   ```
## Common Pitfalls
1. **Overlapping Extractions**
   - Issue: Extractions overlap or duplicate
   - Solution: Specify in the prompt: "Do not overlap entities"

2. **Paraphrasing Instead of Exact Text**
   - Issue: Extracted text doesn't match the original
   - Solution: Prompt: "Use exact text from document. Do not paraphrase."

3. **Insufficient Examples**
   - Issue: Poor extraction quality
   - Solution: Provide 3-5 diverse examples covering edge cases

4. **Model Limitations**
   - Issue: Schema constraints are not supported on all models
   - Solution: Set `use_schema_constraints=False` for OpenAI/Ollama
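The first pitfall can also be caught after the fact with a small post-processing check on the character offsets. A minimal sketch (the `(start, end)` tuples stand in for each extraction's `start_char`/`end_char`; `find_overlaps` is a hypothetical helper, not part of LangExtract):

```python
def find_overlaps(spans):
    """Return index pairs of character spans (start, end) that overlap.

    Spans are compared in order of start position, so only adjacent
    spans need to be checked for overlap.
    """
    overlaps = []
    ordered = sorted(range(len(spans)), key=lambda i: spans[i][0])
    for a, b in zip(ordered, ordered[1:]):
        if spans[b][0] < spans[a][1]:  # next span starts before previous ends
            overlaps.append((a, b))
    return overlaps
```

Running this over a result's extractions flags candidates for the "Do not overlap entities" prompt fix.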
## Troubleshooting

### Common Issues

#### Issue 1: API Authentication Failed
Symptoms:
AuthenticationError: Invalid API key- errors
Permission denied
Solution:
bash
undefined症状:
AuthenticationError: Invalid API key- 错误
Permission denied
解决方案:
bash
undefinedVerify API key is set
验证API密钥是否已设置
echo $LANGEXTRACT_API_KEY
echo $LANGEXTRACT_API_KEY
Set API key
设置API密钥
export LANGEXTRACT_API_KEY="your-key-here"
export LANGEXTRACT_API_KEY="your-key-here"
For OpenAI
针对OpenAI
export OPENAI_API_KEY="your-openai-key"
export OPENAI_API_KEY="your-openai-key"
Verify key works
验证密钥是否生效
python -c "import os; print(os.getenv('LANGEXTRACT_API_KEY'))"
undefinedpython -c "import os; print(os.getenv('LANGEXTRACT_API_KEY'))"
#### Issue 2: Schema Constraints Error

**Symptoms:**
- `Schema constraints not supported` error
- Malformed output with OpenAI or Ollama

**Solution:**

```python
# Disable schema constraints for non-Gemini models
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    use_schema_constraints=False,  # Disable for OpenAI
    fence_output=True  # Enable for OpenAI
)
```
#### Issue 3: Token Limit Exceeded

**Symptoms:**
- `Token limit exceeded` error
- Truncated results

**Solution:**

```python
# Use multi-pass extraction
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    extraction_passes=3,   # Multiple passes
    max_char_buffer=1000,  # Adjust chunk size
    max_workers=10         # Parallel processing
)
```
#### Issue 4: Poor Extraction Quality

**Symptoms:**
- Missing entities
- Incorrect extractions
- Paraphrased text

**Solution:**

```python
# Improve prompt specificity
prompt = """Extract medications with exact dosage and frequency.
Use exact text from document. Do not paraphrase.
Include generic and brand names.
Extract discontinued medications as well."""

# Add more diverse examples
examples = [
    normal_case,
    edge_case_1,
    edge_case_2,
    complex_case
]

# Increase extraction passes
extraction_passes = 3

# Try a more capable model
model_id = "gemini-2.0-flash-thinking-exp"
```
#### Issue 5: Ollama Connection Failed

**Symptoms:**
- `Connection refused` to localhost:11434
- Ollama model not found

**Solution:**

```bash
# Start Ollama server
ollama serve

# Pull required model
ollama pull gemma2:2b

# Verify Ollama is running
ollama list

# Use in langextract
python -c "
import langextract as lx
result = lx.extract(
    text_or_documents='test',
    prompt_description='Extract entities',
    examples=[],
    model_id='gemma2:2b',
    model_url='http://localhost:11434',
    use_schema_constraints=False
)
"
```
### Debugging Tips

- **Enable Verbose Logging**

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

- **Inspect Intermediate Results**

```python
# Save each pass separately
for i, result in enumerate(results):
    lx.io.save_annotated_documents(
        [result],
        output_name=f"pass_{i}.jsonl",
        output_dir="./debug"
    )
```

- **Validate Examples**

```python
# Check examples match expected format
for example in examples:
    for extraction in example.extractions:
        # Verify text is in example text
        assert extraction.extraction_text in example.text
        print(f"✓ {extraction.extraction_class}: {extraction.extraction_text}")
```

- **Test with Simple Input First**

```python
# Start with a minimal test
test_result = lx.extract(
    text_or_documents="Patient on Aspirin 81mg daily.",
    prompt_description="Extract medications.",
    examples=[simple_example],
    model_id="gemini-2.0-flash-exp"
)
print(f"Extractions: {len(test_result.extractions)}")
```
## Advanced Topics

### Custom Extraction Schemas

Define complex nested structures:

```python
examples = [
    lx.data.ExampleData(
        text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
        extractions=[
            lx.data.Extraction(
                extraction_class="clinical_event",
                extraction_text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
                attributes={
                    "symptom": "chest pain",
                    "diagnostic_test": "ECG",
                    "finding": "ST elevation",
                    "diagnosis": "STEMI",
                    "severity": "severe",
                    "timeline": [
                        {"event": "symptom_onset", "description": "chest pain"},
                        {"event": "diagnostic", "description": "ECG shows ST elevation"},
                        {"event": "diagnosis", "description": "STEMI"}
                    ]
                }
            )
        ]
    )
]
```

### Batch Processing with Progress Tracking
```python
from tqdm import tqdm
import langextract as lx

documents = load_documents()  # List of documents
results = []

for i, doc in enumerate(tqdm(documents)):
    try:
        result = lx.extract(
            text_or_documents=doc,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.0-flash-exp"
        )
        results.append(result)

        # Save incrementally
        if (i + 1) % 100 == 0:
            lx.io.save_annotated_documents(
                results,
                output_name=f"batch_{i+1}.jsonl",
                output_dir="./batches"
            )
            results = []  # Clear for next batch
    except Exception as e:
        print(f"Failed on document {i}: {e}")
        continue
```

### Integration with Data Pipelines
```python
import langextract as lx
import pandas as pd

# Load data
df = pd.read_csv("clinical_notes.csv")

# Extract from each note
extractions_data = []
for idx, row in df.iterrows():
    result = lx.extract(
        text_or_documents=row['note_text'],
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )
    for extraction in result.extractions:
        extractions_data.append({
            'patient_id': row['patient_id'],
            'note_date': row['note_date'],
            'extraction_class': extraction.extraction_class,
            'extraction_text': extraction.extraction_text,
            **extraction.attributes
        })

# Create structured DataFrame
extractions_df = pd.DataFrame(extractions_data)
extractions_df.to_csv("structured_extractions.csv", index=False)
```

### Performance Benchmarking
```python
import time
import langextract as lx

def benchmark_extraction(documents, model_id, passes=1):
    start = time.time()
    results = lx.extract(
        text_or_documents=documents,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
        extraction_passes=passes,
        max_workers=20
    )
    elapsed = time.time() - start
    total_extractions = sum(len(r.extractions) for r in results)
    print(f"Model: {model_id}")
    print(f"Passes: {passes}")
    print(f"Documents: {len(documents)}")
    print(f"Total extractions: {total_extractions}")
    print(f"Time: {elapsed:.2f}s")
    print(f"Throughput: {len(documents)/elapsed:.2f} docs/sec")
    print()

# Compare models
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=3)
benchmark_extraction(docs, "gpt-4o", passes=1)
```

## Examples
### Example Projects

The repository includes several example implementations:

- **Custom Provider Plugin** (`examples/custom_provider_plugin/`)
  - How to create custom extraction backends
  - Integration with proprietary models

- **Jupyter Notebooks** (`examples/notebooks/`)
  - Interactive extraction workflows
  - Visualization and analysis

- **Ollama Integration** (`examples/ollama/`)
  - Local model usage
  - Privacy-preserving extraction

### Medical Use Case

See `examples/clinical_extraction.py` for a complete medical extraction pipeline.

### Literary Analysis

See `examples/literary_extraction.py` for character and relationship extraction from novels.

## Testing
### Running Tests

```bash
# Install test dependencies
pip install -e ".[test]"

# Run all tests
pytest tests

# Run with coverage
pytest tests --cov=langextract

# Run specific test
pytest tests/test_extraction.py

# Run integration tests
pytest tests/integration/
```

### Integration Testing with Ollama

```bash
# Install tox
pip install tox

# Run Ollama integration tests
tox -e ollama-integration
```

### Writing Tests
```python
import langextract as lx

def test_basic_extraction():
    prompt = "Extract names."
    examples = [
        lx.data.ExampleData(
            text="John Smith visited the clinic.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="name",
                    extraction_text="John Smith"
                )
            ]
        )
    ]
    result = lx.extract(
        text_or_documents="Mary Johnson was the doctor.",
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )
    assert len(result.extractions) >= 1
    assert result.extractions[0].extraction_class == "name"
```

## Resources
### Official Documentation

- GitHub Repository: https://github.com/google/langextract
- Examples Directory: https://github.com/google/langextract/tree/main/examples
- Documentation: https://github.com/google/langextract/tree/main/docs/examples

### Model Documentation

- Gemini API: https://ai.google.dev/
- Vertex AI: https://cloud.google.com/vertex-ai
- OpenAI API: https://platform.openai.com/
- Ollama: https://ollama.ai/

### Related Tools

- Google AI Studio: Web interface for Gemini models
- Vertex AI Workbench: Enterprise AI development
- LangChain: LLM application framework
- Instructor: Structured outputs library

### Use Case Examples

- Clinical information extraction
- Legal document analysis
- Scientific literature mining
- Customer feedback structuring
- Contract entity extraction
## Contributing

Contributions welcome! See the official repository for guidelines:
https://github.com/google/langextract

### Development Setup

```bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
pre-commit install
```

### Running CI Locally

```bash
# Full test matrix
tox

# Specific Python version
tox -e py310

# Code formatting
black langextract/
isort langextract/

# Linting
flake8 langextract/
mypy langextract/
```

## Version Information
- **Last Updated**: 2025-12-25
- **Skill Version**: 1.0.0
- **LangExtract Version**: Latest (check PyPI)

This skill provides comprehensive guidance for LangExtract based on official documentation and examples. For the latest updates, refer to the GitHub repository.