# LangExtract - Structured Information Extraction

Expert assistance for extracting structured, source-grounded information from unstructured text using large language models.

## When to Use This Skill

Use this skill when you need to:

- Extract structured entities from unstructured text (medical notes, reports, documents)
- Maintain precise source grounding (map extracted data to original text locations)
- Process long documents beyond LLM token limits
- Visualize extraction results with interactive HTML highlighting
- Extract clinical information from medical records
- Structure radiology or pathology reports
- Extract medications, diagnoses, or symptoms from clinical notes
- Analyze literary texts for characters, emotions, and relationships
- Build domain-specific extraction pipelines
- Work with Gemini, OpenAI, or local models (Ollama)
- Generate schema-compliant outputs without fine-tuning

## Overview

LangExtract is a Python library by Google for extracting structured information from unstructured text using large language models. It emphasizes:

- **Source Grounding**: Every extraction maps to its exact location in the source text
- **Structured Outputs**: Schema-compliant results with controlled generation
- **Long Document Processing**: Intelligent chunking and multi-pass extraction
- **Interactive Visualization**: Self-contained HTML for reviewing extractions in context
- **Flexible LLM Support**: Works with Gemini, OpenAI, and local models
- **Few-Shot Learning**: Requires only quality examples, no expensive fine-tuning

**Key Resources:**

## Installation

### Prerequisites

- Python 3.8 or higher
- An API key for Gemini (AI Studio) or OpenAI, or a local Ollama setup

### Basic Installation

```bash
# Install from PyPI (recommended)
pip install langextract

# Install with OpenAI support
pip install langextract[openai]

# Install with development tools
pip install langextract[dev]
```

### Install from Source

```bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e .

# For development with testing
pip install -e ".[test]"
```

### Docker Installation

```bash
# Build the Docker image
docker build -t langextract .

# Run with an API key
docker run --rm \
  -e LANGEXTRACT_API_KEY="your-api-key" \
  langextract python your_script.py
```

### API Key Setup

**Gemini (Google AI Studio):**

```bash
export LANGEXTRACT_API_KEY="your-gemini-api-key"
```

Get keys from: https://ai.google.dev/

**OpenAI:**

```bash
export OPENAI_API_KEY="your-openai-api-key"
```

**Vertex AI (Enterprise):**

- Use service account authentication
- Set the project in `language_model_params`

**.env File (Development):**

```bash
# Create a .env file
echo "LANGEXTRACT_API_KEY=your-key-here" > .env
```
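The `.env` file can then be loaded at process startup. The `python-dotenv` package is the usual choice for this; purely as an illustration, a minimal dependency-free loader (`load_env` is a hypothetical helper written here, not a langextract API) might look like:

```python
import os

def load_env(path=".env"):
    """Read KEY=VALUE lines into os.environ, skipping blanks and comments.

    Hypothetical helper for illustration; python-dotenv's load_dotenv()
    provides the same behavior with more edge-case handling.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: variables already exported in the shell win
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env()`, langextract picks up `LANGEXTRACT_API_KEY` from the environment as usual.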

## Quick Start

### Basic Extraction Example

```python
import langextract as lx
import textwrap

# 1. Define the extraction task
prompt = textwrap.dedent("""
    Extract all medications mentioned in the clinical note.
    Include medication name, dosage, and frequency.
    Use exact text from the document.""")

# 2. Provide examples (few-shot learning)
examples = [
    lx.data.ExampleData(
        text="Patient prescribed Lisinopril 10mg daily for hypertension.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Lisinopril 10mg daily",
                attributes={
                    "name": "Lisinopril",
                    "dosage": "10mg",
                    "frequency": "daily",
                    "indication": "hypertension"
                }
            )
        ]
    )
]

# 3. Input text to extract from
input_text = """
Patient continues on Metformin 500mg twice daily for diabetes management.
Started on Amlodipine 5mg once daily for blood pressure control.
Discontinued Aspirin 81mg due to side effects.
"""

# 4. Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp"
)

# 5. Access the results
for extraction in result.extractions:
    print(f"Medication: {extraction.extraction_text}")
    print(f"  Name: {extraction.attributes.get('name')}")
    print(f"  Dosage: {extraction.attributes.get('dosage')}")
    print(f"  Frequency: {extraction.attributes.get('frequency')}")
    print(f"  Location: {extraction.start_char}-{extraction.end_char}")
    print()

# 6. Save and visualize
lx.io.save_annotated_documents(
    [result],
    output_name="medications.jsonl",
    output_dir="."
)
html_content = lx.visualize("medications.jsonl")
with open("medications.html", "w") as f:
    f.write(html_content)
```

### Literary Text Example

```python
import langextract as lx

prompt = """Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities."""

examples = [
    lx.data.ExampleData(
        text="ROMEO entered the garden, filled with wonder at JULIET's beauty.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="JULIET",
                attributes={}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="ROMEO ... JULIET's beauty",
                attributes={
                    "subject": "ROMEO",
                    "relation": "admires",
                    "object": "JULIET"
                }
            )
        ]
    )
]

text = """Act 2, Scene 2: The Capulet's orchard.
ROMEO appears beneath JULIET's balcony, gazing upward with longing.
JULIET steps onto the balcony, unaware of ROMEO's presence below."""

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp"
)
```

## Core Concepts

### 1. Extraction Classes

Define the categories of entities to extract:

```python
# A single class
extraction_class="medication"

# Multiple classes, defined via examples
examples = [
    lx.data.ExampleData(
        text="...",
        extractions=[
            lx.data.Extraction(extraction_class="diagnosis", ...),
            lx.data.Extraction(extraction_class="symptom", ...),
            lx.data.Extraction(extraction_class="medication", ...)
        ]
    )
]
```

### 2. Source Grounding

Every extraction includes its precise location in the text:

```python
extraction = result.extractions[0]
print(f"Text: {extraction.extraction_text}")
print(f"Start: {extraction.start_char}")
print(f"End: {extraction.end_char}")

# Recover the span from the original document
original_text = input_text[extraction.start_char:extraction.end_char]
```

### 3. Attributes

Add structured metadata to extractions:

```python
lx.data.Extraction(
    extraction_class="medication",
    extraction_text="Lisinopril 10mg daily",
    attributes={
        "name": "Lisinopril",
        "dosage": "10mg",
        "frequency": "daily",
        "route": "oral",
        "indication": "hypertension"
    }
)
```

### 4. Few-Shot Learning

Provide 1-5 quality examples instead of fine-tuning:

```python
# Minimal examples (1-2) for simple tasks
examples = [example1]

# More examples (3-5) for complex schemas
examples = [example1, example2, example3, example4, example5]
```

### 5. Long Document Processing

Documents beyond token limits are chunked automatically:

```python
result = lx.extract(
    text_or_documents=long_document,  # any length
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=3,   # multiple passes for better recall
    max_workers=20,        # parallel processing
    max_char_buffer=1000   # chunk overlap for continuity
)
```

## Configuration

### Model Selection

```python
# Gemini models (recommended)
model_id="gemini-2.0-flash-exp"           # fast, cost-effective
model_id="gemini-2.0-flash-thinking-exp"  # complex reasoning
model_id="gemini-1.5-pro"                 # legacy

# OpenAI models
model_id="gpt-4o"       # GPT-4 Optimized
model_id="gpt-4o-mini"  # smaller, faster

# Local models via Ollama
model_id="gemma2:2b"    # local inference
model_url="http://localhost:11434"
```

### Scaling Parameters

```python
result = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,

    # Multi-pass extraction for better recall
    extraction_passes=3,

    # Parallel processing
    max_workers=20,

    # Chunk size tuning
    max_char_buffer=1000,

    # Model configuration
    model_id="gemini-2.0-flash-exp"
)
```

### Backend Configuration

**Vertex AI:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    language_model_params={
        "vertexai": True,
        "project": "your-gcp-project-id",
        "location": "us-central1"
    }
)
```

**Batch Processing:**

```python
language_model_params={
    "batch": {
        "enabled": True
    }
}
```

**OpenAI Configuration:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    fence_output=True,            # required for OpenAI
    use_schema_constraints=False  # disable Gemini-specific features
)
```

**Local Ollama:**

```python
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",
    model_url="http://localhost:11434",
    use_schema_constraints=False
)
```

### Environment Variables

```bash
# API keys
LANGEXTRACT_API_KEY="gemini-api-key"
OPENAI_API_KEY="openai-api-key"

# Vertex AI
GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Model configuration
LANGEXTRACT_MODEL_ID="gemini-2.0-flash-exp"
LANGEXTRACT_MODEL_URL="http://localhost:11434"
```
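A small helper can centralize how these variables are read in your own pipeline code. This is a sketch under assumptions: `resolve_model_config` is not a langextract function (the library reads `LANGEXTRACT_API_KEY` itself), but explicit resolution is handy when you pass `model_id`, `model_url`, or `api_key` to `lx.extract()` yourself.

```python
import os

def resolve_model_config(env=os.environ):
    """Pick model settings from the environment, with sensible defaults.

    Hypothetical helper for illustration, not a langextract API.
    """
    return {
        # Fall back to the recommended fast model when unset
        "model_id": env.get("LANGEXTRACT_MODEL_ID", "gemini-2.0-flash-exp"),
        # None means "use the provider's default endpoint"
        "model_url": env.get("LANGEXTRACT_MODEL_URL"),
        # Prefer env vars over hardcoding; Gemini key wins over OpenAI here
        "api_key": env.get("LANGEXTRACT_API_KEY") or env.get("OPENAI_API_KEY"),
    }
```

The resulting dict can be splatted into the call, e.g. `lx.extract(..., **resolve_model_config())`.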

## Common Patterns

### Pattern 1: Clinical Note Extraction

```python
import langextract as lx

prompt = """Extract diagnoses, symptoms, and medications from clinical notes.
Include ICD-10 codes when available. Use exact medical terminology."""

examples = [
    lx.data.ExampleData(
        text="Patient presents with Type 2 Diabetes Mellitus (E11.9). Started on Metformin 500mg BID. Reports fatigue and increased thirst.",
        extractions=[
            lx.data.Extraction(
                extraction_class="diagnosis",
                extraction_text="Type 2 Diabetes Mellitus (E11.9)",
                attributes={"condition": "Type 2 Diabetes Mellitus", "icd10": "E11.9"}
            ),
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Metformin 500mg BID",
                attributes={"name": "Metformin", "dosage": "500mg", "frequency": "BID"}
            ),
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="fatigue",
                attributes={"symptom": "fatigue"}
            ),
            lx.data.Extraction(
                extraction_class="symptom",
                extraction_text="increased thirst",
                attributes={"symptom": "polydipsia"}
            )
        ]
    )
]

# Process multiple clinical notes
clinical_notes = [
    "Note 1: Patient presents with...",
    "Note 2: Follow-up visit for...",
    "Note 3: New onset chest pain..."
]

results = lx.extract(
    text_or_documents=clinical_notes,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=2,
    max_workers=10
)

# Save the structured output
lx.io.save_annotated_documents(
    results,
    output_name="clinical_extractions.jsonl",
    output_dir="./output"
)
```

### Pattern 2: Radiology Report Structuring

```python
prompt = """Extract findings, impressions, and recommendations from radiology reports.
Include anatomical location, abnormality type, and severity."""

examples = [
    lx.data.ExampleData(
        text="FINDINGS: 3.2cm mass in right upper lobe. IMPRESSION: Suspicious for malignancy. RECOMMENDATION: Biopsy recommended.",
        extractions=[
            lx.data.Extraction(
                extraction_class="finding",
                extraction_text="3.2cm mass in right upper lobe",
                attributes={
                    "location": "right upper lobe",
                    "type": "mass",
                    "size": "3.2cm"
                }
            ),
            lx.data.Extraction(
                extraction_class="impression",
                extraction_text="Suspicious for malignancy",
                attributes={"diagnosis": "possible malignancy", "certainty": "suspicious"}
            ),
            lx.data.Extraction(
                extraction_class="recommendation",
                extraction_text="Biopsy recommended",
                attributes={"action": "biopsy"}
            )
        ]
    )
]
```

### Pattern 3: Multi-Document Processing

```python
import langextract as lx
from pathlib import Path

# Load multiple documents
documents = []
for file_path in Path("./documents").glob("*.txt"):
    with open(file_path, "r") as f:
        documents.append(f.read())

# Extract from all documents
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash-exp",
    extraction_passes=3,
    max_workers=20
)

# `results` is a list of AnnotatedDocument objects
for i, result in enumerate(results):
    print(f"\nDocument {i+1}: {len(result.extractions)} extractions")
    for extraction in result.extractions:
        print(f"  - {extraction.extraction_class}: {extraction.extraction_text}")
```

### Pattern 4: Interactive Visualization

```python
# Generate interactive HTML
html_content = lx.visualize("extractions.jsonl")

# Save to a file
with open("interactive_results.html", "w") as f:
    f.write(html_content)

# Open in a browser (optional)
import webbrowser
webbrowser.open("interactive_results.html")
```

### Pattern 5: Custom Provider Plugin

See `examples/custom_provider_plugin/` for the full implementation.

```python
from langextract.providers import ProviderPlugin

class CustomProvider(ProviderPlugin):
    def extract(self, text, prompt, examples, **kwargs):
        # Custom extraction logic
        return extractions

    def supports_schema_constraints(self):
        return False

# Register the custom provider
lx.register_provider("custom", CustomProvider())

# Use the custom provider
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="custom",
    provider="custom"
)
```

## API Reference

### Core Functions

#### lx.extract()

The main extraction function.

```python
result = lx.extract(
    text_or_documents,           # str or list of str
    prompt_description,          # str: extraction instructions
    examples,                    # list of ExampleData
    model_id="gemini-2.0-flash-exp",  # str: model identifier
    extraction_passes=1,         # int: number of passes
    max_workers=None,            # int: parallel workers
    max_char_buffer=1000,        # int: chunk overlap
    language_model_params=None,  # dict: model config
    fence_output=False,          # bool: required for OpenAI
    use_schema_constraints=True, # bool: use schema enforcement
    model_url=None,              # str: custom model endpoint
    api_key=None                 # str: API key (prefer env var)
)
```

Returns: `AnnotatedDocument` or `list[AnnotatedDocument]`.
#### lx.visualize()

Generate an interactive HTML visualization.

```python
html_content = lx.visualize(
    jsonl_file_path,            # str: path to JSONL file
    title="Extraction Results", # str: HTML page title
    show_attributes=True        # bool: display attributes
)
```

Returns: `str` (HTML content).

#### lx.io.save_annotated_documents()

Save results in JSONL format.

```python
lx.io.save_annotated_documents(
    annotated_documents,  # list of AnnotatedDocument
    output_name,          # str: filename (e.g., "results.jsonl")
    output_dir="."        # str: output directory
)
```

### Data Classes

#### ExampleData

A few-shot example definition.

```python
example = lx.data.ExampleData(
    text="Example text here",
    extractions=[
        lx.data.Extraction(...)
    ]
)
```

#### Extraction

A single extraction definition.

```python
extraction = lx.data.Extraction(
    extraction_class="medication",   # str: entity type
    extraction_text="Aspirin 81mg",  # str: exact text
    attributes={                     # dict: metadata
        "name": "Aspirin",
        "dosage": "81mg"
    },
    start_char=0,                    # int: start position (auto-set)
    end_char=13                      # int: end position (auto-set)
)
```

#### AnnotatedDocument

The extraction results for a document.

```python
result.text         # str: original text
result.extractions  # list of Extraction
result.metadata     # dict: additional info
```
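For downstream processing it is often convenient to flatten these objects into plain, JSON-serializable rows. A sketch that relies only on the attributes documented above (the helper name is ours, not a library API):

```python
def extractions_to_dicts(doc):
    """Flatten an AnnotatedDocument into JSON-serializable rows.

    Hypothetical helper; uses only the documented fields
    (extraction_class, extraction_text, start_char, end_char, attributes).
    """
    return [
        {
            "class": e.extraction_class,
            "text": e.extraction_text,
            "span": [e.start_char, e.end_char],
            # Merge structured attributes into the row
            **(e.attributes or {}),
        }
        for e in doc.extractions
    ]
```

The rows can then be written with `json.dump` or loaded into a DataFrame for analysis.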

## Best Practices

### Extraction Design

1. **Write Clear Prompts**: Be specific about what to extract and how.

    ```python
    # Good
    prompt = "Extract medications with dosage, frequency, and route of administration. Use exact medical terminology."

    # Avoid
    prompt = "Extract medications."
    ```

2. **Provide Quality Examples**: 1-5 well-crafted examples beat many poor ones.

    ```python
    # Include edge cases in the examples
    examples = [
        normal_case_example,
        edge_case_example,
        complex_case_example
    ]
    ```

3. **Use Exact Text**: Extract verbatim from the source for accurate grounding.

    ```python
    # Good
    extraction_text="Lisinopril 10mg daily"

    # Avoid paraphrasing
    extraction_text="10mg lisinopril taken once per day"
    ```

4. **Define Attributes Clearly**: Structure metadata consistently.

    ```python
    attributes={
        "name": "Lisinopril",   # drug name
        "dosage": "10mg",       # amount
        "frequency": "daily",   # how often
        "route": "oral"         # how taken
    }
    ```

Performance Optimization

性能优化

  1. Multi-Pass for Long Documents: Improves recall
    python
    extraction_passes=3  # 2-3 passes recommended for thorough extraction
  2. Parallel Processing: Speed up batch operations
    python
    max_workers=20  # Adjust based on API rate limits
  3. Chunk Size Tuning: Balance accuracy and context
    python
    max_char_buffer=1000  # Larger for context, smaller for speed
  4. Model Selection: Choose based on task complexity
    python
    # Simple extraction
    model_id="gemini-2.0-flash-exp"
    
    # Complex reasoning
    model_id="gemini-2.0-flash-thinking-exp"
  1. 长文档使用多轮提取:提升召回率
    python
    extraction_passes=3  # 推荐2-3轮提取以保证全面性
  2. 并行处理:加速批量操作
    python
    max_workers=20  # 根据API速率限制调整
  3. 分块大小调整:平衡准确性与上下文
    python
    max_char_buffer=1000  # 更大的值保留更多上下文,更小的值提升速度
  4. 模型选择:根据任务复杂度选择
    python
    # 简单提取任务
    model_id="gemini-2.0-flash-exp"
    
    # 复杂推理任务
    model_id="gemini-2.0-flash-thinking-exp"

Production Deployment

生产部署

  1. API Key Security: Never hardcode keys
    python
    # Good: Use environment variables
    import os
    api_key = os.getenv("LANGEXTRACT_API_KEY")
    
    # Avoid: Hardcoding
    api_key = "AIza..."  # Never do this
  2. Error Handling: Handle API failures gracefully
    python
    try:
        result = lx.extract(...)
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        # Implement retry logic or fallback
  3. Cost Management: Monitor API usage
    python
    # Use cheaper models for bulk processing
    model_id="gemini-2.0-flash-exp"  # vs "gemini-1.5-pro"
    
    # Batch processing for cost efficiency
    language_model_params={"batch": {"enabled": True}}
  4. Validation: Verify extraction quality
    python
    for extraction in result.extractions:
        # Validate extraction is within document bounds
        assert 0 <= extraction.start_char < len(result.text)
        assert extraction.end_char <= len(result.text)
    
        # Verify text matches
        extracted = result.text[extraction.start_char:extraction.end_char]
        assert extracted == extraction.extraction_text
  1. API密钥安全:切勿硬编码密钥
    python
    # 良好实践:使用环境变量
    import os
    api_key = os.getenv("LANGEXTRACT_API_KEY")
    
    # 避免:硬编码
    api_key = "AIza..."  # 绝对不要这样做
  2. 错误处理:优雅处理API失败
    python
    try:
        result = lx.extract(...)
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        # 实现重试逻辑或备选方案
  3. 成本管理:监控API使用情况
    python
    # 批量处理使用更经济的模型
    model_id="gemini-2.0-flash-exp"  # 对比"gemini-1.5-pro"
    
    # 批量处理提升成本效率
    language_model_params={"batch": {"enabled": True}}
  4. 验证:验证提取质量
    python
    for extraction in result.extractions:
        # 验证提取位置在文档范围内
        assert 0 <= extraction.start_char < len(result.text)
        assert extraction.end_char <= len(result.text)
    
        # 验证文本匹配
        extracted = result.text[extraction.start_char:extraction.end_char]
        assert extracted == extraction.extraction_text
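The retry logic mentioned under error handling can be sketched as a small exponential-backoff wrapper. The names `extract_with_retry` and `flaky` are illustrative, not LangExtract API:

```python
import time

def extract_with_retry(extract_fn, max_attempts=3, base_delay=1.0):
    """Call extract_fn; on failure, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return extract_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate to the caller
            time.sleep(base_delay * 2 ** attempt)

# Demo with a stand-in that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

result = extract_with_retry(flaky, base_delay=0)  # zero delay for the demo
assert result == "ok" and calls["n"] == 3
```

In production, `extract_fn` would wrap the real call (e.g. `lambda: lx.extract(...)`), and you would catch the provider's specific transient exceptions rather than bare `Exception`.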

Common Pitfalls

常见陷阱

  1. Overlapping Extractions
    • Issue: Extractions overlap or duplicate
    • Solution: Specify in prompt "Do not overlap entities"
  2. Paraphrasing Instead of Exact Text
    • Issue: Extracted text doesn't match original
    • Solution: Prompt "Use exact text from document. Do not paraphrase."
  3. Insufficient Examples
    • Issue: Poor extraction quality
    • Solution: Provide 3-5 diverse examples covering edge cases
  4. Model Limitations
    • Issue: Schema constraints not supported on all models
    • Solution: Set use_schema_constraints=False for OpenAI/Ollama
  1. 重叠提取
    • 问题:提取结果重叠或重复
    • 解决方案:在提示词中明确说明“不要重叠实体”
  2. 意译而非使用原文
    • 问题:提取文本与原文不匹配
    • 解决方案:在提示词中要求“使用文档中的原文,不要意译”
  3. 示例不足
    • 问题:提取质量差
    • 解决方案:提供3-5个涵盖边缘情况的多样化示例
  4. 模型限制
    • 问题:部分模型不支持schema约束
    • 解决方案:对OpenAI/Ollama设置 use_schema_constraints=False
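Pitfall 1 can also be caught after the fact. The sketch below (our own helper, not part of LangExtract) flags overlapping character spans; because any overlap among sorted spans always includes a consecutive pair, checking neighbors is enough to detect that an overlap exists:

```python
def find_overlaps(spans):
    """Return overlapping consecutive (start, end) pairs after sorting by start."""
    spans = sorted(spans)
    return [
        (a, b)
        for a, b in zip(spans, spans[1:])
        if b[0] < a[1]  # next span starts before the current one ends
    ]

assert find_overlaps([(0, 5), (5, 10)]) == []                 # touching is fine
assert find_overlaps([(0, 5), (3, 8)]) == [((0, 5), (3, 8))]  # genuine overlap
```

Feed it the `(start_char, end_char)` pairs from a result's extractions; a non-empty return means the prompt should be tightened with an explicit "Do not overlap entities" instruction.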

Troubleshooting

故障排除

Common Issues

常见问题

Issue 1: API Authentication Failed

问题1:API认证失败

Symptoms:
  • AuthenticationError: Invalid API key
  • Permission denied errors
Solution:
bash
症状:
  • AuthenticationError: Invalid API key
  • Permission denied 错误
解决方案:
bash

Verify API key is set

验证API密钥是否已设置

echo $LANGEXTRACT_API_KEY
echo $LANGEXTRACT_API_KEY

Set API key

设置API密钥

export LANGEXTRACT_API_KEY="your-key-here"
export LANGEXTRACT_API_KEY="your-key-here"

For OpenAI

针对OpenAI

export OPENAI_API_KEY="your-openai-key"
export OPENAI_API_KEY="your-openai-key"

Verify key works

验证密钥是否生效

python -c "import os; print(os.getenv('LANGEXTRACT_API_KEY'))"
python -c "import os; print(os.getenv('LANGEXTRACT_API_KEY'))"

Issue 2: Schema Constraints Error

问题2:Schema约束错误

Symptoms:
  • Schema constraints not supported error
  • Malformed output with OpenAI or Ollama
Solution:
python
症状:
  • Schema constraints not supported 错误
  • 使用OpenAI或Ollama时输出格式错误
解决方案:
python

Disable schema constraints for non-Gemini models

对非Gemini模型禁用schema约束

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    use_schema_constraints=False,  # Disable for OpenAI
    fence_output=True              # Enable for OpenAI
)
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gpt-4o",
    use_schema_constraints=False,  # 对OpenAI禁用
    fence_output=True              # 对OpenAI启用
)

Issue 3: Token Limit Exceeded

问题3:令牌限制超出

Symptoms:
  • Token limit exceeded error
  • Truncated results
Solution:
python
症状:
  • Token limit exceeded 错误
  • 结果被截断
解决方案:
python

Use multi-pass extraction

使用多轮提取

result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    extraction_passes=3,   # Multiple passes
    max_char_buffer=1000,  # Adjust chunk size
    max_workers=10         # Parallel processing
)
result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    extraction_passes=3,   # 多轮提取
    max_char_buffer=1000,  # 调整分块大小
    max_workers=10         # 并行处理
)

Issue 4: Poor Extraction Quality

问题4:提取质量差

Symptoms:
  • Missing entities
  • Incorrect extractions
  • Paraphrased text
Solution:
python
症状:
  • 遗漏实体
  • 提取结果错误
  • 文本被意译
解决方案:
python

Improve prompt specificity

提升提示词的具体性

prompt = """Extract medications with exact dosage and frequency. Use exact text from document. Do not paraphrase. Include generic and brand names. Extract discontinued medications as well."""
prompt = """Extract medications with exact dosage and frequency. Use exact text from document. Do not paraphrase. Include generic and brand names. Extract discontinued medications as well."""

Add more diverse examples

添加更多多样化示例

examples = [
    normal_case,
    edge_case_1,
    edge_case_2,
    complex_case
]
examples = [
    normal_case,
    edge_case_1,
    edge_case_2,
    complex_case
]

Increase extraction passes

增加提取轮次

extraction_passes=3
extraction_passes=3

Try more capable model

尝试更强大的模型

model_id="gemini-2.0-flash-thinking-exp"
model_id="gemini-2.0-flash-thinking-exp"

Issue 5: Ollama Connection Failed

问题5:Ollama连接失败

Symptoms:
  • Connection refused to localhost:11434
  • Ollama model not found
Solution:
bash
症状:
  • 连接localhost:11434被拒绝
  • Ollama模型未找到
解决方案:
bash

Start Ollama server

启动Ollama服务器

ollama serve
ollama serve

Pull required model

拉取所需模型

ollama pull gemma2:2b
ollama pull gemma2:2b

Verify Ollama is running

验证Ollama是否运行

curl http://localhost:11434
curl http://localhost:11434

Use in langextract

在langextract中使用

python -c " import langextract as lx result = lx.extract( text_or_documents='test', prompt_description='Extract entities', examples=[], model_id='gemma2:2b', model_url='http://localhost:11434', use_schema_constraints=False ) "
undefined
python -c " import langextract as lx result = lx.extract( text_or_documents='test', prompt_description='Extract entities', examples=[], model_id='gemma2:2b', model_url='http://localhost:11434', use_schema_constraints=False ) "
undefined

Debugging Tips

调试技巧

  1. Enable Verbose Logging
    python
    import logging
    logging.basicConfig(level=logging.DEBUG)
  2. Inspect Intermediate Results
    python
    # Save each pass separately
    for i, result in enumerate(results):
        lx.io.save_annotated_documents(
            [result],
            output_name=f"pass_{i}.jsonl",
            output_dir="./debug"
        )
  3. Validate Examples
    python
    # Check examples match expected format
    for example in examples:
        for extraction in example.extractions:
            # Verify text is in example text
            assert extraction.extraction_text in example.text
            print(f"✓ {extraction.extraction_class}: {extraction.extraction_text}")
  4. Test with Simple Input First
    python
    # Start with minimal test
    test_result = lx.extract(
        text_or_documents="Patient on Aspirin 81mg daily.",
        prompt_description="Extract medications.",
        examples=[simple_example],
        model_id="gemini-2.0-flash-exp"
    )
    print(f"Extractions: {len(test_result.extractions)}")
  1. 启用详细日志
    python
    import logging
    logging.basicConfig(level=logging.DEBUG)
  2. 检查中间结果
    python
    # 单独保存每一轮的结果
    for i, result in enumerate(results):
        lx.io.save_annotated_documents(
            [result],
            output_name=f"pass_{i}.jsonl",
            output_dir="./debug"
        )
  3. 验证示例
    python
    # 检查示例是否符合预期格式
    for example in examples:
        for extraction in example.extractions:
            # 验证文本存在于示例文本中
            assert extraction.extraction_text in example.text
            print(f"✓ {extraction.extraction_class}: {extraction.extraction_text}")
  4. 先使用简单输入测试
    python
    # 从最小测试开始
    test_result = lx.extract(
        text_or_documents="Patient on Aspirin 81mg daily.",
        prompt_description="Extract medications.",
        examples=[simple_example],
        model_id="gemini-2.0-flash-exp"
    )
    print(f"Extractions: {len(test_result.extractions)}")

Advanced Topics

高级主题

Custom Extraction Schemas

自定义提取Schema

Define complex nested structures:
python
examples = [
    lx.data.ExampleData(
        text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
        extractions=[
            lx.data.Extraction(
                extraction_class="clinical_event",
                extraction_text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
                attributes={
                    "symptom": "chest pain",
                    "diagnostic_test": "ECG",
                    "finding": "ST elevation",
                    "diagnosis": "STEMI",
                    "severity": "severe",
                    "timeline": [
                        {"event": "symptom_onset", "description": "chest pain"},
                        {"event": "diagnostic", "description": "ECG shows ST elevation"},
                        {"event": "diagnosis", "description": "STEMI"}
                    ]
                }
            )
        ]
    )
]
定义复杂的嵌套结构:
python
examples = [
    lx.data.ExampleData(
        text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
        extractions=[
            lx.data.Extraction(
                extraction_class="clinical_event",
                extraction_text="Patient presents with chest pain. ECG shows ST elevation. Diagnosed with STEMI.",
                attributes={
                    "symptom": "chest pain",
                    "diagnostic_test": "ECG",
                    "finding": "ST elevation",
                    "diagnosis": "STEMI",
                    "severity": "severe",
                    "timeline": [
                        {"event": "symptom_onset", "description": "chest pain"},
                        {"event": "diagnostic", "description": "ECG shows ST elevation"},
                        {"event": "diagnosis", "description": "STEMI"}
                    ]
                }
            )
        ]
    )
]

Batch Processing with Progress Tracking

带进度跟踪的批量处理

python
from tqdm import tqdm
import langextract as lx

documents = load_documents()  # List of documents
results = []

for i, doc in enumerate(tqdm(documents)):
    try:
        result = lx.extract(
            text_or_documents=doc,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.0-flash-exp"
        )
        results.append(result)

        # Save incrementally
        if (i + 1) % 100 == 0:
            lx.io.save_annotated_documents(
                results,
                output_name=f"batch_{i+1}.jsonl",
                output_dir="./batches"
            )
            results = []  # Clear for next batch
    except Exception as e:
        print(f"Failed on document {i}: {e}")
        continue
python
from tqdm import tqdm
import langextract as lx

documents = load_documents()  # 文档列表
results = []

for i, doc in enumerate(tqdm(documents)):
    try:
        result = lx.extract(
            text_or_documents=doc,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.0-flash-exp"
        )
        results.append(result)

        # 增量保存
        if (i + 1) % 100 == 0:
            lx.io.save_annotated_documents(
                results,
                output_name=f"batch_{i+1}.jsonl",
                output_dir="./batches"
            )
            results = []  # 清空以准备下一批
    except Exception as e:
        print(f"处理第{i}篇文档失败: {e}")
        continue

Integration with Data Pipelines

与数据流水线集成

python
import langextract as lx
import pandas as pd
python
import langextract as lx
import pandas as pd

Load data

加载数据

df = pd.read_csv("clinical_notes.csv")
df = pd.read_csv("clinical_notes.csv")

Extract from each note

从每篇记录中提取

extractions_data = []
for idx, row in df.iterrows():
    result = lx.extract(
        text_or_documents=row['note_text'],
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )
    for extraction in result.extractions:
        extractions_data.append({
            'patient_id': row['patient_id'],
            'note_date': row['note_date'],
            'extraction_class': extraction.extraction_class,
            'extraction_text': extraction.extraction_text,
            **extraction.attributes
        })
extractions_data = []
for idx, row in df.iterrows():
    result = lx.extract(
        text_or_documents=row['note_text'],
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )
    for extraction in result.extractions:
        extractions_data.append({
            'patient_id': row['patient_id'],
            'note_date': row['note_date'],
            'extraction_class': extraction.extraction_class,
            'extraction_text': extraction.extraction_text,
            **extraction.attributes
        })

Create structured DataFrame

创建结构化DataFrame

extractions_df = pd.DataFrame(extractions_data)
extractions_df.to_csv("structured_extractions.csv", index=False)
extractions_df = pd.DataFrame(extractions_data)
extractions_df.to_csv("structured_extractions.csv", index=False)

Performance Benchmarking

性能基准测试

python
import time
import langextract as lx

def benchmark_extraction(documents, model_id, passes=1):
    start = time.time()

    results = lx.extract(
        text_or_documents=documents,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
        extraction_passes=passes,
        max_workers=20
    )

    elapsed = time.time() - start
    total_extractions = sum(len(r.extractions) for r in results)

    print(f"Model: {model_id}")
    print(f"Passes: {passes}")
    print(f"Documents: {len(documents)}")
    print(f"Total extractions: {total_extractions}")
    print(f"Time: {elapsed:.2f}s")
    print(f"Throughput: {len(documents)/elapsed:.2f} docs/sec")
    print()
python
import time
import langextract as lx

def benchmark_extraction(documents, model_id, passes=1):
    start = time.time()

    results = lx.extract(
        text_or_documents=documents,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
        extraction_passes=passes,
        max_workers=20
    )

    elapsed = time.time() - start
    total_extractions = sum(len(r.extractions) for r in results)

    print(f"模型: {model_id}")
    print(f"提取轮次: {passes}")
    print(f"文档数量: {len(documents)}")
    print(f"总提取数: {total_extractions}")
    print(f"耗时: {elapsed:.2f}秒")
    print(f"吞吐量: {len(documents)/elapsed:.2f} 文档/秒")
    print()

Compare models

对比不同模型

benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=3)
benchmark_extraction(docs, "gpt-4o", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=1)
benchmark_extraction(docs, "gemini-2.0-flash-exp", passes=3)
benchmark_extraction(docs, "gpt-4o", passes=1)

Examples

示例

Example Projects

示例项目

The repository includes several example implementations:
  1. Custom Provider Plugin (examples/custom_provider_plugin/)
    • How to create custom extraction backends
    • Integration with proprietary models
  2. Jupyter Notebooks (examples/notebooks/)
    • Interactive extraction workflows
    • Visualization and analysis
  3. Ollama Integration (examples/ollama/)
    • Local model usage
    • Privacy-preserving extraction
仓库中包含多个示例实现:
  1. 自定义提供方插件(examples/custom_provider_plugin/)
    • 如何创建自定义提取后端
    • 与专有模型集成
  2. Jupyter笔记本(examples/notebooks/)
    • 交互式提取工作流
    • 可视化与分析
  3. Ollama集成(examples/ollama/)
    • 本地模型使用
    • 隐私保护的提取

Medical Use Case

医疗用例

See examples/clinical_extraction.py for a complete medical extraction pipeline.
查看 examples/clinical_extraction.py 获取完整的医疗提取流水线。

Literary Analysis

文学分析

See examples/literary_extraction.py for character and relationship extraction from novels.
查看 examples/literary_extraction.py 获取从小说中提取角色与关系的示例。

Testing

测试

Running Tests

运行测试

bash
bash

Install test dependencies

安装测试依赖

pip install -e ".[test]"
pip install -e ".[test]"

Run all tests

运行所有测试

pytest tests
pytest tests

Run with coverage

运行测试并查看覆盖率

pytest tests --cov=langextract
pytest tests --cov=langextract

Run specific test

运行特定测试

pytest tests/test_extraction.py
pytest tests/test_extraction.py

Run integration tests

运行集成测试

pytest tests/integration/
pytest tests/integration/

Integration Testing with Ollama

与Ollama的集成测试

bash
bash

Install tox

安装tox

pip install tox
pip install tox

Run Ollama integration tests

运行Ollama集成测试

tox -e ollama-integration
tox -e ollama-integration

Writing Tests

编写测试

python
import langextract as lx

def test_basic_extraction():
    prompt = "Extract names."
    examples = [
        lx.data.ExampleData(
            text="John Smith visited the clinic.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="name",
                    extraction_text="John Smith"
                )
            ]
        )
    ]

    result = lx.extract(
        text_or_documents="Mary Johnson was the doctor.",
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )

    assert len(result.extractions) >= 1
    assert result.extractions[0].extraction_class == "name"
python
import langextract as lx

def test_basic_extraction():
    prompt = "Extract names."
    examples = [
        lx.data.ExampleData(
            text="John Smith visited the clinic.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="name",
                    extraction_text="John Smith"
                )
            ]
        )
    ]

    result = lx.extract(
        text_or_documents="Mary Johnson was the doctor.",
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.0-flash-exp"
    )

    assert len(result.extractions) >= 1
    assert result.extractions[0].extraction_class == "name"

Resources

资源

Official Documentation

官方文档

Model Documentation

模型文档

Related Tools

相关工具

  • Google AI Studio: Web interface for Gemini models
  • Vertex AI Workbench: Enterprise AI development
  • LangChain: LLM application framework
  • Instructor: Structured outputs library
  • Google AI Studio:Gemini模型的Web界面
  • Vertex AI Workbench:企业级AI开发平台
  • LangChain:LLM应用框架
  • Instructor:结构化输出库

Use Case Examples

用例示例

  • Clinical information extraction
  • Legal document analysis
  • Scientific literature mining
  • Customer feedback structuring
  • Contract entity extraction
  • 临床信息提取
  • 法律文档分析
  • 科学文献挖掘
  • 客户反馈结构化
  • 合同实体提取

Contributing

贡献

Contributions welcome! See the official repository for guidelines: https://github.com/google/langextract
欢迎贡献代码!请查看官方仓库的贡献指南: https://github.com/google/langextract

Development Setup

开发环境设置

bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
pre-commit install
bash
git clone https://github.com/google/langextract.git
cd langextract
pip install -e ".[dev]"
pre-commit install

Running CI Locally

本地运行CI

bash
bash

Full test matrix

完整测试矩阵

tox
tox

Specific Python version

特定Python版本

tox -e py310
tox -e py310

Code formatting

代码格式化

black langextract/
isort langextract/
black langextract/
isort langextract/

Linting

代码检查

flake8 langextract/
mypy langextract/
flake8 langextract/
mypy langextract/

Version Information

版本信息

Last Updated: 2025-12-25
Skill Version: 1.0.0
LangExtract Version: Latest (check PyPI)

This skill provides comprehensive guidance for LangExtract based on official documentation and examples. For the latest updates, refer to the GitHub repository.
最后更新:2025-12-25
Skill版本:1.0.0
LangExtract版本:最新版(请查看PyPI)

本Skill基于官方文档与示例提供LangExtract的全面使用指南。如需最新更新,请参考GitHub仓库。