ai-parsing-data

Build an AI Data Parser

Guide the user through building AI that pulls structured data out of messy text. Uses DSPy extraction — define what you want, and the AI extracts it.

Step 1: Define what to extract

Ask the user:

What are you parsing? (emails, invoices, resumes, articles, forms, etc.)
What fields do you need? (names, dates, amounts, entities, etc.)
What's the output format? (flat fields, list of objects, nested structure)
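
The answers to these three questions can be written down before any DSPy code exists. The sketch below is one possible way to record them, using only the standard library. `FieldSpec` and `ExtractionSpec` are hypothetical helper names for illustration, not part of DSPy.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    """One field the user wants extracted (hypothetical helper, not a DSPy type)."""
    name: str
    type_hint: str
    description: str = ""


@dataclass
class ExtractionSpec:
    """Captures the three Step 1 answers in one place."""
    source_kind: str   # what is being parsed, e.g. "invoice"
    output_shape: str  # "flat", "list", or "nested"
    fields: list[FieldSpec] = field(default_factory=list)

    def summary(self) -> str:
        cols = ", ".join(f"{f.name}: {f.type_hint}" for f in self.fields)
        return f"{self.source_kind} -> {self.output_shape} ({cols})"


spec = ExtractionSpec(
    source_kind="invoice",
    output_shape="flat",
    fields=[
        FieldSpec("vendor", "str", "Company issuing the invoice"),
        FieldSpec("total", "float", "Total amount due"),
    ],
)
print(spec.summary())  # invoice -> flat (vendor: str, total: float)
```

Each `FieldSpec` then maps onto one output field of the signature built in Step 2, with `description` reused as the field's `desc`.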

Step 2: Build the parser

Simple field extraction

python

import dspy

class ParseContact(dspy.Signature):
    """Extract contact information from the text."""
    text: str = dspy.InputField(desc="Text containing contact information")
    name: str = dspy.OutputField(desc="Person's full name")
    email: str = dspy.OutputField(desc="Email address")
    phone: str = dspy.OutputField(desc="Phone number")

parser = dspy.ChainOfThought(ParseContact)

Structured output with Pydantic

For complex or nested output, use Pydantic models:

python

from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    skills: list[str]

class ParsePerson(dspy.Signature):
    """Extract person details from the text."""
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

parser = dspy.ChainOfThought(ParsePerson)
result = parser(text="John Doe, 32, lives at 123 Main St, Springfield IL 62701. Expert in Python and SQL.")
print(result.person)  # Person(name='John Doe', age=32, ...)

List extraction

Extract multiple items from text:

python

class Entity(BaseModel):
    name: str
    type: str = Field(description="Type: person, organization, location, or date")

class ParseEntities(dspy.Signature):
    """Extract all named entities from the text."""
    text: str = dspy.InputField()
    entities: list[Entity] = dspy.OutputField(desc="All entities found in the text")

parser = dspy.ChainOfThought(ParseEntities)
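
Extracted lists often contain repeats or types outside the requested set, so a small clean-up pass after extraction can help. The following is a minimal sketch under stated assumptions: the `Entity` dataclass is a stand-in for the Pydantic model above so the code runs on its own, and `group_entities` and `ALLOWED_TYPES` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four types named in the Entity field description above.
ALLOWED_TYPES = {"person", "organization", "location", "date"}


@dataclass(frozen=True)
class Entity:
    """Stand-in for the Pydantic Entity model so this sketch is self-contained."""
    name: str
    type: str


def group_entities(entities):
    """Group entity names by type, dropping unknown types and duplicates."""
    grouped = defaultdict(list)
    seen = set()
    for ent in entities:
        kind = ent.type.strip().lower()
        key = (ent.name.strip().lower(), kind)
        if kind not in ALLOWED_TYPES or key in seen:
            continue
        seen.add(key)
        grouped[kind].append(ent.name.strip())
    return dict(grouped)


found = [
    Entity("Ada Lovelace", "person"),
    Entity("ada lovelace", "Person"),  # duplicate with different casing
    Entity("London", "location"),
    Entity("Acme", "company"),         # type outside the allowed set
]
print(group_entities(found))
```

With the real parser, the input would be `result.entities` rather than a hand-written list.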

Step 3: Handle messy data

Use assertions to catch bad extractions:

python

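class ValidatedParser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.parse = dspy.ChainOfThought(ParseContact)

    def forward(self, text):
        result = self.parse(text=text)
        # dspy.Suggest is a soft check: a failure does not halt the program.
        # Note: dspy.Suggest and dspy.Assert belong to DSPy's assertions API;
        # newer DSPy releases replace them with dspy.Refine, so check your version.
        dspy.Suggest(
            "@" in result.email,
            "Email should contain @"
        )
        # Count digits only, so formatting such as "(555) 123-4567" still passes.
        dspy.Suggest(
            sum(ch.isdigit() for ch in result.phone) >= 10,
            "Phone number should have at least 10 digits"
        )
        return result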
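
The conditions passed to the checks above can be factored into plain functions, which makes them testable without calling a language model. This is a rough sketch using only the standard library. The function names and the email pattern are assumptions for illustration, and the pattern is deliberately loose rather than a full address validator.

```python
import re

# Loose shape check: something@something.something, with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value: str) -> bool:
    """Return True if the value has the rough shape of an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def digit_count(value: str) -> int:
    """Count digits, ignoring spaces, dashes, and parentheses."""
    return sum(ch.isdigit() for ch in value)


def validate_contact(email: str, phone: str, min_digits: int = 10) -> list[str]:
    """Return a list of human-readable problems; empty means the fields look fine."""
    problems = []
    if not looks_like_email(email):
        problems.append("Email should look like name@domain.tld")
    if digit_count(phone) < min_digits:
        problems.append(f"Phone number should have at least {min_digits} digits")
    return problems


print(validate_contact("jane@example.com", "(555) 123-4567"))  # []
print(validate_contact("jane at example", "555-1234"))
```

Each returned message can double as the feedback string given to the model when a check fails.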

Step 4: Test the quality

python

def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0

For Pydantic outputs, compare the model objects directly or field-by-field.
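
One way to do the field-by-field comparison for nested output is to flatten both sides into dotted paths and score the matching leaves. The sketch below works on plain dicts, which is what Pydantic v2 models return from `model_dump()`. The helper names `flatten` and `nested_accuracy` are hypothetical, and list items are compared by position, which is an assumption that may be too strict for unordered lists.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf_value} pairs."""
    if isinstance(value, dict):
        items = {}
        for key, sub in value.items():
            items.update(flatten(sub, f"{prefix}{key}."))
        return items
    if isinstance(value, list):
        items = {}
        for index, sub in enumerate(value):
            items.update(flatten(sub, f"{prefix}{index}."))
        return items
    return {prefix.rstrip("."): value}


def normalize(leaf):
    """Compare leaves as trimmed, lowercased strings."""
    return str(leaf).strip().lower()


def nested_accuracy(expected: dict, predicted: dict) -> float:
    """Fraction of expected leaf fields that the prediction got right."""
    exp = flatten(expected)
    pred = flatten(predicted)
    if not exp:
        return 0.0
    hits = sum(
        1 for path, val in exp.items()
        if path in pred and normalize(pred[path]) == normalize(val)
    )
    return hits / len(exp)


expected = {"name": "John Doe", "age": 32,
            "address": {"city": "Springfield", "state": "IL"},
            "skills": ["Python", "SQL"]}
predicted = {"name": "john doe", "age": 32,
             "address": {"city": "Springfield", "state": "MO"},
             "skills": ["Python"]}
score = nested_accuracy(expected, predicted)
print(round(score, 3))  # 0.667 (4 of 6 leaf fields match)
```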

Step 5: Improve accuracy

python

optimizer = dspy.BootstrapFewShot(metric=parsing_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(parser, trainset=trainset)
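
The `trainset` passed to `compile` is not defined in the snippet above. As a rough sketch, labeled records can be kept as plain dicts and split into a training portion and a held-out portion for scoring. `split_examples` and the sample records are hypothetical. In DSPy, each record would typically be wrapped as `dspy.Example(**record).with_inputs("text")` before use.

```python
import random


def split_examples(records, dev_fraction=0.2, seed=0):
    """Shuffle deterministically, then split into (train, dev) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    if len(shuffled) > 1:
        n_dev = max(1, int(len(shuffled) * dev_fraction))
    else:
        n_dev = 0  # too few records to hold any out
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical labeled data: input text plus the expected field values.
labeled = [
    {"text": f"Contact {i}: person{i}@example.com", "email": f"person{i}@example.com"}
    for i in range(10)
]
train_records, dev_records = split_examples(labeled)
print(len(train_records), len(dev_records))  # 8 2

# With DSPy installed, the training records would then become, for example:
# trainset = [dspy.Example(**r).with_inputs("text") for r in train_records]
```

Scoring the optimized parser on the held-out records, rather than on the training records, gives a fairer picture of whether optimization helped.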

Key patterns

Use Pydantic models for complex structured output: DSPy handles serialization
Use `list[Model]` to extract variable-length lists of items
`ChainOfThought` helps: reasoning through which text maps to which fields improves accuracy
Validate with assertions: `dspy.Suggest` and `dspy.Assert` catch malformed extractions
Partial credit metrics: score field-by-field rather than all-or-nothing

Additional resources

For worked examples (invoices, resumes, entities), see examples.md
Need summaries instead of structured data? Use `/ai-summarizing`
AI missing items on complex inputs? Use `/ai-decomposing-tasks` to break extraction into reliable subtasks
Next: `/ai-improving-accuracy` to measure and improve your parser
