Build an AI Data Parser

Guide the user through building an AI parser that pulls structured data out of messy text. The parser uses DSPy extraction: define the fields you want, and the model extracts them.

Step 1: Define what to extract

Ask the user:
  1. What are you parsing? (emails, invoices, resumes, articles, forms, etc.)
  2. What fields do you need? (names, dates, amounts, entities, etc.)
  3. What's the output format? (flat fields, list of objects, nested structure)

Step 2: Build the parser

Simple field extraction

python
import dspy

class ParseContact(dspy.Signature):
    """Extract contact information from the text."""
    text: str = dspy.InputField(desc="Text containing contact information")
    name: str = dspy.OutputField(desc="Person's full name")
    email: str = dspy.OutputField(desc="Email address")
    phone: str = dspy.OutputField(desc="Phone number")

parser = dspy.ChainOfThought(ParseContact)

Structured output with Pydantic

For complex or nested output, use Pydantic models:
python
from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    skills: list[str]

class ParsePerson(dspy.Signature):
    """Extract person details from the text."""
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

parser = dspy.ChainOfThought(ParsePerson)
result = parser(text="John Doe, 32, lives at 123 Main St, Springfield IL 62701. Expert in Python and SQL.")
print(result.person)  # Person(name='John Doe', age=32, ...)

List extraction

Extract multiple items from text:
python
class Entity(BaseModel):
    name: str
    type: str = Field(description="Type: person, organization, location, or date")

class ParseEntities(dspy.Signature):
    """Extract all named entities from the text."""
    text: str = dspy.InputField()
    entities: list[Entity] = dspy.OutputField(desc="All entities found in the text")

parser = dspy.ChainOfThought(ParseEntities)

Step 3: Handle messy data

Use assertions to catch bad extractions:
python
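class ValidatedParser(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before registering submodules
        self.parse = dspy.ChainOfThought(ParseContact)

    def forward(self, text):
        result = self.parse(text=text)
        # Soft constraint: flag an email value that has no "@" in it.
        dspy.Suggest(
            "@" in result.email,
            "Email should contain @"
        )
        # Count digit characters only, so separators like "-" or "()" are ignored.
        dspy.Suggest(
            sum(ch.isdigit() for ch in result.phone) >= 10,
            "Phone number should have at least 10 digits"
        )
        return result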
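
The two checks in `ValidatedParser` can also be written as plain functions, which makes them testable without calling a model. The sketch below is one possible version, assuming a deliberately loose email pattern and a 10-digit minimum. The helper names `looks_like_email` and `has_enough_digits` are hypothetical and not part of DSPy.

```python
import re

# Loose pattern (an assumption for illustration): something@something.tld, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value: str) -> bool:
    """Return True if the value has the rough shape of an email address."""
    return bool(EMAIL_PATTERN.match(value.strip()))


def has_enough_digits(value: str, minimum: int = 10) -> bool:
    """Return True if the value contains at least `minimum` digit characters."""
    return sum(ch.isdigit() for ch in value) >= minimum


print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("jane.doe at example"))   # False
print(has_enough_digits("(555) 123-4567"))       # True
print(has_enough_digits("555-1234"))             # False
```

Either function could serve as the condition in a `dspy.Suggest` call in place of the inline expressions.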

Step 4: Test the quality

python
def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0
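
To see how the partial credit works out, the metric can be run on a hand-made pair of objects. This is a minimal sketch: `SimpleNamespace` stands in for the DSPy example and prediction objects (an assumption made so the snippet runs without a model), and the contact values are invented for illustration.

```python
from types import SimpleNamespace


def parsing_metric(example, prediction, trace=None):
    """Same field-level metric as above, repeated so the sketch runs on its own."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0


# SimpleNamespace is a stand-in for dspy.Example / dspy.Prediction (illustrative assumption).
example = SimpleNamespace(name="Jane Doe", email="jane@example.com", phone="555-123-4567")
prediction = SimpleNamespace(name="jane doe ", email="jane@example.com", phone="555 123 4567")

score = parsing_metric(example, prediction)
print(round(score, 2))  # 0.67: name and email match, phone differs only in formatting
```

The phone field fails here purely because of separators, which suggests normalizing values (for example, keeping digits only) before comparing if formatting differences should not count as errors.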
For Pydantic outputs, compare the model objects directly or field-by-field.
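
One way to do the field-by-field comparison is to flatten both objects to dotted paths and count matching leaf values. The sketch below works on plain dicts, assuming the models have already been converted (for example with `model_dump()` in Pydantic v2). The helpers `flatten` and `nested_field_accuracy` are hypothetical names, and the sample values reuse the `Person` example from Step 2.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {dotted_path: leaf_value}."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def nested_field_accuracy(expected: dict, predicted: dict) -> float:
    """Fraction of expected leaf fields whose predicted value matches exactly."""
    expected_flat = flatten(expected)
    predicted_flat = flatten(predicted)
    if not expected_flat:
        return 0.0
    matches = sum(
        1 for path, value in expected_flat.items()
        if path in predicted_flat and predicted_flat[path] == value
    )
    return matches / len(expected_flat)


expected = {
    "name": "John Doe",
    "age": 32,
    "address": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip_code": "62701"},
    "skills": ["Python", "SQL"],
}
predicted = {
    "name": "John Doe",
    "age": 32,
    "address": {"street": "123 Main St", "city": "Springfield", "state": "Illinois", "zip_code": "62701"},
    "skills": ["Python", "SQL"],
}

print(nested_field_accuracy(expected, predicted))  # 0.875 (7 of 8 leaf fields match)
```

List items are compared by position in this sketch, so a reordered `skills` list would lose credit; comparing lists as sets is an alternative when order does not matter.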

Step 5: Improve accuracy

python
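# trainset: a list of labeled dspy.Example objects, e.g.
# dspy.Example(text=..., name=..., email=..., phone=...).with_inputs("text")
optimizer = dspy.BootstrapFewShot(metric=parsing_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(parser, trainset=trainset)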

Key patterns

  • Use Pydantic models for complex structured output: DSPy handles serialization
  • Use list[Model] to extract variable-length lists of items
  • ChainOfThought helps: reasoning through which text maps to which fields improves accuracy
  • Validate with assertions: dspy.Suggest and dspy.Assert catch malformed extractions
  • Partial credit metrics: score field-by-field rather than all-or-nothing

Additional resources

  • For worked examples (invoices, resumes, entities), see examples.md
  • Need summaries instead of structured data? Use /ai-summarizing
  • AI missing items on complex inputs? Use /ai-decomposing-tasks to break extraction into reliable subtasks
  • Next: /ai-improving-accuracy to measure and improve your parser