ai-parsing-data
Build an AI Data Parser
Guide the user through building an AI parser that pulls structured data out of messy text. The parser uses DSPy extraction: define the fields you want, and the model extracts them.
Step 1: Define what to extract
Ask the user:
- What are you parsing? (emails, invoices, resumes, articles, forms, etc.)
- What fields do you need? (names, dates, amounts, entities, etc.)
- What's the output format? (flat fields, list of objects, nested structure)
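To make the output-format question concrete, here is a minimal sketch of the three shapes using standard-library dataclasses. The class names (Contact, LineItem, Invoice) and the sample values are hypothetical and not part of DSPy; in the actual parser these shapes would be expressed as signature fields or Pydantic models.

```python
from dataclasses import dataclass, field


# Flat fields: each value is a single string or number.
@dataclass
class Contact:
    name: str
    email: str


# One element of a "list of objects" output.
@dataclass
class LineItem:
    description: str
    amount: float


# Nested structure: an object that holds another object plus a list of objects.
@dataclass
class Invoice:
    vendor: Contact
    items: list[LineItem] = field(default_factory=list)


# Hypothetical sample data, only to show how the shapes compose.
invoice = Invoice(
    vendor=Contact(name="Acme Corp", email="billing@acme.example"),
    items=[LineItem("Widget", 19.99), LineItem("Shipping", 5.00)],
)
total = sum(item.amount for item in invoice.items)
```

Agreeing on the shape first makes it easier to decide later whether plain output fields are enough or a nested model is needed.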
Step 2: Build the parser
Simple field extraction
python
import dspy

class ParseContact(dspy.Signature):
    """Extract contact information from the text."""

    text: str = dspy.InputField(desc="Text containing contact information")
    name: str = dspy.OutputField(desc="Person's full name")
    email: str = dspy.OutputField(desc="Email address")
    phone: str = dspy.OutputField(desc="Phone number")

parser = dspy.ChainOfThought(ParseContact)

Structured output with Pydantic
For complex or nested output, use Pydantic models:
python
import dspy
from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address  # nested model
    skills: list[str]

class ParsePerson(dspy.Signature):
    """Extract person details from the text."""

    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

parser = dspy.ChainOfThought(ParsePerson)
result = parser(text="John Doe, 32, lives at 123 Main St, Springfield IL 62701. Expert in Python and SQL.")
print(result.person)  # Person(name='John Doe', age=32, ...)

List extraction
Extract multiple items from text:
python
class Entity(BaseModel):
    name: str
    type: str = Field(description="Type: person, organization, location, or date")

class ParseEntities(dspy.Signature):
    """Extract all named entities from the text."""

    text: str = dspy.InputField()
    entities: list[Entity] = dspy.OutputField(desc="All entities found in the text")

parser = dspy.ChainOfThought(ParseEntities)

Step 3: Handle messy data
Use assertions to catch bad extractions:
python
class ValidatedParser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.parse = dspy.ChainOfThought(ParseContact)

    def forward(self, text):
        result = self.parse(text=text)
        # Soft constraints: a failed dspy.Suggest is meant to feed the message
        # back to the model for a retry rather than stop the program.
        # Note: dspy.Suggest and dspy.Assert belong to the DSPy assertions API;
        # newer DSPy releases point to dspy.Refine and dspy.BestOfN instead.
        dspy.Suggest(
            "@" in result.email,
            "Email should contain @",
        )
        # Count digits only, so punctuation such as "(555) 123-4567" is ignored.
        dspy.Suggest(
            sum(ch.isdigit() for ch in result.phone) >= 10,
            "Phone number should have at least 10 digits",
        )
        return result

Step 4: Test the quality
python
def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0

For Pydantic outputs, compare the model objects directly or field-by-field.
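One way to do the field-by-field comparison for nested output is to flatten both objects into path/value pairs and count the matches. The sketch below works on plain dicts, assuming the Pydantic models were first converted with `model_dump()` (Pydantic v2). The helper names `flatten`, `normalize`, and `nested_field_score` are hypothetical, and list items are compared by position, which may be too strict for unordered lists.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {"path.to.field": value} pairs."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


def normalize(value):
    """Compare strings case-insensitively, ignoring surrounding whitespace."""
    return value.strip().lower() if isinstance(value, str) else value


def nested_field_score(expected, predicted):
    """Fraction of expected leaf fields that the prediction matches."""
    expected_flat = flatten(expected)
    predicted_flat = flatten(predicted)
    if not expected_flat:
        return 0.0
    hits = sum(
        1
        for path, value in expected_flat.items()
        if path in predicted_flat and normalize(predicted_flat[path]) == normalize(value)
    )
    return hits / len(expected_flat)


# Hypothetical gold label and prediction, e.g. from person.model_dump().
expected = {
    "name": "John Doe",
    "age": 32,
    "address": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip_code": "62701"},
    "skills": ["Python", "SQL"],
}
predicted = {
    "name": "john doe ",
    "age": 32,
    "address": {"street": "123 Main St", "city": "Springfield", "state": "Illinois", "zip_code": "62701"},
    "skills": ["Python"],
}
score = nested_field_score(expected, predicted)
print(score)  # 0.75: six of eight leaf fields match
```

Scoring leaves rather than whole objects gives partial credit when one nested field is wrong, which a direct `==` on the models would not.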
Step 5: Improve accuracy
python
# trainset is assumed to be a list of labeled dspy.Example objects prepared beforehand,
# and parser is the ParseContact parser that parsing_metric was written for.
optimizer = dspy.BootstrapFewShot(metric=parsing_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(parser, trainset=trainset)

Key patterns
- Use Pydantic models for complex structured output: DSPy handles serialization
- Use list[Model] to extract variable-length lists of items
- ChainOfThought helps: reasoning through which text maps to which fields improves accuracy
- Validate with assertions: dspy.Suggest and dspy.Assert catch malformed extractions
- Partial credit metrics: score field-by-field rather than all-or-nothing
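For the validation pattern, the boolean checks handed to dspy.Suggest can be kept as small plain-Python predicates, which makes them testable without calling a model. This is a minimal sketch: the names `looks_like_email`, `has_enough_digits`, and `find_problems` are hypothetical, and the email pattern is deliberately loose rather than standards-compliant.

```python
import re

# Deliberately loose pattern: something@something.tld, with no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value: str) -> bool:
    """True when the value has the rough shape of an email address."""
    return bool(EMAIL_PATTERN.match(value.strip()))


def has_enough_digits(value: str, minimum: int = 10) -> bool:
    """True when the value contains at least `minimum` digit characters."""
    return sum(ch.isdigit() for ch in value) >= minimum


def find_problems(email: str, phone: str) -> list[str]:
    """Collect a readable message for every check that fails."""
    problems = []
    if not looks_like_email(email):
        problems.append("Email should look like name@domain.tld")
    if not has_enough_digits(phone):
        problems.append("Phone number should have at least 10 digits")
    return problems


print(find_problems("jane@example.com", "(555) 123-4567"))
print(find_problems("jane at example", "555-1234"))
```

Each predicate returns a bool, so it can stand in for the inline expression passed as the first argument to dspy.Suggest.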
Additional resources
- For worked examples (invoices, resumes, entities), see examples.md
- Need summaries instead of structured data? Use /ai-summarizing
- AI missing items on complex inputs? Use /ai-decomposing-tasks to break extraction into reliable subtasks
- Next: /ai-improving-accuracy to measure and improve your parser