Regex vs LLM for Structured Text Parsing

A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.

When to Activate

  • Parsing structured text with repeating patterns (questions, forms, tables)
  • Deciding between regex and LLM for text extraction
  • Building hybrid pipelines that combine both approaches
  • Optimizing cost/accuracy tradeoffs in text processing

Decision Framework

Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly

Architecture Pattern

Source Text
[Regex Parser] ─── Extracts structure (95-98% accuracy)
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
[Confidence Scorer] ─── Flags low-confidence extractions
    ├── High confidence (≥0.95) → Direct output
    └── Low confidence (<0.95) → [LLM Validator] → Output
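The [Text Cleaner] stage above is not spelled out in the implementation that follows. A minimal sketch of what it might look like — the specific noise patterns here (standalone page-number lines, runs of blank lines) are assumptions and will vary per corpus:

```python
import re

def clean_text(content: str) -> str:
    """Strip common noise before the regex parser runs."""
    # Drop standalone page-number lines such as "Page 3" or "- 3 -"
    # (assumed noise formats; adjust to your source documents).
    cleaned = re.sub(
        r"^\s*(?:Page\s+\d+|-\s*\d+\s*-)\s*$",
        "",
        content,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    # Collapse the blank-line runs left behind by the removals.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()
```

Running the cleaner before the parser keeps the extraction regex simple: it only has to describe the pattern, not every artifact around it.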

Implementation

1. Regex Parser (Handles the Majority)

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0

def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
```
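As a quick check, the pattern matches a minimal two-choice sample like this (the pattern is repeated here so the snippet runs on its own):

```python
import re

# Same pattern as in parse_structured_text above.
pattern = re.compile(
    r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
    r"(?P<choices>(?:[A-D]\..+?\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE | re.DOTALL,
)

sample = (
    "1. What is 2 + 2?\n"
    "A. 3\n"
    "B. 4\n"
    "Answer: B\n"
)

match = pattern.search(sample)
assert match is not None
print(match.group("id"), match.group("answer"))  # 1 B
```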

2. Confidence Scoring

Flag items that may need LLM review:
```python
@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]

def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0

    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3

    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5

    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2

    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )

def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]
```
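With the 0.95 default threshold, the scoring rules route items as follows — a standalone sketch using simplified stand-ins for the classes above (the names and example items are illustrative):

```python
from dataclasses import dataclass

# Simplified stand-in for ParsedItem / score_confidence, so this runs alone.
@dataclass(frozen=True)
class Item:
    id: str
    text: str
    choices: tuple
    answer: str

def quick_score(item: Item) -> float:
    score = 1.0
    if len(item.choices) < 3:
        score -= 0.3  # few_choices
    if not item.answer:
        score -= 0.5  # missing_answer
    if len(item.text) < 10:
        score -= 0.2  # short_text
    return max(0.0, score)

clean = Item("1", "What is the capital of France?", ("Paris", "Rome", "Berlin"), "A")
broken = Item("2", "2+2?", ("4",), "")

print(quick_score(clean))   # 1.0 -> above threshold, direct output
print(quick_score(broken))  # 0.0 -> below threshold, route to LLM validator
```

Note that a single issue (e.g. only two choices, score 0.7) is already enough to fall below 0.95 and trigger review; the deductions stack for compound problems.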

3. LLM Validator (Edge Cases Only)

```python
import json

def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                f"Return corrected JSON if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    # Minimal response handling: assumes the model replies with either the
    # literal 'CORRECT' or a JSON object of corrected fields.
    reply = response.content[0].text.strip()
    if reply == "CORRECT":
        return item
    data = json.loads(reply)
    return ParsedItem(
        id=item.id,
        text=data.get("text", item.text),
        choices=tuple(data.get("choices", item.choices)),
        answer=data.get("answer", item.answer),
    )
```

4. Hybrid Pipeline

```python
def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: Regex extraction (handles 95-98%)
    items = parse_structured_text(content)

    # Step 2: Confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)

    if not low_confidence or llm_client is None:
        return items

    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
        if item.id in low_conf_ids:
            result.append(validate_with_llm(item, content, llm_client))
        else:
            result.append(item)

    return result
```

Real-World Metrics

From a production quiz parsing pipeline (410 items):
| Metric | Value |
| --- | --- |
| Regex success rate | 98.0% |
| Low-confidence items | 8 (2.0%) |
| LLM calls needed | ~5 |
| Cost savings vs. all-LLM | ~95% |
| Test coverage | 93% |

Best Practices

  • Start with regex — even imperfect regex gives you a baseline to improve
  • Use confidence scoring to programmatically identify what needs LLM help
  • Use the cheapest LLM for validation (Haiku-class models are sufficient)
  • Never mutate parsed items — return new instances from cleaning/validation steps
  • TDD works well for parsers — write tests for known patterns first, then edge cases
  • Log metrics (regex success rate, LLM call count) to track pipeline health
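The metrics-logging practice can be as simple as a frozen record computed per run — a sketch with illustrative field names, using the production numbers from the table above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineMetrics:
    total_items: int
    low_confidence_items: int
    llm_calls: int

    @property
    def regex_success_rate(self) -> float:
        """Fraction of items the regex parser handled without flags."""
        if self.total_items == 0:
            return 0.0
        return 1 - self.low_confidence_items / self.total_items

run = PipelineMetrics(total_items=410, low_confidence_items=8, llm_calls=5)
print(f"regex success: {run.regex_success_rate:.1%}")  # regex success: 98.0%
```

Emitting one such record per document (to logs or a metrics store) makes regressions in the regex patterns visible as a drop in `regex_success_rate` before they show up as cost spikes.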

Anti-Patterns to Avoid

  • Sending all text to an LLM when regex handles 95%+ of cases (expensive and slow)
  • Using regex for free-form, highly variable text (LLM is better here)
  • Skipping confidence scoring and hoping regex "just works"
  • Mutating parsed objects during cleaning/validation steps
  • Not testing edge cases (malformed input, missing fields, encoding issues)

When to Use

  • Quiz/exam question parsing
  • Form data extraction
  • Invoice/receipt processing
  • Document structure parsing (headers, sections, tables)
  • Any structured text with repeating patterns where cost matters