regex-builder

Regex Builder

Transforms matching requirements (positive and negative examples) into tested regex patterns with component-by-component explanations, capture group documentation, edge case identification, and ready-to-use code in Python and JavaScript.

Reference Files

| File | Contents | Load When |
|------|----------|-----------|
| references/character-classes.md | Character class reference, Unicode categories, POSIX classes | Always |
| references/quantifiers.md | Quantifier behavior, greedy vs lazy vs possessive, backtracking | Pattern needs repetition |
| references/common-patterns.md | Validated patterns for email, URL, phone, IP, date, UUID, etc. | Common validation requested |
| references/flavor-differences.md | Syntax differences between Python, JavaScript, PCRE, POSIX | Multi-language usage needed |

Prerequisites

  • Clear specification: what should match and what should not
  • Target regex flavor (Python `re`, JavaScript, PCRE); defaults to Python

Workflow

Phase 1: Collect Examples

Gather positive (should match) and negative (should not match) examples:
  1. From user — Explicit examples provided
  2. From context — If the user says "match email addresses," infer standard positive and negative examples
  3. From data — If sample data is provided, identify the pattern within it
Minimum: 3 positive examples and 3 negative examples. Fewer examples risk overfitting the pattern to specific cases.
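One way to make the minimum-example rule checkable is a small container that reports how many examples are still missing. This is only a sketch: `ExampleSet`, `MIN_EXAMPLES`, and the sample strings are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field

MIN_EXAMPLES = 3  # minimum per side, taken from the rule above


@dataclass
class ExampleSet:
    """Hypothetical container for the examples gathered in Phase 1."""

    positive: list[str] = field(default_factory=list)
    negative: list[str] = field(default_factory=list)

    def missing(self) -> dict[str, int]:
        """Return how many more examples each side needs to reach the minimum."""
        return {
            "positive": max(0, MIN_EXAMPLES - len(self.positive)),
            "negative": max(0, MIN_EXAMPLES - len(self.negative)),
        }

    def is_sufficient(self) -> bool:
        """True when both sides have at least MIN_EXAMPLES entries."""
        return not any(self.missing().values())


examples = ExampleSet(
    positive=["A123", "b4567", "Z00000"],
    negative=["1234", "AB123"],
)
print(examples.missing())        # {'positive': 0, 'negative': 1}
print(examples.is_sufficient())  # False
```

When `is_sufficient()` is false, the `missing()` counts indicate how many more examples to ask the user for.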

Phase 2: Infer Pattern

Analyze the examples to build a pattern:
  1. Identify fixed literals — Characters that appear in the same position across all positive examples
  2. Identify character classes — Positions where different characters appear but follow a pattern (digits, letters, alphanumeric)
  3. Identify repetition — Elements that appear a variable number of times
  4. Identify optional elements — Parts present in some positive examples but not others
  5. Identify anchoring — Must the pattern match the entire string or can it be a substring?
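The character-class and repetition steps above can be sketched in Python. This is a deliberately simplified illustration, not a complete inference procedure: it assumes ASCII input, assumes full-string anchoring, does not detect fixed literal letters or optional elements, and the helper names (`char_class`, `runs`, `infer_pattern`) are made up for this example.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a regex fragment: a class for ASCII digits/letters, else an escaped literal."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)


def runs(text: str) -> list[tuple[str, int]]:
    """Collapse a string into (fragment, run_length) pairs."""
    return [(frag, len(list(group))) for frag, group in groupby(map(char_class, text))]


def quantifier(low: int, high: int) -> str:
    """Render the shortest quantifier covering run lengths low..high."""
    if low == high:
        return "" if low == 1 else f"{{{low}}}"
    return f"{{{low},{high}}}"


def infer_pattern(positives: list[str]) -> str:
    """Infer an anchored pattern; raises if the examples disagree on structure."""
    all_runs = [runs(p) for p in positives]
    shapes = {tuple(frag for frag, _ in r) for r in all_runs}
    if len(shapes) != 1:
        raise ValueError("examples have different shapes; optional parts need manual handling")
    parts = []
    for column in zip(*all_runs):
        frag = column[0][0]
        lengths = [n for _, n in column]
        parts.append(frag + quantifier(min(lengths), max(lengths)))
    return "^" + "".join(parts) + "$"


pattern = infer_pattern(["A123", "b4567", "Z00000"])
print(pattern)  # ^[A-Za-z]\d{3,5}$
```

The inferred pattern is only a starting point. It still has to be checked against the negative examples, which this sketch never looks at.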

Phase 3: Explain Pattern

Break down the pattern into a component table:
| Component | Meaning |
|-----------|---------|
| `^` | Start of string |
| `[A-Za-z]` | One letter (upper or lower) |
| `\d{3,5}` | 3 to 5 digits |
| `$` | End of string |

Document capture groups separately if the pattern uses them.
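In Python, the component table can also live next to the pattern itself by compiling with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the pattern. The sketch below uses the example components from the table. The group names `prefix` and `number` are hypothetical and only there to show how capture groups might be documented.

```python
import re

# Group names 'prefix' and 'number' are illustrative assumptions.
PATTERN = re.compile(
    r"""
    ^                      # start of string
    (?P<prefix>[A-Za-z])   # group 1 'prefix': one letter (upper or lower)
    (?P<number>\d{3,5})    # group 2 'number': 3 to 5 digits
    $                      # end of string
    """,
    re.VERBOSE,
)

match = PATTERN.search("A1234")
print(match.groupdict())  # {'prefix': 'A', 'number': '1234'}
```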

Phase 4: Generate Edge Cases

For every pattern, identify inputs that are likely to cause problems:
  1. Empty string — Does the pattern handle it correctly?
  2. Almost-matching strings — One character off from a valid match
  3. Boundary lengths — Minimum and maximum valid lengths
  4. Special characters — Dots, brackets, backslashes in the input
  5. Unicode — Multi-byte characters, emoji, diacritics
  6. Catastrophic backtracking — Inputs that cause exponential matching time
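A table-driven check is one way to exercise these edge cases. The sketch below reuses the illustrative pattern `^[A-Za-z]\d{3,5}$` with made-up inputs, and it also surfaces two behaviors of Python's `re` module that are easy to miss: `$` matches before a trailing newline, and `\d` matches non-ASCII digits in `str` patterns.

```python
import re

PATTERN = re.compile(r"^[A-Za-z]\d{3,5}$")

# (input, expected, reason) rows; the inputs are illustrative, not from a real data set.
CASES = [
    ("A123", True, "minimum length"),
    ("A12345", True, "maximum length"),
    ("", False, "empty string"),
    ("A12", False, "one digit too few"),
    ("A123456", False, "one digit too many"),
    ("A.123", False, "special character inside"),
    ("\u00c9123", False, "non-ASCII letter"),
]

failures = [
    (text, expected, reason)
    for text, expected, reason in CASES
    if bool(PATTERN.search(text)) != expected
]
print(failures)  # []

# Two surprises worth checking explicitly in Python's re module:
trailing_newline = bool(PATTERN.search("A123\n"))             # True: $ also matches before a final newline
unicode_digits = bool(PATTERN.search("A\u0661\u0662\u0663"))  # True: \d matches Arabic-Indic digits

# A stricter variant: explicit ASCII digit class plus fullmatch() instead of ^...$.
strict = re.compile(r"[A-Za-z][0-9]{3,5}")
print(bool(strict.fullmatch("A123\n")))  # False
```

Whether the last two inputs should match depends on the requirements. The point is to decide deliberately and record the decision in the test cases.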

Phase 5: Output

Produce the pattern, explanation, test cases, and usage examples.

Output Format

Regex Pattern: {Brief Description}

Requirements

  • Must match: {description of valid inputs}
  • Must reject: {description of invalid inputs}
  • Flavor: {Python re | JavaScript | PCRE}

Pattern

```regex
{pattern}
```

Explanation

| Component | Meaning |
|-----------|---------|
| `{component}` | {what it matches and why} |

Capture Groups

| Group | Name | Captures | Example |
|-------|------|----------|---------|
| 1 | {name} | {what} | {example value} |

Test Cases

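| # | Input | Should Match | Reason |
|---|-------|--------------|--------|
| 1 | `{input}` | Yes | {why: happy path} |
| 2 | `{input}` | Yes | {why: boundary} |
| 3 | `{input}` | No | {why: invalid} |
| 4 | `{input}` | No | {why: near-miss} |
| 5 | `` (empty) | No | Empty input |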

Edge Cases

  • {Edge case 1}: {what to watch for}
  • {Edge case 2}: {what to watch for}

Usage

**Python:**
```python
import re

pattern = re.compile(r'{pattern}')

# Match entire string
if pattern.fullmatch(text): ...

# Search within string
match = pattern.search(text)
if match:
    captured = match.group(1)

# Find all matches
matches = pattern.findall(text)
```

**JavaScript:**
```javascript
const pattern = /{pattern}/;

// Test
if (pattern.test(text)) { /* ... */ }

// Match
const match = text.match(pattern);
if (match) {
    const captured = match[1];
}

// Find all
const matches = [...text.matchAll(/{pattern}/g)];
```

Calibration Rules

  1. Correctness over cleverness. A readable, slightly longer pattern is better than a cryptic short one. `[A-Za-z0-9]` is clearer than `\w` when you specifically mean alphanumeric without underscores.
  2. Test negatives as rigorously as positives. A pattern that matches everything technically matches all positive examples. Negative examples prevent over-matching.
  3. Anchor when appropriate. `^\d{3}$` matches exactly 3 digits. `\d{3}` matches 3 digits anywhere in the string. State the anchoring intent explicitly.
  4. Avoid catastrophic backtracking. Nested quantifiers like `(a+)+` cause exponential time on non-matching input. Test with adversarial inputs.
  5. Named groups over numbered groups. `(?P<year>\d{4})` (Python) or `(?<year>\d{4})` (JS) is self-documenting. Use numbered groups only for simple patterns.
  6. Specify the flavor. Python `re`, JavaScript, and PCRE have different feature sets. Lookaheads, lookbehinds, and Unicode support vary.
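Rules 1, 3, and 5 are easy to demonstrate with Python's `re` module. The inputs below are made up for illustration, and the date-like pattern in the last part is only an example of named groups, not a validated date pattern.

```python
import re

# Rule 1: \w also accepts underscores, so the explicit class states the intent more clearly.
word_class = bool(re.fullmatch(r"\w+", "user_name"))
explicit_class = bool(re.fullmatch(r"[A-Za-z0-9]+", "user_name"))
print(word_class, explicit_class)  # True False

# Rule 3: the anchored pattern rejects what the unanchored one finds as a substring.
anchored = bool(re.search(r"^\d{3}$", "abc1234"))
unanchored = bool(re.search(r"\d{3}", "abc1234"))
print(anchored, unanchored)  # False True

# Rule 5: named groups read better at the call site than numeric indexes.
m = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
print(m.group("year"), m.group("month"))  # 2024 05
```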

Error Handling

| Problem | Resolution |
|---------|------------|
| Insufficient examples | Ask for more. Minimum 3 positive, 3 negative. |
| Contradictory examples | Flag the contradiction. Ask which examples are correct. |
| Requirements too complex for regex | Suggest a parser instead. Most regex flavors cannot handle recursive structures (nested brackets, HTML). |
| Pattern causes catastrophic backtracking | Rewrite with atomic groups or possessive quantifiers where the target flavor supports them; otherwise restructure the pattern to remove nested quantifiers. Test with worst-case input. |
| Unicode requirements unclear | Ask if the pattern needs to handle non-ASCII. Default to ASCII unless specified. |
| Multiple valid patterns | Present the simplest one. Mention alternatives if they have meaningful tradeoffs (performance vs readability). |
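For the backtracking case, a rewrite does not always need special syntax. The sketch below removes the nested quantifier from the illustrative pattern `(a+)+`, checks that both versions agree on a few sample inputs, and times them on a worst-case input. Atomic groups and possessive quantifiers are available in Python's `re` only from version 3.11, so this example deliberately avoids them. Timings are printed without expected values because they vary by machine.

```python
import re
import time

nested = re.compile(r"^(a+)+$")  # nested quantifier: many ways to split a run of 'a'
flat = re.compile(r"^a+$")       # accepts the same strings with only one way to match

# The two patterns accept and reject the same sample inputs.
samples = ["a", "aaaa", "", "b", "aaab", "baaa"]
agree = all(bool(nested.search(s)) == bool(flat.search(s)) for s in samples)
print(agree)  # True


def time_search(pattern: re.Pattern, text: str) -> float:
    """Return the wall-clock seconds taken by one search call."""
    start = time.perf_counter()
    pattern.search(text)
    return time.perf_counter() - start


# Worst case: a run of 'a' followed by a character that forces the match to fail.
# The run is kept short on purpose; each extra 'a' roughly doubles the nested pattern's work.
adversarial = "a" * 18 + "b"
print(f"nested: {time_search(nested, adversarial):.4f}s")
print(f"flat:   {time_search(flat, adversarial):.4f}s")
```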

When NOT to Build Regex

Push back if:
  • The input requires parsing a recursive grammar (HTML, JSON, nested expressions) — use a parser
  • The validation is for a standard format with a library (email validation, URL parsing) — use the standard library
  • The pattern is for security-critical input validation as the sole defense — regex is a first filter, not a security boundary
  • The user wants to modify matched content in complex ways — regex replacement has limits; suggest code instead
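As a sketch of the first two points, Python's standard library already covers several formats that are tempting to validate with regex. The helper `is_ip_address` and the sample inputs are illustrative. `ipaddress`, `urllib.parse`, and `json` are standard library modules.

```python
import ipaddress
import json
from urllib.parse import urlsplit


def is_ip_address(text: str) -> bool:
    """Validate an IPv4 or IPv6 address with the ipaddress module instead of a regex."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


print(is_ip_address("192.168.0.1"))  # True
print(is_ip_address("256.1.1.1"))    # False
print(is_ip_address("::1"))          # True

# URL parsing: the components come back as attributes, with no capture groups to maintain.
parts = urlsplit("https://example.com:8443/path?q=1")
print(parts.hostname, parts.port)  # example.com 8443

# Recursive grammar: nested JSON is handled by a real parser.
data = json.loads('{"a": {"b": [1, 2, {"c": null}]}}')
print(data["a"]["b"][2])  # {'c': None}
```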