regex-builder
Regex Builder
Transforms matching requirements (positive and negative examples) into tested regex
patterns with component-by-component explanations, capture group documentation, edge
case identification, and ready-to-use code in Python and JavaScript.
Reference Files
| File | Contents | Load When |
|---|---|---|
| | Character class reference, Unicode categories, POSIX classes | Always |
| | Quantifier behavior, greedy vs lazy vs possessive, backtracking | Pattern needs repetition |
| | Validated patterns for email, URL, phone, IP, date, UUID, etc. | Common validation requested |
| | Syntax differences between Python, JavaScript, PCRE, POSIX | Multi-language usage needed |
Prerequisites
- Clear specification: what should match and what should not
- Target regex flavor (Python `re`, JavaScript, PCRE); defaults to Python
Workflow
Phase 1: Collect Examples
Gather positive (should match) and negative (should not match) examples:
- From user — Explicit examples provided
- From context — If the user says "match email addresses," infer standard positive and negative examples
- From data — If sample data is provided, identify the pattern within it
Minimum: 3 positive examples and 3 negative examples. Fewer examples risk overfitting
the pattern to specific cases.
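The minimum-example rule can be checked mechanically before any pattern is written. The following is a minimal sketch, not part of the skill itself: `ExampleSet`, `problems`, and `MIN_EXAMPLES` are hypothetical names introduced for illustration, and the sample strings are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MIN_EXAMPLES = 3  # minimum per side, as stated above


@dataclass
class ExampleSet:
    """Positive and negative examples for one pattern (hypothetical helper)."""

    positive: list[str] = field(default_factory=list)
    negative: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the set is usable."""
        issues = []
        if len(set(self.positive)) < MIN_EXAMPLES:
            issues.append(f"need at least {MIN_EXAMPLES} distinct positive examples")
        if len(set(self.negative)) < MIN_EXAMPLES:
            issues.append(f"need at least {MIN_EXAMPLES} distinct negative examples")
        # The same string cannot be both a match and a non-match.
        for text in sorted(set(self.positive) & set(self.negative)):
            issues.append(f"contradictory example: {text!r}")
        return issues


examples = ExampleSet(
    positive=["A123", "b4567", "Z00000"],
    negative=["123", "AB123", "A12"],
)
print(examples.problems())  # an empty list: the set meets the minimum
```

Counting distinct strings rather than raw entries is a design choice of this sketch: three copies of the same example say no more about the pattern than one.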
Phase 2: Infer Pattern
Analyze the examples to build a pattern:
- Identify fixed literals — Characters that appear in the same position across all positive examples
- Identify character classes — Positions where different characters appear but follow a pattern (digits, letters, alphanumeric)
- Identify repetition — Elements that appear a variable number of times
- Identify optional elements — Parts present in some positive examples but not others
- Identify anchoring — Must the pattern match the entire string or can it be a substring?
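The first two steps (fixed literals and character classes) can be sketched for the simplest case, where every positive example has the same length. This is an illustrative toy under that assumption, with hypothetical function names; it does not attempt repetition or optional elements, and its output still needs review against the negative examples.

```python
from __future__ import annotations

import re


def classify(chars: set[str]) -> str:
    """Map the characters seen at one position to a regex fragment."""
    if len(chars) == 1:
        return re.escape(next(iter(chars)))  # fixed literal
    if all(c.isascii() and c.isdigit() for c in chars):
        return r"\d"
    if all(c.isascii() and c.isalpha() for c in chars):
        return "[A-Za-z]"
    if all(c.isascii() and c.isalnum() for c in chars):
        return "[A-Za-z0-9]"
    return "."  # fall back to "any character"; review by hand


def infer_fixed_length(positives: list[str]) -> str:
    """Naive inference for examples that all share one length."""
    if len({len(p) for p in positives}) != 1:
        raise ValueError("this sketch only handles equal-length examples")
    # zip(*positives) yields one tuple of characters per position.
    fragments = [classify(set(column)) for column in zip(*positives)]
    return "^" + "".join(fragments) + "$"  # anchored: whole-string match


pattern = infer_fixed_length(["AB-123", "CD-456", "EF-789"])
print(pattern)
print(bool(re.search(pattern, "XY-000")), bool(re.search(pattern, "XY_000")))
```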
Phase 3: Explain Pattern
Break down the pattern into a component table:
| Component | Meaning |
|---|---|
| `^` | Start of string |
| `[A-Za-z]` | One letter (upper or lower) |
| `\d{3,5}` | 3 to 5 digits |
| `$` | End of string |
Document capture groups separately if the pattern uses them.
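Assuming the pattern being explained is `^[A-Za-z]\d{3,5}$` (one letter followed by 3 to 5 digits, which is what the meanings in the table describe), the same breakdown can be kept next to the pattern itself with Python's `re.VERBOSE` flag. The name `CODE` and the sample inputs are illustrative.

```python
import re

# Assumed example pattern: one letter followed by 3 to 5 digits.
CODE = re.compile(
    r"""
    ^            # start of string
    [A-Za-z]     # one letter (upper or lower)
    \d{3,5}      # 3 to 5 digits
    $            # end of string
    """,
    re.VERBOSE,
)

for candidate in ["A123", "b45678", "A12", "AB123", "A123456"]:
    print(f"{candidate!r}: {bool(CODE.search(candidate))}")
```

With `re.VERBOSE`, whitespace and `#` comments outside character classes are ignored, so the component table and the pattern cannot drift apart.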
Phase 4: Generate Edge Cases
For every pattern, identify inputs that are likely to cause problems:
- Empty string — Does the pattern handle it correctly?
- Almost-matching strings — One character off from a valid match
- Boundary lengths — Minimum and maximum valid lengths
- Special characters — Dots, brackets, backslashes in the input
- Unicode — Multi-byte characters, emoji, diacritics
- Catastrophic backtracking — Inputs that cause exponential matching time
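Most of these categories can be probed with a small table of inputs. The sketch below assumes an example pattern of one letter followed by 3 to 5 digits; `near_misses` and `probes` are hypothetical names. The output is a list for human review, since a string one edit away from a valid input may itself be valid.

```python
from __future__ import annotations

import re

PATTERN = re.compile(r"[A-Za-z]\d{3,5}")  # assumed example pattern


def near_misses(valid: str) -> list[str]:
    """Strings one edit away: each character dropped, then each doubled."""
    dropped = [valid[:i] + valid[i + 1:] for i in range(len(valid))]
    doubled = [valid[:i] + valid[i] + valid[i:] for i in range(len(valid))]
    return dropped + doubled


probes = {
    "empty string": [""],
    "boundary lengths": ["A123", "A12345", "A12", "A123456"],
    "special characters": ["A.123", "A\\123", "[A123]"],
    # Accented letter; fullwidth digits. In Python, \d on a str pattern also
    # matches non-ASCII decimal digits unless re.ASCII is set.
    "unicode": ["\u00c1123", "A\uff11\uff12\uff13"],
    "near misses": near_misses("A123"),
}

for label, inputs in probes.items():
    for text in inputs:
        verdict = "match" if PATTERN.fullmatch(text) else "no match"
        print(f"{label:20} {text!r:12} {verdict}")
```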
Phase 5: Output
Produce the pattern, explanation, test cases, and usage examples.
Output Format
Regex Pattern: {Brief Description}
Requirements
- Must match: {description of valid inputs}
- Must reject: {description of invalid inputs}
- Flavor: {Python re | JavaScript | PCRE}
Pattern
```regex
{pattern}
```
Explanation
| Component | Meaning |
|---|---|
| `{component}` | {what it matches and why} |
Capture Groups
| Group | Name | Captures | Example |
|---|---|---|---|
| 1 | {name} | {what} | {example value} |
Test Cases
| # | Input | Should Match | Reason |
|---|---|---|---|
| 1 | `{input}` | Yes | {why: happy path} |
| 2 | `{input}` | Yes | {why: boundary} |
| 3 | `{input}` | No | {why: invalid} |
| 4 | `{input}` | No | {why: near-miss} |
| 5 | `` (empty) | No | Empty input |
Edge Cases
- {Edge case 1}: {what to watch for}
- {Edge case 2}: {what to watch for}
Usage
**Python:**
```python
import re

pattern = re.compile(r'{pattern}')

# Match entire string
if pattern.fullmatch(text):
    ...

# Search within string
match = pattern.search(text)
if match:
    captured = match.group(1)

# Find all matches
matches = pattern.findall(text)
```
**JavaScript:**
```javascript
const pattern = /{pattern}/;

// Test
if (pattern.test(text)) { ... }

// Match
const match = text.match(pattern);
if (match) {
  const captured = match[1];
}

// Find all
const matches = [...text.matchAll(/{pattern}/g)];
```
Calibration Rules
- Correctness over cleverness. A readable, slightly longer pattern is better than a cryptic short one. `[A-Za-z0-9]` is clearer than `\w` when you specifically mean alphanumeric without underscores.
- Test negatives as rigorously as positives. A pattern that matches everything technically matches all positive examples. Negative examples prevent over-matching.
- Anchor when appropriate. `^\d{3}$` matches exactly 3 digits. `\d{3}` matches 3 digits anywhere in the string. State the anchoring intent explicitly.
- Avoid catastrophic backtracking. Nested quantifiers like `(a+)+` cause exponential time on non-matching input. Test with adversarial inputs.
- Named groups over numbered groups. `(?P<year>\d{4})` (Python) or `(?<year>\d{4})` (JS) is self-documenting. Use numbered groups only for simple patterns.
- Specify the flavor. Python `re`, JavaScript, and PCRE have different feature sets. Lookaheads, lookbehinds, and Unicode support vary.
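Several of these rules can be checked directly in Python's `re`. The snippet below is a small illustration with made-up inputs; the `date` pattern is an assumed example, not one of the skill's validated patterns.

```python
import re

# \w includes the underscore (and, for str patterns, non-ASCII word characters).
assert re.fullmatch(r"\w+", "user_name")
assert not re.fullmatch(r"[A-Za-z0-9]+", "user_name")

# Anchoring changes the meaning of the same core pattern.
assert re.search(r"\d{3}", "order 12345")        # 3 digits anywhere
assert not re.search(r"^\d{3}$", "order 12345")  # whole string must be 3 digits
assert re.fullmatch(r"\d{3}", "123")             # fullmatch anchors both ends

# Named groups document themselves at the call site.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.fullmatch("2024-03-15")
print(m.group("year"), m.group("month"), m.group("day"))  # 2024 03 15
print(m.groupdict())
```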
Error Handling
| Problem | Resolution |
|---|---|
| Insufficient examples | Ask for more. Minimum 3 positive, 3 negative. |
| Contradictory examples | Flag the contradiction. Ask which examples are correct. |
| Requirements too complex for regex | Suggest a parser instead. Regex cannot handle recursive structures (nested brackets, HTML). |
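| Pattern causes catastrophic backtracking | Rewrite with atomic groups or possessive quantifiers where the flavor supports them (PCRE, Python 3.11+; not JavaScript), or restructure the pattern to remove nested quantifiers. Test with worst-case input. |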
| Unicode requirements unclear | Ask if the pattern needs to handle non-ASCII. Default to ASCII unless specified. |
| Multiple valid patterns | Present the simplest one. Mention alternatives if they have meaningful tradeoffs (performance vs readability). |
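For the backtracking row, restructuring is often enough even without atomic groups. The sketch below compares the nested form `(a+)+$` with the flat form `a+$`, which accepts the same strings. The `seconds` helper and the input sizes are illustrative; timings depend on the machine and Python version, so none are shown.

```python
import re
import time

nested = re.compile(r"(a+)+$")  # nested quantifiers: prone to catastrophic backtracking
flat = re.compile(r"a+$")       # accepts the same strings without the nesting

# Both patterns agree on which inputs match.
for text in ["a", "aaaa", "", "b", "aab", "baa"]:
    assert bool(nested.match(text)) == bool(flat.match(text))


def seconds(pattern: re.Pattern, text: str) -> float:
    """Wall-clock time for one match attempt."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


# Adversarial input: a run of "a" followed by a character that forces failure.
for n in (10, 14, 18):
    text = "a" * n + "b"
    print(f"n={n:2}  nested={seconds(nested, text):.4f}s  flat={seconds(flat, text):.6f}s")
```

On a backtracking engine the nested column typically grows much faster than the flat one as `n` increases, which is why `n` is kept small here.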
When NOT to Build Regex
Push back if:
- The input requires parsing a recursive grammar (HTML, JSON, nested expressions) — use a parser
- The validation is for a standard format with a library (email validation, URL parsing) — use the standard library
- The pattern is for security-critical input validation as the sole defense — regex is a first filter, not a security boundary
- The user wants to modify matched content in complex ways — regex replacement has limits; suggest code instead
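For the first two cases, Python's standard library already covers common formats. A brief sketch follows, with hypothetical helper names `is_ip` and `http_host`; the URL and JSON strings are made-up inputs.

```python
import ipaddress
import json
from typing import Optional
from urllib.parse import urlsplit


def is_ip(text: str) -> bool:
    """Validate an IPv4 or IPv6 address with the standard library, not a regex."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def http_host(url: str) -> Optional[str]:
    """Return the host of an http(s) URL, or None for any other scheme."""
    parts = urlsplit(url)
    return parts.hostname if parts.scheme in ("http", "https") else None


print(is_ip("192.168.0.1"), is_ip("256.1.1.1"), is_ip("::1"))  # True False True
print(http_host("https://example.com:8443/path?q=1"))          # example.com

# Nested structures are a job for a parser: json.loads handles arbitrary depth.
print(json.loads('{"a": {"b": [1, 2, {"c": 3}]}}')["a"]["b"][2]["c"])  # 3
```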