Regex Builder
Transforms matching requirements (positive and negative examples) into tested regex
patterns with component-by-component explanations, capture group documentation, edge
case identification, and ready-to-use code in Python and JavaScript.
Reference Files
| File | Contents | Load When |
|---|
references/character-classes.md
| Character class reference, Unicode categories, POSIX classes | Always |
references/quantifiers.md
| Quantifier behavior, greedy vs lazy vs possessive, backtracking | Pattern needs repetition |
references/common-patterns.md
| Validated patterns for email, URL, phone, IP, date, UUID, etc. | Common validation requested |
references/flavor-differences.md
| Syntax differences between Python, JavaScript, PCRE, POSIX | Multi-language usage needed |
Prerequisites
- Clear specification: what should match and what should not
- Target regex flavor (Python , JavaScript, PCRE) — defaults to Python
Workflow
Phase 1: Collect Examples
Gather positive (should match) and negative (should not match) examples:
- From user — Explicit examples provided
- From context — If the user says "match email addresses," infer standard positive
and negative examples
- From data — If sample data is provided, identify the pattern within it
Minimum: 3 positive examples and 3 negative examples. Fewer examples risk overfitting
the pattern to specific cases.
Phase 2: Infer Pattern
Analyze the examples to build a pattern:
- Identify fixed literals — Characters that appear in the same position across all
positive examples
- Identify character classes — Positions where different characters appear but follow
a pattern (digits, letters, alphanumeric)
- Identify repetition — Elements that appear a variable number of times
- Identify optional elements — Parts present in some positive examples but not others
- Identify anchoring — Must the pattern match the entire string or can it be a substring?
Phase 3: Explain Pattern
Break down the pattern into a component table:
| Component | Meaning |
|---|
| Start of string |
| One letter (upper or lower) |
| 3 to 5 digits |
| End of string |
Document capture groups separately if the pattern uses them.
Phase 4: Generate Edge Cases
For every pattern, identify inputs that are likely to cause problems:
- Empty string — Does the pattern handle it correctly?
- Almost-matching strings — One character off from a valid match
- Boundary lengths — Minimum and maximum valid lengths
- Special characters — Dots, brackets, backslashes in the input
- Unicode — Multi-byte characters, emoji, diacritics
- Catastrophic backtracking — Inputs that cause exponential matching time
Phase 5: Output
Produce the pattern, explanation, test cases, and usage examples.
Output Format
## Regex Pattern: {Brief Description}
### Requirements
- **Must match:** {description of valid inputs}
- **Must reject:** {description of invalid inputs}
- **Flavor:** {Python re | JavaScript | PCRE}
### Pattern
```regex
{pattern}
Explanation
| Component | Meaning |
|---|
| {what it matches and why} |
Capture Groups
| Group | Name | Captures | Example |
|---|
| 1 | {name} | {what} | {example value} |
Test Cases
| # | Input | Should Match | Reason |
|---|
| 1 | | Yes | {why — happy path} |
| 2 | | Yes | {why — boundary} |
| 3 | | No | {why — invalid} |
| 4 | | No | {why — near-miss} |
| 5 | `` (empty) | No | Empty input |
Edge Cases
- {Edge case 1}: {what to watch for}
- {Edge case 2}: {what to watch for}
Usage
Python:
python
import re
pattern = re.compile(r'{pattern}')
# Match entire string
if pattern.fullmatch(text):
...
# Search within string
match = pattern.search(text)
if match:
captured = match.group(1)
# Find all matches
matches = pattern.findall(text)
JavaScript:
javascript
const pattern = /{pattern}/;
// Test
if (pattern.test(text)) { ... }
// Match
const match = text.match(pattern);
if (match) {
const captured = match[1];
}
// Find all
const matches = [...text.matchAll(/{pattern}/g)];
## Calibration Rules
1. **Correctness over cleverness.** A readable, slightly longer pattern is better than
a cryptic short one. `[A-Za-z0-9]` is clearer than `\w` when you specifically mean
alphanumeric without underscores.
2. **Test negatives as rigorously as positives.** A pattern that matches everything
technically matches all positive examples. Negative examples prevent over-matching.
3. **Anchor when appropriate.** `^\d{3}$` matches exactly 3 digits. `\d{3}` matches
3 digits anywhere in the string. State the anchoring intent explicitly.
4. **Avoid catastrophic backtracking.** Nested quantifiers like `(a+)+` cause exponential
time on non-matching input. Test with adversarial inputs.
5. **Named groups over numbered groups.** `(?P<year>\d{4})` (Python) or `(?<year>\d{4})`
(JS) is self-documenting. Use numbered groups only for simple patterns.
6. **Specify the flavor.** Python `re`, JavaScript, and PCRE have different feature sets.
Lookaheads, lookbehinds, and Unicode support vary.
## Error Handling
| Problem | Resolution |
|---------|------------|
| Insufficient examples | Ask for more. Minimum 3 positive, 3 negative. |
| Contradictory examples | Flag the contradiction. Ask which examples are correct. |
| Requirements too complex for regex | Suggest a parser instead. Regex cannot handle recursive structures (nested brackets, HTML). |
| Pattern causes backtracking | Rewrite with atomic groups or possessive quantifiers. Test with worst-case input. |
| Unicode requirements unclear | Ask if the pattern needs to handle non-ASCII. Default to ASCII unless specified. |
| Multiple valid patterns | Present the simplest one. Mention alternatives if they have meaningful tradeoffs (performance vs readability). |
## When NOT to Build Regex
Push back if:
- The input requires parsing a recursive grammar (HTML, JSON, nested expressions) — use a parser
- The validation is for a standard format with a library (email validation, URL parsing) — use the standard library
- The pattern is for security-critical input validation as the sole defense — regex is a first filter, not a security boundary
- The user wants to modify matched content in complex ways — regex replacement has limits; suggest code instead