parser-development

Purpose

Use this skill when creating or modifying Biome's parsers. Covers grammar authoring with ungrammar, lexer implementation, error recovery strategies, and list parsing patterns.

Prerequisites

  1. Install required tools: `just install-tools`
  2. Understand the language syntax you're implementing
  3. Read `crates/biome_parser/CONTRIBUTING.md` for detailed concepts

Common Workflows

Create Grammar for New Language

Create a `.ungram` file in `xtask/codegen/` (e.g., `html.ungram`):

```
// html.ungram
// Legend:
//   Name =                -- non-terminal definition
//   'ident'               -- token (terminal)
//   A B                   -- sequence
//   A | B                 -- alternation
//   A*                    -- zero or more repetition
//   (A (',' A)* ','?)     -- repetition with separator and optional trailing comma
//   A?                    -- zero or one repetition
//   label:A               -- suggested name for field

HtmlRoot = HtmlElementList

HtmlElement =
  '<'
  tag_name: HtmlName
  attributes: HtmlAttributeList
  '>'
  children: HtmlElementList
  '<' '/' close_tag_name: HtmlName '>'

HtmlElementList = HtmlElement*

HtmlAttributeList = AnyHtmlAttribute*

AnyHtmlAttribute =
  HtmlSimpleAttribute
  | HtmlBogusAttribute

HtmlSimpleAttribute =
  name: HtmlName
  '='
  value: HtmlString

HtmlBogusAttribute = /* error recovery node */
```

Naming conventions:
  • Prefix all nodes with the language name: `HtmlElement`, `CssRule`
  • Unions start with `Any`: `AnyHtmlAttribute` (see the sketch after this list)
  • Error recovery nodes use `Bogus`: `HtmlBogusAttribute`
  • Lists end with `List`: `HtmlAttributeList`
  • Lists are mandatory (never optional), empty by default
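
To make the `Any` prefix concrete: a union such as `AnyHtmlAttribute` becomes an ordinary Rust enum in the generated syntax crate. The sketch below only illustrates the shape; the real definitions (with their derives and helper impls) are produced by `just gen-grammar` into `biome_html_syntax/src/generated/`:

```rust
// Illustrative sketch, not the literal codegen output: each grammar union
// turns into an enum with one variant per alternative.
pub enum AnyHtmlAttribute {
    HtmlSimpleAttribute(HtmlSimpleAttribute),
    HtmlBogusAttribute(HtmlBogusAttribute),
}
```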

Generate Parser from Grammar

```shell
# Generate for specific language
just gen-grammar html

# Generate for multiple languages
just gen-grammar html css

# Generate all grammars
just gen-grammar
```

This creates:
- `biome_html_syntax/src/generated/` - Node definitions
- `biome_html_factory/src/generated/` - Node construction helpers
- Parser skeleton files (you'll implement the actual parsing logic)

Implement a Lexer

Create `lexer/mod.rs` in your parser crate:

```rust
use biome_html_syntax::HtmlSyntaxKind;
use biome_parser::{lexer::Lexer, ParseDiagnostic};

pub(crate) struct HtmlLexer<'source> {
    source: &'source str,
    position: usize,
    current_kind: HtmlSyntaxKind,
    diagnostics: Vec<ParseDiagnostic>,
}

impl<'source> Lexer<'source> for HtmlLexer<'source> {
    const NEWLINE: Self::Kind = HtmlSyntaxKind::NEWLINE;
    const WHITESPACE: Self::Kind = HtmlSyntaxKind::WHITESPACE;
    
    type Kind = HtmlSyntaxKind;
    type LexContext = ();
    type ReLexContext = ();

    fn source(&self) -> &'source str {
        self.source
    }

    fn current(&self) -> Self::Kind {
        self.current_kind
    }

    fn position(&self) -> usize {
        self.position
    }

    fn advance(&mut self, _context: Self::LexContext) -> Self::Kind {
        // Implement token scanning logic (see the dispatch sketch below)
        let kind = self.read_next_token();
        self.current_kind = kind;
        kind
    }
    
    // Implement other required methods...
}
```
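
`read_next_token` above is not part of the `Lexer` trait; it is a private helper you write yourself. Below is a minimal sketch of such a byte dispatch, assuming token kinds like `L_ANGLE`, `EQ`, and `HTML_LITERAL` exist in your generated `HtmlSyntaxKind` (they are placeholders here). Real lexers also need to handle multi-byte characters, strings, and comments:

```rust
impl<'source> HtmlLexer<'source> {
    /// Hypothetical dispatch helper: inspect the current byte and scan one token.
    fn read_next_token(&mut self) -> HtmlSyntaxKind {
        let Some(byte) = self.source.as_bytes().get(self.position).copied() else {
            return HtmlSyntaxKind::EOF;
        };
        match byte {
            b'<' => self.consume_single(HtmlSyntaxKind::L_ANGLE),
            b'>' => self.consume_single(HtmlSyntaxKind::R_ANGLE),
            b'=' => self.consume_single(HtmlSyntaxKind::EQ),
            b' ' | b'\t' | b'\n' | b'\r' => {
                // Consume a run of whitespace as one trivia token.
                while matches!(
                    self.source.as_bytes().get(self.position),
                    Some(b' ' | b'\t' | b'\n' | b'\r')
                ) {
                    self.position += 1;
                }
                HtmlSyntaxKind::WHITESPACE
            }
            // Placeholder fallback: real lexers scan names, text, strings, etc.
            _ => self.consume_single(HtmlSyntaxKind::HTML_LITERAL),
        }
    }

    /// Consume exactly one byte and return the given token kind.
    fn consume_single(&mut self, kind: HtmlSyntaxKind) -> HtmlSyntaxKind {
        self.position += 1;
        kind
    }
}
```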

Implement Token Source

```rust
use biome_parser::lexer::BufferedLexer;
// Trait implemented below; adjust the path if it differs in your biome_parser version.
use biome_parser::token_source::TokenSourceWithBufferedLexer;
use biome_html_syntax::HtmlSyntaxKind;
use crate::lexer::HtmlLexer;

pub(crate) struct HtmlTokenSource<'src> {
    lexer: BufferedLexer<HtmlSyntaxKind, HtmlLexer<'src>>,
}

impl<'source> TokenSourceWithBufferedLexer<HtmlLexer<'source>> for HtmlTokenSource<'source> {
    fn lexer(&mut self) -> &mut BufferedLexer<HtmlSyntaxKind, HtmlLexer<'source>> {
        &mut self.lexer
    }
}
```
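
The token source typically also needs a constructor that builds the lexer and wraps it in the buffered lexer. A minimal sketch, assuming your `HtmlLexer` exposes a `from_str` constructor (a name used here only for illustration) and that `BufferedLexer::new` is the wrapping constructor; verify both against `biome_parser` and an existing parser crate:

```rust
impl<'src> HtmlTokenSource<'src> {
    /// Sketch of a constructor; the helper names are assumptions, adjust to your crate.
    pub(crate) fn from_str(source: &'src str) -> Self {
        // Hypothetical constructor on HtmlLexer that starts lexing at offset 0.
        let lexer = HtmlLexer::from_str(source);
        Self {
            lexer: BufferedLexer::new(lexer),
        }
    }
}
```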

Write Parse Rules

Example: parsing an `if` statement:

```rust
use biome_parser::prelude::*;
use biome_js_syntax::JsSyntaxKind::*;

fn parse_if_statement(p: &mut JsParser) -> ParsedSyntax {
    // Presence test - return Absent if not at 'if'
    if !p.at(T![if]) {
        return Absent;
    }

    let m = p.start();

    // Parse required tokens
    p.expect(T![if]);
    p.expect(T!['(']);
    
    // Parse required nodes with error recovery
    parse_any_expression(p).or_add_diagnostic(p, expected_expression);
    
    p.expect(T![')']);
    parse_block_statement(p).or_add_diagnostic(p, expected_block);
    
    // Parse optional else clause
    if p.at(T![else]) {
        parse_else_clause(p).ok();
    }

    Present(m.complete(p, JS_IF_STATEMENT))
}
```

Parse Lists with Error Recovery

Use `ParseSeparatedList` for comma-separated lists:

```rust
struct ArrayElementsList;

impl ParseSeparatedList for ArrayElementsList {
    type ParsedElement = CompletedMarker;

    fn parse_element(&mut self, p: &mut Parser) -> ParsedSyntax<Self::ParsedElement> {
        parse_array_element(p)
    }

    fn is_at_list_end(&self, p: &mut Parser) -> bool {
        // Stop at array closing bracket or file end
        p.at(T![']']) || p.at(EOF)
    }

    fn recover(
        &mut self,
        p: &mut Parser,
        parsed_element: ParsedSyntax<Self::ParsedElement>,
    ) -> RecoveryResult {
        parsed_element.or_recover(
            p,
            &ParseRecoveryTokenSet::new(
                JS_BOGUS_EXPRESSION,
                token_set![T![']'], T![,]]
            ),
            expected_array_element,
        )
    }
    
    fn separating_element_kind(&mut self) -> JsSyntaxKind {
        T![,]
    }
}

// Use the list parser
fn parse_array_elements(p: &mut Parser) -> CompletedMarker {
    let m = p.start();
    ArrayElementsList.parse_list(p);
    m.complete(p, JS_ARRAY_ELEMENT_LIST)
}
```
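
Lists without separators (such as `HtmlAttributeList` from the grammar above) follow the same pattern through the companion `ParseNodeList` trait in `biome_parser`. The sketch below mirrors the simplified style of the example above and assumes the trait exposes the same three methods minus the separator one; `parse_attribute`, `expected_attribute`, and the HTML kinds are placeholder names, so check `crates/biome_parser` for the exact trait definition:

```rust
struct AttributeList;

impl ParseNodeList for AttributeList {
    type ParsedElement = CompletedMarker;

    fn parse_element(&mut self, p: &mut Parser) -> ParsedSyntax<Self::ParsedElement> {
        // Placeholder: parse a single attribute.
        parse_attribute(p)
    }

    fn is_at_list_end(&self, p: &mut Parser) -> bool {
        // Attributes end at the tag's closing '>' or at end of file.
        p.at(T![>]) || p.at(EOF)
    }

    fn recover(
        &mut self,
        p: &mut Parser,
        parsed_element: ParsedSyntax<Self::ParsedElement>,
    ) -> RecoveryResult {
        parsed_element.or_recover(
            p,
            &ParseRecoveryTokenSet::new(HTML_BOGUS_ATTRIBUTE, token_set![T![>]]),
            expected_attribute,
        )
    }
}
```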

Implement Error Recovery

Error recovery wraps invalid tokens in `BOGUS` nodes:

```rust
// Recovery set includes:
// - List terminator tokens (e.g., ']', '}')
// - Statement terminators (e.g., ';')
// - List separators (e.g., ',')
let recovery_set = token_set![T![']'], T![,], T![;]];

parsed_element.or_recover(
    p,
    &ParseRecoveryTokenSet::new(JS_BOGUS_EXPRESSION, recovery_set),
    expected_expression_error,
)
```

Handle Conditional Syntax

For syntax only valid in certain contexts (e.g., strict mode):

```rust
fn parse_with_statement(p: &mut Parser) -> ParsedSyntax {
    if !p.at(T![with]) {
        return Absent;
    }

    let m = p.start();
    p.bump(T![with]);
    parenthesized_expression(p).or_add_diagnostic(p, expected_expression);
    parse_statement(p).or_add_diagnostic(p, expected_statement);
    
    let with_stmt = m.complete(p, JS_WITH_STATEMENT);

    // Mark as invalid in strict mode
    let conditional = StrictMode.excluding_syntax(p, with_stmt, |p, marker| {
        p.err_builder(
            "`with` statements are not allowed in strict mode",
            marker.range(p)
        )
    });

    Present(conditional.or_invalid_to_bogus(p))
}
```

Test Parser

Create test files in `tests/`:

```
crates/biome_html_parser/tests/
├── html_specs/
│   ├── ok/
│   │   ├── simple_element.html
│   │   └── nested_elements.html
│   └── error/
│       ├── unclosed_tag.html
│       └── invalid_syntax.html
└── html_test.rs
```

Run tests:

```shell
cd crates/biome_html_parser
cargo test
```

Tips

  • Presence test: Always return `Absent` if the first token doesn't match - never progress parsing before returning `Absent`
  • Required vs optional: Use `p.expect()` for required tokens, `p.eat()` for optional ones
  • Missing markers: Use `.or_add_diagnostic()` for required nodes to add missing markers and errors
  • Error recovery: Include list terminators, separators, and statement boundaries in recovery sets
  • Bogus nodes: Check the grammar for which `BOGUS_*` node types are valid in your context
  • Checkpoints: Use `p.checkpoint()` to save state and `p.rewind()` if parsing fails
  • Lookahead: Use `p.at()` to check tokens, `p.nth_at()` for lookahead beyond the current token
  • Lists are mandatory: Always create list nodes even if empty - use `parse_list()`, not `parse_list().ok()`

Common Patterns

```rust
// Optional token
if p.eat(T![async]) {
    // handle async
}

// Required token with error
p.expect(T!['{']);

// Optional node
parse_type_annotation(p).ok();

// Required node with error
parse_expression(p).or_add_diagnostic(p, expected_expression);

// Lookahead
if p.at(T![if]) || p.at(T![for]) {
    // handle control flow
}

// Checkpoint for backtracking
let checkpoint = p.checkpoint();
if parse_something(p).is_absent() {
    p.rewind(checkpoint);
    parse_something_else(p);
}
```

References

  • Full guide: `crates/biome_parser/CONTRIBUTING.md`
  • Grammar examples: `xtask/codegen/*.ungram`
  • Parser examples: `crates/biome_js_parser/src/syntax/`
  • Error recovery: Search for `ParseRecoveryTokenSet` in existing parsers