Parse, Don't Validate

A parser is a function that consumes less-structured input and produces more-structured output. Validation checks a property and throws it away. Parsing checks a property and preserves it in the type system. Always prefer parsing.

The Core Idea

ts
// VALIDATION: checks a property, returns nothing useful
function validateNonEmpty(list: string[]): void {
  if (list.length === 0) throw new Error("list cannot be empty");
}

// PARSING: checks the same property, returns proof in the type
function parseNonEmpty<T>(list: T[]): [T, ...T[]] {
  if (list.length === 0) throw new Error("list cannot be empty");
  return list as [T, ...T[]];
}
Both check the same thing. But `parseNonEmpty` gives the caller access to what it learned. `validateNonEmpty` throws the knowledge away, forcing every downstream function to either re-check or hope for the best.

The Two Strategies

When a function is partial (not defined for all inputs), there are exactly two ways to make it total:

1. Weaken the output (add Maybe/null)

ts
function head<T>(list: T[]): T | undefined {
  return list[0];
}
Easy to implement, annoying to use. Every caller must handle `undefined` even if they already know the list is non-empty. This leads to redundant checks and `// should never happen` comments.
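As a rough illustration of that cost, here is a minimal sketch. The names `headOrUndefined` and `firstTagUpper` are hypothetical and only serve the example. The caller has already ruled out the empty case, yet the return type still forces a second check.

```typescript
// Strategy 1: the weakened return type forces every caller to handle undefined.
function headOrUndefined<T>(list: T[]): T | undefined {
  return list[0];
}

// Hypothetical caller that already knows the list is non-empty.
function firstTagUpper(tags: string[]): string {
  if (tags.length === 0) throw new Error("tags cannot be empty");

  const first = headOrUndefined(tags);
  if (first === undefined) {
    // should never happen: emptiness was ruled out above, but the type forgot
    throw new Error("unreachable");
  }
  return first.toUpperCase();
}

const upperTag = firstTagUpper(["alpha", "beta"]);
```

The second `if` exists only to satisfy the compiler, which is exactly the kind of branch strategy 2 removes.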

2. Strengthen the input (narrow the type) -- PREFER THIS

ts
function head<T>(list: [T, ...T[]]): T {
  return list[0];
}
The check happens once, at the boundary, when the data enters the system. After that, the type carries the proof. No redundant checks. No impossible branches. If the validation logic changes, the compiler catches every affected call site.
Always try strategy 2 first. Fall back to strategy 1 only when 2 is impractical.
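A small sketch of how the boundary check and the narrowed input fit together. `NonEmptyArray`, `rawHosts`, and `primaryHost` are illustrative names, not part of the original text.

```typescript
type NonEmptyArray<T> = [T, ...T[]];

// Boundary: the one place where emptiness is checked.
function parseNonEmpty<T>(list: T[]): NonEmptyArray<T> {
  if (list.length === 0) throw new Error("list cannot be empty");
  return list as NonEmptyArray<T>;
}

// Strategy 2: the input type is strengthened, so the output needs no undefined.
function head<T>(list: NonEmptyArray<T>): T {
  return list[0];
}

// The raw input is parsed once as it enters the system.
const rawHosts: string[] = ["alpha.example", "beta.example"];
const hosts = parseNonEmpty(rawHosts);

// Downstream code needs no undefined handling: the tuple type guarantees index 0.
const primaryHost: string = head(hosts);

// head(rawHosts) would be rejected by the compiler,
// because string[] is not assignable to [string, ...string[]].
```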

Practical Rules

1. Make illegal states unrepresentable

Use the most precise data structure you reasonably can. Don't model things you shouldn't allow.
ts
// BAD: allows duplicate keys, order might matter or might not
type Config = Array<[string, string]>;

// GOOD: duplicates impossible by construction
type Config = Map<string, string>;

// or even better if keys are known:
type Config = { host: string; port: number; debug: boolean };
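One way to get from the loose representation to the precise one is a single parsing step. The sketch below assumes the raw pairs carry string values for `host`, `port`, and `debug`. The name `parseAppConfig`, the port range, and the default for `debug` are assumptions made for illustration.

```typescript
interface AppConfig { host: string; port: number; debug: boolean }

// Parses loose key/value pairs into the precise record, rejecting duplicates.
function parseAppConfig(pairs: Array<[string, string]>): AppConfig {
  const seen = new Map<string, string>();
  for (const [key, value] of pairs) {
    if (seen.has(key)) throw new Error(`duplicate key: ${key}`);
    seen.set(key, value);
  }

  const host = seen.get("host");
  if (host === undefined || host === "") throw new Error("missing host");

  // Number(undefined) is NaN, so a missing port fails the integer check too.
  const port = Number(seen.get("port"));
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error("invalid port");
  }

  // Assumption: debug is optional and defaults to false.
  const debugRaw = seen.get("debug") ?? "false";
  if (debugRaw !== "true" && debugRaw !== "false") {
    throw new Error("invalid debug flag");
  }

  return { host, port, debug: debugRaw === "true" };
}

const appConfig = parseAppConfig([
  ["host", "localhost"],
  ["port", "8080"],
  ["debug", "true"],
]);
```

After this step, duplicate keys, a missing host, and a non-numeric port are no longer states the rest of the program has to consider.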

2. Push parsing to the boundary

Parse data into precise types as soon as it enters your system. The boundary between your program and the outside world is where parsing belongs.
ts
// BAD: raw data flows deep into the system, validated ad-hoc
function processUser(data: unknown) {
  // 50 lines later...
  if (typeof data.email !== "string") throw new Error("invalid email");
}

// GOOD: parse at the boundary, use precise types everywhere else
interface User { name: string; email: string; age: number; }

function parseUser(data: unknown): User {
  // validate and parse here, once
}

function processUser(user: User) {
  // no validation needed -- the type guarantees it
}
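The body of `parseUser` is left open above. A minimal hand-written sketch might look like the following. The specific rules (non-empty name, an `@` in the email, a non-negative integer age) are assumptions, and the names `ParsedUser`, `isRecord`, `parseUserSketch`, and `greetingFor` are hypothetical.

```typescript
interface ParsedUser { name: string; email: string; age: number }

// Narrow unknown to an indexable object before reading any fields.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Boundary parser: every check on the raw input happens here, once.
function parseUserSketch(data: unknown): ParsedUser {
  if (!isRecord(data)) throw new Error("user must be an object");
  const { name, email, age } = data;
  if (typeof name !== "string" || name === "") throw new Error("invalid name");
  if (typeof email !== "string" || !email.includes("@")) throw new Error("invalid email");
  if (typeof age !== "number" || !Number.isInteger(age) || age < 0) {
    throw new Error("invalid age");
  }
  return { name, email, age };
}

// Downstream code takes the precise type and performs no validation.
function greetingFor(user: ParsedUser): string {
  return `Hello ${user.name} <${user.email}>`;
}

const incoming: unknown = JSON.parse('{"name":"Ada","email":"ada@example.com","age":36}');
const greeting = greetingFor(parseUserSketch(incoming));
```

In a real codebase a schema library could take the place of the hand-written checks. The point here is only where the checks live.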

3. Treat `void`-returning validators with deep suspicion

A function whose primary purpose is checking a property but returns `void` is almost always a missed opportunity. It should return a more precise type instead.
ts
// SUSPICIOUS: checks something, returns nothing
function validateAge(age: number): void {
  // Number.isFinite also rejects NaN, which would slip past the range comparisons
  if (!Number.isFinite(age) || age < 0 || age > 150) throw new Error("invalid age");
}

// BETTER: returns proof of validity as a branded type
type ValidAge = number & { readonly __brand: "ValidAge" };
function parseAge(age: number): ValidAge {
  if (!Number.isFinite(age) || age < 0 || age > 150) throw new Error("invalid age");
  return age as ValidAge;
}
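To show what the branded return type buys downstream, here is a short usage sketch. `canVote` and the threshold of 18 are hypothetical, and the `Number.isFinite` guard is included so that `NaN` is rejected.

```typescript
type ValidAge = number & { readonly __brand: "ValidAge" };

function parseAge(age: number): ValidAge {
  if (!Number.isFinite(age) || age < 0 || age > 150) throw new Error("invalid age");
  return age as ValidAge;
}

// Hypothetical consumer: it demands proof of validity through its parameter type.
function canVote(age: ValidAge): boolean {
  return age >= 18;
}

const checkedAge = parseAge(42);
const eligible = canVote(checkedAge);

// canVote(42) would not compile: a plain number lacks the ValidAge brand.
```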

4. Use branded/opaque types as "fake parsers"

When making an illegal state truly unrepresentable is impractical (e.g., "integer in range 1-100"), use branded types with smart constructors to fake it:
ts
type EmailAddress = string & { readonly __brand: "EmailAddress" };

function parseEmail(input: string): EmailAddress {
  if (!input.includes("@")) throw new Error("invalid email");
  return input as EmailAddress;
}

// Now functions can demand EmailAddress instead of string
function sendEmail(to: EmailAddress, body: string): void { /* ... */ }
The type system won't let you pass a raw `string` where `EmailAddress` is expected. You must go through `parseEmail` first.
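The same pattern covers the "integer in range 1-100" case mentioned above. This is a sketch under assumed names: `Level`, `parseLevel`, and `setVolume` are hypothetical.

```typescript
type Level = number & { readonly __brand: "Level" };

// Smart constructor: the only intended way to obtain a Level.
function parseLevel(input: number): Level {
  if (!Number.isInteger(input) || input < 1 || input > 100) {
    throw new Error(`expected an integer in 1-100, got ${input}`);
  }
  return input as Level;
}

// Consumers can rely on the range without checking it again.
function setVolume(level: Level): string {
  return `volume set to ${level}`;
}

const volumeMessage = setVolume(parseLevel(40));
```

The brand is a compile-time fiction: at runtime a `Level` is still an ordinary number, and an `as Level` cast anywhere else would bypass the check. Keeping the cast inside the constructor is what makes the guarantee hold.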

5. Let types inform code, not vice versa

Don't stick a `boolean` in a record because your current function needs it. Design the types first, then write functions that transform between them.
ts
// BAD: boolean flag controlling behavior
interface Request { url: string; isAuthenticated: boolean; token?: string; }

// GOOD: discriminated union makes invalid state impossible
type Request =
  | { kind: "anonymous"; url: string }
  | { kind: "authenticated"; url: string; token: string };
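A sketch of how consuming code benefits from the union. The type is named `ApiRequest` here only to avoid clashing with the global `Request` type in browser typings, and `buildAuthHeaders` with its `Bearer` header is an illustrative assumption.

```typescript
type ApiRequest =
  | { kind: "anonymous"; url: string }
  | { kind: "authenticated"; url: string; token: string };

// The switch narrows on `kind`, so each branch sees only the fields that exist.
function buildAuthHeaders(req: ApiRequest): Record<string, string> {
  switch (req.kind) {
    case "anonymous":
      // req.token does not exist on this variant; reading it is a compile error.
      return {};
    case "authenticated":
      // The token is guaranteed here: no optional chaining, no fallback.
      return { Authorization: `Bearer ${req.token}` };
  }
}

const anonymousHeaders = buildAuthHeaders({ kind: "anonymous", url: "https://example.com" });
const authHeaders = buildAuthHeaders({
  kind: "authenticated",
  url: "https://example.com",
  token: "secret",
});
```

With the boolean-flag version, `isAuthenticated: true` with a missing `token` type-checks. With the union, that combination cannot be constructed.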

6. Avoid denormalized data

Duplicating the same information in multiple places creates a trivially representable illegal state: the copies getting out of sync. Strive for a single source of truth.
If denormalization is necessary for performance, hide it behind an abstraction boundary where a small, trusted module keeps representations in sync.
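As a sketch of such an abstraction boundary, the hypothetical `OrderLines` class below caches a running total (the denormalized copy) next to the list of prices. Both fields are private, so the only code that can let them drift apart is the class itself.

```typescript
class OrderLines {
  private readonly prices: number[] = [];
  // Denormalized copy of the sum of `prices`, kept for cheap reads.
  private cachedTotal = 0;

  add(priceCents: number): void {
    if (!Number.isInteger(priceCents) || priceCents < 0) {
      throw new Error("invalid price");
    }
    // Both representations are updated together, in one place.
    this.prices.push(priceCents);
    this.cachedTotal += priceCents;
  }

  removeLast(): void {
    const removed = this.prices.pop();
    if (removed !== undefined) this.cachedTotal -= removed;
  }

  total(): number {
    return this.cachedTotal;
  }

  count(): number {
    return this.prices.length;
  }
}

const order = new OrderLines();
order.add(250);
order.add(100);
const totalBeforeRemoval = order.total();
order.removeLast();
const totalAfterRemoval = order.total();
```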

7. Parse in multiple passes if needed

Avoiding shotgun parsing means not acting on data before it is fully parsed. It does not mean you can't use some input data to decide how to parse other input data.
ts
// Fine: first parse the header to determine the format, then parse the body
const header = parseHeader(raw);
const body = parseBody(raw, header.format);
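To make the two-pass idea concrete, here is a sketch over an invented wire format: the first line is `format: json` or `format: csv`, and the rest is the body. The format itself and the names `WireFormat`, `parseHeaderSketch`, and `parseBodySketch` are assumptions for illustration only.

```typescript
type WireFormat = "json" | "csv";
interface ParsedHeader { format: WireFormat }

// Pass 1: parse only the header line to learn how the body is encoded.
function parseHeaderSketch(raw: string): ParsedHeader {
  const newline = raw.indexOf("\n");
  if (newline === -1) throw new Error("missing header line");
  const line = raw.slice(0, newline).trim();
  if (line === "format: json") return { format: "json" };
  if (line === "format: csv") return { format: "csv" };
  throw new Error(`unknown header: ${line}`);
}

// Pass 2: parse the body using what the header told us.
function parseBodySketch(raw: string, format: WireFormat): string[] {
  const text = raw.slice(raw.indexOf("\n") + 1).trim();
  if (format === "csv") {
    return text.split(",").map((cell) => cell.trim());
  }
  const value: unknown = JSON.parse(text);
  if (!Array.isArray(value) || !value.every((v) => typeof v === "string")) {
    throw new Error("expected a JSON array of strings");
  }
  return value as string[];
}

const rawCsv = "format: csv\na, b, c";
const csvCells = parseBodySketch(rawCsv, parseHeaderSketch(rawCsv).format);

const rawJson = 'format: json\n["x","y"]';
const jsonCells = parseBodySketch(rawJson, parseHeaderSketch(rawJson).format);
```

Nothing acts on the cells until both passes have succeeded, so the header-dependent parse is still parsing, not shotgun parsing.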

Shotgun Parsing -- The Anti-Pattern

Shotgun parsing is when validation code is mixed with and spread across processing code. Checks are scattered everywhere, hoping to catch all bad cases without systematic justification.
The danger: if a late-discovered error means some invalid input was already partially processed, you may need to roll back state changes. This is fragile and error-prone.
Parsing avoids this by stratifying the program into two phases:
  1. Parsing phase -- failure due to invalid input can only happen here
  2. Execution phase -- input is known-good, failure modes are minimal
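The two phases can be sketched with a toy ledger. The command syntax (`deposit 100`, `withdraw 30`) and the names `LedgerCommand`, `parseCommand`, and `runLedger` are invented for the example. Every line is parsed before any command is applied, so a malformed line near the end cannot leave the balance half-updated.

```typescript
type LedgerCommand =
  | { kind: "deposit"; amount: number }
  | { kind: "withdraw"; amount: number };

function parseCommand(line: string): LedgerCommand {
  const [verb, amountText] = line.trim().split(/\s+/);
  const amount = Number(amountText);
  if (!Number.isInteger(amount) || amount <= 0) {
    throw new Error(`invalid amount: ${line}`);
  }
  if (verb === "deposit") return { kind: "deposit", amount };
  if (verb === "withdraw") return { kind: "withdraw", amount };
  throw new Error(`unknown command: ${line}`);
}

function runLedger(lines: string[], startingBalance: number): number {
  // Phase 1: parsing. Invalid input can only fail here, before any state changes.
  const commands = lines.map((line) => parseCommand(line));

  // Phase 2: execution. Every command is already known to be well-formed.
  let balance = startingBalance;
  for (const command of commands) {
    balance += command.kind === "deposit" ? command.amount : -command.amount;
  }
  return balance;
}

const finalBalance = runLedger(["deposit 100", "withdraw 30"], 0);
```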

Code Review Checklist

When reviewing code, watch for these smells:
  • Function accepts `string` where a more specific type exists (URL, email, ID)
  • Validation function returns `void` instead of a refined type
  • Same property checked in multiple places (redundant validation)
  • `// should never happen` or `// impossible` comments
  • Raw `unknown` / `any` / `object` flowing past the system boundary into business logic
  • Boolean fields that could be discriminated unions
  • Optional fields that are actually always present after a certain point
  • Arrays where non-empty arrays are required
  • `null` checks deep in business logic for data validated at entry
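For the "optional fields that are always present after a certain point" smell, one possible fix is to give each stage its own type. The sketch below uses hypothetical `DraftArticle` and `PublishedArticle` types. The transition function is the only place where the extra fields come into existence.

```typescript
interface DraftArticle { title: string; body: string }

// After publishing, slug and publishedAt are required, not optional.
interface PublishedArticle extends DraftArticle { slug: string; publishedAt: Date }

// The stage transition: it checks what it needs and returns the stronger type.
function publish(draft: DraftArticle, now: Date): PublishedArticle {
  const slug = draft.title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  if (slug === "") throw new Error("title must contain a letter or digit");
  return { ...draft, slug, publishedAt: now };
}

// Code that runs after publishing never needs to ask whether slug is set.
function permalink(article: PublishedArticle): string {
  return `/articles/${article.slug}`;
}

const published = publish({ title: "Hello World", body: "..." }, new Date(0));
const link = permalink(published);
```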