nodejs-performance

Node.js Performance


Use this workflow to turn Node.js performance/resource investigations into safe, reviewable PRs.

Goals


  • Improve execution time first: reduce p50/p95/p99 latency and increase throughput without changing intended behavior.
  • Reduce CPU, memory, event-loop lag, I/O pressure, or lock contention when it supports execution-time gains.
  • Ship small, isolated changes with measurable impact.

Operating Rules


  • Work on one optimization per PR.
  • Always choose the highest expected-impact task first.
  • Confirm and respect intentional behaviors before changing them.
  • Prefer low-risk changes in high-frequency paths.
  • Prioritize request/job execution-path work over bootstrap/startup micro-optimizations unless startup is on the critical path at scale.
  • Include evidence: targeted tests + before/after benchmark.

Impact-First Selection


Before coding, rank candidates using this score:
priority = (frequency x blast_radius x expected_gain) / (risk x effort)
Use 1-5 for each factor:
  • frequency: how often the path runs in production.
  • blast_radius: how many requests/jobs/users are affected.
  • expected_gain: estimated latency/resource improvement.
  • risk: probability of a behavior regression.
  • effort: engineering time and change surface area.
Pick the top-ranked candidate, then validate with a baseline measurement.
If two candidates have similar scores, pick the one with the clearer end-to-end execution-time impact.
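The scoring formula above can be sketched as a small ranking helper. The factor names and the 1-5 scale come straight from the formula; the candidate entries themselves are hypothetical examples, not real measurements.

```javascript
// Rank optimization candidates by the impact-first score:
// priority = (frequency * blast_radius * expected_gain) / (risk * effort)
// Each factor is scored 1-5. Candidate names/values are hypothetical.
function priorityScore({ frequency, blastRadius, expectedGain, risk, effort }) {
  return (frequency * blastRadius * expectedGain) / (risk * effort);
}

const candidates = [
  { name: 'json-revalidate-per-request', frequency: 5, blastRadius: 5, expectedGain: 3, risk: 2, effort: 2 },
  { name: 'startup-config-parse',        frequency: 1, blastRadius: 2, expectedGain: 4, risk: 1, effort: 2 },
];

const ranked = candidates
  .map((c) => ({ ...c, score: priorityScore(c) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.map((c) => `${c.name}: ${c.score}`).join('\n'));
```

Note that a high-frequency, wide-blast-radius candidate beats a higher-gain but rarely-run one; that is the intended bias of the formula.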

Prioritization Targets


Start with code that runs on every request/job/task:
  • Request/job wrappers and middleware.
  • Retry/timeout/circuit-breaker code.
  • Connection pools (DB/Redis/HTTP) and socket reuse.
  • Stream/pipeline transformations and buffering.
  • Serialization/deserialization hot paths (JSON, parsers, schema validation).
  • Queue consumers, schedulers, and worker dispatch.
  • Event listener attach/detach lifecycle and cleanup logic.
Deprioritize unless justified by production profile:
  • One-time startup/bootstrap code.
  • Rare admin/debug-only flows.
  • Teardown paths that are not on the steady-state critical path.

Common Hot-Path Smells


  • Recomputing invariant values per invocation.
  • Re-parsing code/AST repeatedly.
  • Duplicate async lookups returning the same value.
  • Per-call heavy object allocation in common-case parsing.
  • Unnecessary awaits in teardown/close/dispose paths.
  • Missing fast paths for dominant input shapes.
  • Unbounded retries or retry storms under degraded dependencies.
  • Excessive concurrency causing memory spikes or downstream saturation.
  • Work done for logging/telemetry/metrics formatting even when disabled.
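Two of the smells above, recomputing invariant values and duplicate async lookups, share a common fix: memoize the in-flight promise so concurrent callers for the same key trigger one underlying fetch. A minimal sketch (`fetchConfig` is a hypothetical stand-in for a DB or remote lookup):

```javascript
// Coalesce concurrent lookups for the same key into one in-flight promise,
// so N simultaneous callers cause 1 underlying fetch instead of N.
const inFlight = new Map();

function dedupedLookup(key, loader) {
  if (inFlight.has(key)) return inFlight.get(key);
  const p = loader(key).finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// Hypothetical loader that would normally hit a DB or remote service.
let calls = 0;
async function fetchConfig(key) {
  calls += 1;
  return { key, value: 'v1' };
}

async function main() {
  const [a, b] = await Promise.all([
    dedupedLookup('tenant-42', fetchConfig),
    dedupedLookup('tenant-42', fetchConfig),
  ]);
  console.log(calls, a === b); // one underlying call, both callers share the result
}
main();
```

The `.finally` cleanup keeps this a request-scoped dedupe rather than a cache; add size/TTL controls before turning it into a long-lived cache.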

Execution Workflow


  1. Pick one candidate
  • Rank candidates and pick the highest priority score.
  • Explain the issue in one sentence.
  • State expected impact (CPU, latency, memory, event-loop lag, I/O, contention).
  2. Prove it is hot
  • Add a focused micro-benchmark or scenario benchmark.
  • Capture baseline numbers before editing.
  • Prefer scenario benchmarks that include real request/job flow when the goal is execution-time improvement.
  • For resource issues, capture process metrics (rss, heap, FD count, event-loop delay).
  3. Design minimal fix
  • Keep behavior-compatible defaults.
  • Add a fallback path for edge cases.
  • Avoid broad refactors in the same PR.
  4. Implement
  • Make the smallest patch that removes repeated work.
  • Keep interfaces stable unless change is necessary.
  5. Test
  • Add/adjust targeted tests for new behavior and regressions.
  • Run relevant package tests (not only the whole monorepo by default).
  • Add concurrency/degradation tests when the bug appears only under load.
  6. Benchmark again
  • Re-run the same benchmark with the same parameters.
  • Report absolute and relative deltas.
  • Include latency deltas first (p50/p95/p99, throughput), then resource deltas when applicable.
  7. Package PR
  • Branch naming: codex/perf-<area>-<change>.
  • Commit message: perf(<package>): <what changed>.
  • Include risk notes and rollback simplicity.
  8. Iterate
  • Wait for review, then move to the next isolated improvement.

Benchmarking Guidance

基准测试指南

  • Keep benchmark scope narrow to isolate one change.
  • Use warmup iterations.
  • Measure both:
    • micro: operation-level overhead.
    • scenario: request/job flow, concurrency, and degraded dependency conditions.
  • For execution-time work, scenario numbers are the decision-maker; micro numbers are supporting evidence.
  • Always print:
    • total time
    • per-op time
    • p50/p95/p99 latency when applicable
    • speedup ratio
    • iterations and workload shape
    • resource counters (rss, heap, handles, event-loop delay) when relevant

Resource Exhaustion Checklist


  • Cap concurrency at each boundary (ingress, queue, downstream clients).
  • Ensure timeout + cancellation are wired end-to-end.
  • Ensure retries are bounded and jittered.
  • Confirm listeners/timers/intervals are always cleaned up.
  • Confirm streams are closed/destroyed on success and error paths.
  • Confirm object caches have size/TTL controls.
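Two items on this checklist, bounded jittered retries and a concurrency cap at a boundary, can be sketched as follows. The attempt counts, delays, and limit values are illustrative, not recommendations.

```javascript
// Bounded retry with full jitter, plus a simple concurrency limiter.
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function retryWithJitter(fn, { maxAttempts = 3, baseDelayMs = 50 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Full jitter: random delay in [0, base * 2^attempt) to avoid retry storms.
      await sleep(Math.random() * baseDelayMs * 2 ** attempt);
    }
  }
  throw lastErr; // bounded: give up after maxAttempts
}

// Cap concurrency at a boundary: at most `limit` tasks run at once;
// the rest queue until a slot frees up.
function createLimiter(limit) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= limit || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => { active--; next(); });
  };
  return (task) => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    next();
  });
}
```

In production code, libraries such as p-limit provide the same capping behavior; the point here is that every ingress, queue consumer, and downstream client should pass through some such bound.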

CI / Flake Handling


  • If CI-only failures appear, add temporary diagnostic payloads in tests.
  • Serialize only affected flaky tests when resource contention is the cause.
  • Keep determinism improvements in test code, not production code, unless required.

Output Template


For each PR, report:
  1. Issue being fixed.
  2. Why it matters under load.
  3. Code locations changed.
  4. Tests run and results.
  5. Benchmark before/after numbers (execution first: p50/p95/p99 and throughput).
  6. Risk assessment.
  7. Next candidate optimization.