Performance Engineering


Evidence-based performance optimization: measure → profile → optimize → validate.
<when_to_use>
  • Profiling slow code paths or bottlenecks
  • Identifying memory leaks or excessive allocations
  • Optimizing latency-critical operations (P95, P99)
  • Benchmarking competing implementations
  • Database query optimization
  • Reducing CPU usage in hot paths
  • Improving throughput (RPS, ops/sec)
NOT for: premature optimization, optimization without measurement, guessing at bottlenecks
</when_to_use>
<iron_law>
NO OPTIMIZATION WITHOUT MEASUREMENT
Required workflow:
  1. Measure baseline performance with realistic workload
  2. Profile to identify actual bottleneck
  3. Optimize the bottleneck (not what you think is slow)
  4. Measure again to verify improvement
  5. Document gains and tradeoffs
Optimizing unmeasured code wastes time and introduces bugs.
</iron_law>
<stages>
Load the maintain-tasks skill for stage tracking:
Stage 1: Establishing baseline
  • content: "Establish performance baseline with realistic workload"
  • activeForm: "Establishing performance baseline"
Stage 2: Profiling bottlenecks
  • content: "Profile code to identify actual bottlenecks"
  • activeForm: "Profiling code to identify bottlenecks"
Stage 3: Analyzing root cause
  • content: "Analyze profiling data to determine root cause"
  • activeForm: "Analyzing profiling data"
Stage 4: Implementing optimization
  • content: "Implement targeted optimization for identified bottleneck"
  • activeForm: "Implementing optimization"
Stage 5: Validating improvement
  • content: "Measure performance gains and verify no regressions"
  • activeForm: "Validating performance improvement"
</stages> <metrics>

Key Performance Indicators


Latency (response time):
  • P50 (median) — typical case
  • P95 — most users
  • P99 — tail latency
  • P99.9 — outliers
  • TTFB — time to first byte
  • TTLB — time to last byte
Throughput:
  • RPS — requests per second
  • ops/sec — operations per second
  • bytes/sec — data transfer rate
  • queries/sec — database throughput
Memory:
  • Heap usage — allocated memory
  • GC frequency — garbage collection pauses
  • GC duration — stop-the-world time
  • Allocation rate — memory churn
  • Resident set size (RSS) — total memory
CPU:
  • CPU time — total compute
  • Wall time — elapsed time
  • Hot paths — frequently executed code
  • Time complexity — algorithmic efficiency
  • CPU utilization — percentage used
Always measure:
  • Before optimization (baseline)
  • After optimization (improvement)
  • Under realistic load (not toy data)
  • Multiple runs (account for variance)
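The tail-latency metrics above can be computed directly from raw samples. A minimal sketch (the `percentile` helper and the nearest-rank method are illustrative choices, not a library API):

```typescript
// Nearest-rank percentile over a sorted copy of the samples
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

const latenciesMs = [12, 15, 11, 90, 14, 13, 250, 16, 12, 14]
console.log(`P50: ${percentile(latenciesMs, 50)}ms`) // 14
console.log(`P95: ${percentile(latenciesMs, 95)}ms`) // 250
```

Note how one slow outlier (250ms) dominates P95 while leaving P50 untouched — this is why medians alone hide tail latency.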
</metrics>
<profiling_tools>

TypeScript/Bun


Built-in timing:

```typescript
console.time('operation')
// ... code to measure
console.timeEnd('operation')

// High precision
const start = Bun.nanoseconds()
// ... code to measure
const elapsed = Bun.nanoseconds() - start
console.log(`Took ${elapsed / 1_000_000}ms`)
```

Performance API:

```typescript
performance.mark('start')
// ... code to measure
performance.mark('end')
performance.measure('operation', 'start', 'end')
const measure = performance.getEntriesByName('operation')[0]
console.log(`Duration: ${measure.duration}ms`)
```
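Since single timings are noisy, it helps to wrap the pattern above in a small helper that reports the median of several runs. A sketch (`timeIt` is a hypothetical helper, not a built-in):

```typescript
// Run `fn` several times and report the median wall time;
// the median is more robust to scheduling noise than a single run.
function timeIt(fn: () => void, runs = 10): number {
  const times: number[] = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    fn()
    times.push(performance.now() - start)
  }
  times.sort((a, b) => a - b)
  return times[Math.floor(times.length / 2)]
}

const medianMs = timeIt(() => {
  let sum = 0
  for (let i = 0; i < 100_000; i++) sum += i
})
console.log(`median: ${medianMs.toFixed(3)}ms`)
```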
Memory profiling:
  • Chrome DevTools → Memory tab → heap snapshots
  • Node.js `--inspect` flag + Chrome DevTools
  • `process.memoryUsage()` for RSS/heap tracking
CPU profiling:
  • Chrome DevTools → Performance tab → record session
  • Node.js `--prof` flag + `node --prof-process`
  • Flamegraphs for visualization

Rust


Benchmarking (criterion benches live in `benches/my_bench.rs` with `harness = false` in Cargo.toml, not inside a `#[cfg(test)]` module):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_function(c: &mut Criterion) {
    c.bench_function("my_function", |b| {
        b.iter(|| my_function(black_box(42)))
    });
}

criterion_group!(benches, benchmark_function);
criterion_main!(benches);
```
Profiling:
  • `cargo bench` — criterion benchmarks
  • `perf record` + `perf report` — Linux profiling
  • `cargo flamegraph` — visual flamegraphs
  • `cargo bloat` — binary size analysis
  • `valgrind --tool=callgrind` — detailed profiling
  • `heaptrack` — memory profiling
Instrumentation:

```rust
use std::time::Instant;

let start = Instant::now();
// ... code to measure
let duration = start.elapsed();
println!("Took: {:?}", duration);
```
</profiling_tools>
<optimization_patterns>

Algorithm Improvements


Time complexity:
  • O(n²) → O(n log n) — sorting, searching
  • O(n) → O(log n) — binary search, trees
  • O(n) → O(1) — hash maps, memoization
Space-time tradeoffs:
  • Cache computed results (memoization)
  • Precompute expensive operations
  • Index data for faster lookup
  • Use hash maps for O(1) access
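A minimal sketch of the O(n) → O(1) memoization tradeoff above (the `memoize` helper is illustrative, not a library API):

```typescript
// Cache results of a pure single-argument function:
// trade memory (the Map) for time (O(1) lookup on repeat calls).
function memoize<A, R>(fn: (arg: A) => R): (arg: A) => R {
  const cache = new Map<A, R>()
  return (arg: A) => {
    if (!cache.has(arg)) cache.set(arg, fn(arg))
    return cache.get(arg)!
  }
}

let calls = 0
const slowSquare = (n: number) => { calls++; return n * n }
const fastSquare = memoize(slowSquare)

fastSquare(7)
fastSquare(7)
console.log(calls) // the underlying function ran only once
```

Memoization only pays off when inputs repeat and the function is pure; for unbounded input domains, bound the cache (e.g. LRU) to avoid trading a CPU problem for a memory one.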

Memory Optimization


Reduce allocations:
```typescript
// Bad: creates new array each iteration
for (const item of items) {
  const results = []
  results.push(process(item))
}

// Good: reuse array
const results = []
for (const item of items) {
  results.push(process(item))
}
```
```rust
// Bad: allocates String every time
fn format_user(name: &str) -> String {
    format!("User: {}", name)
}

// Good: reuses buffer
fn format_user(name: &str, buf: &mut String) {
    buf.clear();
    buf.push_str("User: ");
    buf.push_str(name);
}
```
Memory pooling:
  • Reuse expensive objects (connections, buffers)
  • Object pools for frequently allocated types
  • Arena allocators for batch allocations
Lazy evaluation:
  • Compute only when needed
  • Stream processing vs loading all data
  • Iterators over materialized collections
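The iterator point can be sketched with generators — only the requested elements are ever computed, even from an infinite source (`mapLazy` and `take` are illustrative helpers, not a library):

```typescript
// Lazily map values one at a time instead of materializing a whole array
function* mapLazy<T, U>(items: Iterable<T>, fn: (t: T) => U): Generator<U> {
  for (const item of items) yield fn(item)
}

// Stop pulling from the source after n elements
function* take<T>(iter: Iterable<T>, n: number): Generator<T> {
  if (n <= 0) return
  let i = 0
  for (const item of iter) {
    yield item
    if (++i >= n) return
  }
}

function* naturals(): Generator<number> {
  for (let i = 1; ; i++) yield i // infinite source
}

let mapped = 0
const first3 = [...take(mapLazy(naturals(), n => { mapped++; return n * 2 }), 3)]
console.log(first3, mapped) // only 3 elements were ever mapped
```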

I/O Optimization


Batching:
  • Batch API calls (1 request vs 100)
  • Batch database writes (bulk insert)
  • Batch file operations (single write vs many)
Caching:
  • Cache expensive computations
  • Cache database queries (Redis, in-memory)
  • Cache API responses (HTTP caching)
  • Invalidate stale cache entries
Async I/O:
  • Non-blocking operations (async/await)
  • Concurrent requests (Promise.all, tokio::spawn)
  • Connection pooling (reuse connections)
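The batching pattern can be sketched synchronously — `BatchWriter` is a hypothetical helper, and `flushFn` stands in for a bulk insert or single syscall:

```typescript
// Accumulate small writes and flush them as one large operation.
class BatchWriter<T> {
  private buffer: T[] = []
  public flushes = 0
  constructor(private flushFn: (batch: T[]) => void, private maxSize = 100) {}

  write(item: T) {
    this.buffer.push(item)
    if (this.buffer.length >= this.maxSize) this.flush()
  }

  flush() {
    if (this.buffer.length === 0) return
    this.flushFn(this.buffer)
    this.buffer = []
    this.flushes++
  }
}

let written = 0
const writer = new BatchWriter<number>(batch => { written += batch.length }, 100)
for (let i = 0; i < 250; i++) writer.write(i)
writer.flush() // flush the remainder
console.log(written, writer.flushes) // 250 items in 3 flushes
```

A real implementation would also flush on a timer so a partially filled batch is not delayed indefinitely.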

Database Optimization


Query optimization:
  • Add indexes for common queries
  • Use EXPLAIN/EXPLAIN ANALYZE
  • Avoid N+1 queries (use joins or batch loading)
  • Select only needed columns
  • Filter at database level (WHERE vs client filter)
Schema design:
  • Normalize to reduce duplication
  • Denormalize for read-heavy workloads
  • Partition large tables
  • Use appropriate data types
Connection management:
  • Connection pooling (don't create per request)
  • Prepared statements (avoid SQL parsing)
  • Transaction batching (reduce round trips)
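The N+1 point above can be made concrete with a toy model — the Map stands in for a database table, and each query function for one round trip:

```typescript
// Hypothetical in-memory "table"; each query function counts one round trip.
const users = new Map([[1, "ada"], [2, "grace"], [3, "edsger"]])
let roundTrips = 0

function queryOne(id: number): string | undefined {
  roundTrips++
  return users.get(id)
}
function queryMany(ids: number[]): (string | undefined)[] {
  roundTrips++ // one round trip for the whole batch, like WHERE id IN (...)
  return ids.map(id => users.get(id))
}

const ids = [1, 2, 3]
// N+1 style: one query per id
ids.map(id => queryOne(id))
const afterNPlus1 = roundTrips // 3
// Batched: a single query for all ids
queryMany(ids)
console.log(afterNPlus1, roundTrips) // 3 round trips vs 1 more
```

With real network latency per round trip, the batched form's advantage grows linearly with N.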
</optimization_patterns>
<workflow>
Loop: Measure → Profile → Analyze → Optimize → Validate
  1. Define performance goal — target metric (e.g., P95 < 100ms)
  2. Establish baseline — measure current performance under realistic load
  3. Profile systematically — identify actual bottleneck (not guesses)
  4. Analyze root cause — understand why code is slow
  5. Design optimization — plan targeted improvement
  6. Implement optimization — make focused change
  7. Measure improvement — verify gains, check for regressions
  8. Document results — record baseline, optimization, gains, tradeoffs
At each step:
  • Document measurements with methodology
  • Note profiling tool output
  • Track optimization attempts (what worked/failed)
  • Update performance documentation
</workflow> <validation>
Before declaring optimization complete:
Check gains:
  • ✓ Measured improvement meets target?
  • ✓ Improvement statistically significant?
  • ✓ Tested under realistic load?
  • ✓ Multiple runs confirm consistency?
Check regressions:
  • ✓ No degradation in other metrics?
  • ✓ Memory usage still acceptable?
  • ✓ Code complexity still manageable?
  • ✓ Tests still pass?
Check documentation:
  • ✓ Baseline measurements recorded?
  • ✓ Optimization approach explained?
  • ✓ Gains quantified with numbers?
  • ✓ Tradeoffs documented?
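A quick variance check before trusting a measured gain — a sketch of the idea, not a full significance test:

```typescript
// Sample mean and standard deviation (Bessel-corrected) over run timings.
function stats(samples: number[]) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (samples.length - 1)
  return { mean, stddev: Math.sqrt(variance) }
}

const baselineMs = [102, 98, 101, 99, 100]
const optimizedMs = [81, 79, 80, 82, 78]
const b = stats(baselineMs)
const o = stats(optimizedMs)

// Crude sanity check: the gap should dwarf the combined run-to-run noise.
const gap = b.mean - o.mean
console.log(gap > 2 * (b.stddev + o.stddev) ? "likely real" : "inconclusive")
```

For rigorous comparisons, use a proper test (e.g. Mann-Whitney U or a t-test) or a benchmarking harness that reports confidence intervals.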
</validation> <rules>
ALWAYS:
  • Measure before optimizing (baseline)
  • Profile to find actual bottleneck
  • Use realistic workload (not toy data)
  • Measure multiple runs (account for variance)
  • Document baseline and improvements
  • Check for regressions in other metrics
  • Consider readability vs performance tradeoff
  • Verify statistical significance
NEVER:
  • Optimize without measuring first
  • Guess at bottleneck without profiling
  • Benchmark with unrealistic data
  • Trust single-run measurements
  • Skip documentation of results
  • Sacrifice correctness for speed
  • Optimize without clear performance goal
  • Ignore algorithmic improvements
</rules> <references>
Methodology:
  • benchmarking.md — rigorous benchmarking methodology
Related skills:
  • codebase-recon — evidence-based investigation (foundation)
  • debugging — structured bug investigation
  • typescript-dev — correctness before performance
</references>