Performance Engineering


Evidence-based performance optimization: measure → profile → optimize → validate.
<when_to_use>
  • Profiling slow code paths or bottlenecks
  • Identifying memory leaks or excessive allocations
  • Optimizing latency-critical operations (P95, P99)
  • Benchmarking competing implementations
  • Database query optimization
  • Reducing CPU usage in hot paths
  • Improving throughput (RPS, ops/sec)
NOT for: premature optimization, optimization without measurement, guessing at bottlenecks
</when_to_use>
<iron_law>
NO OPTIMIZATION WITHOUT MEASUREMENT
Required workflow:
  1. Measure baseline performance with realistic workload
  2. Profile to identify actual bottleneck
  3. Optimize the bottleneck (not what you think is slow)
  4. Measure again to verify improvement
  5. Document gains and tradeoffs
Optimizing unmeasured code wastes time and introduces bugs.
</iron_law>
<stages>
Load the maintain-tasks skill for stage tracking:
Stage 1: Establishing baseline
  • content: "Establish performance baseline with realistic workload"
  • activeForm: "Establishing performance baseline"
Stage 2: Profiling bottlenecks
  • content: "Profile code to identify actual bottlenecks"
  • activeForm: "Profiling code to identify bottlenecks"
Stage 3: Analyzing root cause
  • content: "Analyze profiling data to determine root cause"
  • activeForm: "Analyzing profiling data"
Stage 4: Implementing optimization
  • content: "Implement targeted optimization for identified bottleneck"
  • activeForm: "Implementing optimization"
Stage 5: Validating improvement
  • content: "Measure performance gains and verify no regressions"
  • activeForm: "Validating performance improvement"
</stages> <metrics>

Key Performance Indicators


Latency (response time):
  • P50 (median) — typical case
  • P95 — most users
  • P99 — tail latency
  • P99.9 — outliers
  • TTFB — time to first byte
  • TTLB — time to last byte
Throughput:
  • RPS — requests per second
  • ops/sec — operations per second
  • bytes/sec — data transfer rate
  • queries/sec — database throughput
Memory:
  • Heap usage — allocated memory
  • GC frequency — garbage collection pauses
  • GC duration — stop-the-world time
  • Allocation rate — memory churn
  • Resident set size (RSS) — total memory
CPU:
  • CPU time — total compute
  • Wall time — elapsed time
  • Hot paths — frequently executed code
  • Time complexity — algorithmic efficiency
  • CPU utilization — percentage used
Always measure:
  • Before optimization (baseline)
  • After optimization (improvement)
  • Under realistic load (not toy data)
  • Multiple runs (account for variance)
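The tail-latency metrics above can be computed directly from raw samples. A minimal sketch (the `percentile` helper and the nearest-rank method are illustrative choices, not a library API):

```typescript
// Nearest-rank percentile over a sorted copy of the samples
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

const latenciesMs = [12, 15, 11, 90, 14, 13, 250, 16, 12, 14]
console.log(`P50: ${percentile(latenciesMs, 50)}ms`) // 14
console.log(`P95: ${percentile(latenciesMs, 95)}ms`) // 250
```

Note how one slow outlier (250ms) dominates P95 while leaving P50 untouched — this is why medians alone hide tail latency.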
</metrics>
<profiling_tools>

TypeScript/Bun


Built-in timing:

```typescript
console.time('operation')
// ... code to measure
console.timeEnd('operation')

// High precision
const start = Bun.nanoseconds()
// ... code to measure
const elapsed = Bun.nanoseconds() - start
console.log(`Took ${elapsed / 1_000_000}ms`)
```

Performance API:

```typescript
performance.mark('start')
// ... code to measure
performance.mark('end')
performance.measure('operation', 'start', 'end')
const measure = performance.getEntriesByName('operation')[0]
console.log(`Duration: ${measure.duration}ms`)
```
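Since single timings are noisy, it helps to wrap the pattern above in a small helper that reports the median of several runs. A sketch (`timeIt` is a hypothetical helper, not a built-in):

```typescript
// Run `fn` several times and report the median wall time;
// the median is more robust to scheduling noise than a single run.
function timeIt(fn: () => void, runs = 10): number {
  const times: number[] = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    fn()
    times.push(performance.now() - start)
  }
  times.sort((a, b) => a - b)
  return times[Math.floor(times.length / 2)]
}

const medianMs = timeIt(() => {
  let sum = 0
  for (let i = 0; i < 100_000; i++) sum += i
})
console.log(`median: ${medianMs.toFixed(3)}ms`)
```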
Memory profiling:
  • Chrome DevTools → Memory tab → heap snapshots
  • Node.js `--inspect` flag + Chrome DevTools
  • `process.memoryUsage()` for RSS/heap tracking
CPU profiling:
  • Chrome DevTools → Performance tab → record session
  • Node.js `--prof` flag + `node --prof-process`
  • Flamegraphs for visualization

Rust


Benchmarking (criterion benches live in `benches/my_bench.rs` with `harness = false` in Cargo.toml, not inside a `#[cfg(test)]` module):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_function(c: &mut Criterion) {
    c.bench_function("my_function", |b| {
        b.iter(|| my_function(black_box(42)))
    });
}

criterion_group!(benches, benchmark_function);
criterion_main!(benches);
```
Profiling:
  • `cargo bench` — criterion benchmarks
  • `perf record` + `perf report` — Linux profiling
  • `cargo flamegraph` — visual flamegraphs
  • `cargo bloat` — binary size analysis
  • `valgrind --tool=callgrind` — detailed profiling
  • `heaptrack` — memory profiling
Instrumentation:

```rust
use std::time::Instant;

let start = Instant::now();
// ... code to measure
let duration = start.elapsed();
println!("Took: {:?}", duration);
```
</profiling_tools>
<optimization_patterns>

Algorithm Improvements


Time complexity:
  • O(n²) → O(n log n) — sorting, searching
  • O(n) → O(log n) — binary search, trees
  • O(n) → O(1) — hash maps, memoization
Space-time tradeoffs:
  • Cache computed results (memoization)
  • Precompute expensive operations
  • Index data for faster lookup
  • Use hash maps for O(1) access
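A minimal sketch of the O(n) → O(1) memoization tradeoff above (the `memoize` helper is illustrative, not a library API):

```typescript
// Cache results of a pure single-argument function:
// trade memory (the Map) for time (O(1) lookup on repeat calls).
function memoize<A, R>(fn: (arg: A) => R): (arg: A) => R {
  const cache = new Map<A, R>()
  return (arg: A) => {
    if (!cache.has(arg)) cache.set(arg, fn(arg))
    return cache.get(arg)!
  }
}

let calls = 0
const slowSquare = (n: number) => { calls++; return n * n }
const fastSquare = memoize(slowSquare)

fastSquare(7)
fastSquare(7)
console.log(calls) // the underlying function ran only once
```

Memoization only pays off when inputs repeat and the function is pure; for unbounded input domains, bound the cache (e.g. LRU) to avoid trading a CPU problem for a memory one.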

Memory Optimization


Reduce allocations:
```typescript
// Bad: creates new array each iteration
for (const item of items) {
  const results = []
  results.push(process(item))
}

// Good: reuse array
const results = []
for (const item of items) {
  results.push(process(item))
}
```
```rust
// Bad: allocates String every time
fn format_user(name: &str) -> String {
    format!("User: {}", name)
}

// Good: reuses buffer
fn format_user(name: &str, buf: &mut String) {
    buf.clear();
    buf.push_str("User: ");
    buf.push_str(name);
}
```
Memory pooling:
  • Reuse expensive objects (connections, buffers)
  • Object pools for frequently allocated types
  • Arena allocators for batch allocations
Lazy evaluation:
  • Compute only when needed
  • Stream processing vs loading all data
  • Iterators over materialized collections
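The iterator point can be sketched with generators — only the requested elements are ever computed, even from an infinite source (`mapLazy` and `take` are illustrative helpers, not a library):

```typescript
// Lazily map values one at a time instead of materializing a whole array
function* mapLazy<T, U>(items: Iterable<T>, fn: (t: T) => U): Generator<U> {
  for (const item of items) yield fn(item)
}

// Stop pulling from the source after n elements
function* take<T>(iter: Iterable<T>, n: number): Generator<T> {
  if (n <= 0) return
  let i = 0
  for (const item of iter) {
    yield item
    if (++i >= n) return
  }
}

function* naturals(): Generator<number> {
  for (let i = 1; ; i++) yield i // infinite source
}

let mapped = 0
const first3 = [...take(mapLazy(naturals(), n => { mapped++; return n * 2 }), 3)]
console.log(first3, mapped) // only 3 elements were ever mapped
```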

I/O Optimization


Batching:
  • Batch API calls (1 request vs 100)
  • Batch database writes (bulk insert)
  • Batch file operations (single write vs many)
Caching:
  • Cache expensive computations
  • Cache database queries (Redis, in-memory)
  • Cache API responses (HTTP caching)
  • Invalidate stale cache entries
Async I/O:
  • Non-blocking operations (async/await)
  • Concurrent requests (Promise.all, tokio::spawn)
  • Connection pooling (reuse connections)
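The batching pattern can be sketched synchronously — `BatchWriter` is a hypothetical helper, and `flushFn` stands in for a bulk insert or single syscall:

```typescript
// Accumulate small writes and flush them as one large operation.
class BatchWriter<T> {
  private buffer: T[] = []
  public flushes = 0
  constructor(private flushFn: (batch: T[]) => void, private maxSize = 100) {}

  write(item: T) {
    this.buffer.push(item)
    if (this.buffer.length >= this.maxSize) this.flush()
  }

  flush() {
    if (this.buffer.length === 0) return
    this.flushFn(this.buffer)
    this.buffer = []
    this.flushes++
  }
}

let written = 0
const writer = new BatchWriter<number>(batch => { written += batch.length }, 100)
for (let i = 0; i < 250; i++) writer.write(i)
writer.flush() // flush the remainder
console.log(written, writer.flushes) // 250 items in 3 flushes
```

A real implementation would also flush on a timer so a partially filled batch is not delayed indefinitely.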

Database Optimization


Query optimization:
  • Add indexes for common queries
  • Use EXPLAIN/EXPLAIN ANALYZE
  • Avoid N+1 queries (use joins or batch loading)
  • Select only needed columns
  • Filter at database level (WHERE vs client filter)
Schema design:
  • Normalize to reduce duplication
  • Denormalize for read-heavy workloads
  • Partition large tables
  • Use appropriate data types
Connection management:
  • Connection pooling (don't create per request)
  • Prepared statements (avoid SQL parsing)
  • Transaction batching (reduce round trips)
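The N+1 point above can be made concrete with a toy model — the Map stands in for a database table, and each query function for one round trip:

```typescript
// Hypothetical in-memory "table"; each query function counts one round trip.
const users = new Map([[1, "ada"], [2, "grace"], [3, "edsger"]])
let roundTrips = 0

function queryOne(id: number): string | undefined {
  roundTrips++
  return users.get(id)
}
function queryMany(ids: number[]): (string | undefined)[] {
  roundTrips++ // one round trip for the whole batch, like WHERE id IN (...)
  return ids.map(id => users.get(id))
}

const ids = [1, 2, 3]
// N+1 style: one query per id
ids.map(id => queryOne(id))
const afterNPlus1 = roundTrips // 3
// Batched: a single query for all ids
queryMany(ids)
console.log(afterNPlus1, roundTrips) // 3 round trips vs 1 more
```

With real network latency per round trip, the batched form's advantage grows linearly with N.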
</optimization_patterns>
<workflow>
Loop: Measure → Profile → Analyze → Optimize → Validate
  1. Define performance goal — target metric (e.g., P95 < 100ms)
  2. Establish baseline — measure current performance under realistic load
  3. Profile systematically — identify actual bottleneck (not guesses)
  4. Analyze root cause — understand why code is slow
  5. Design optimization — plan targeted improvement
  6. Implement optimization — make focused change
  7. Measure improvement — verify gains, check for regressions
  8. Document results — record baseline, optimization, gains, tradeoffs
At each step:
  • Document measurements with methodology
  • Note profiling tool output
  • Track optimization attempts (what worked/failed)
  • Update performance documentation
</workflow> <validation>
Before declaring optimization complete:
Check gains:
  • ✓ Measured improvement meets target?
  • ✓ Improvement statistically significant?
  • ✓ Tested under realistic load?
  • ✓ Multiple runs confirm consistency?
Check regressions:
  • ✓ No degradation in other metrics?
  • ✓ Memory usage still acceptable?
  • ✓ Code complexity still manageable?
  • ✓ Tests still pass?
Check documentation:
  • ✓ Baseline measurements recorded?
  • ✓ Optimization approach explained?
  • ✓ Gains quantified with numbers?
  • ✓ Tradeoffs documented?
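A quick variance check before trusting a measured gain — a sketch of the idea, not a full significance test:

```typescript
// Sample mean and standard deviation (Bessel-corrected) over run timings.
function stats(samples: number[]) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (samples.length - 1)
  return { mean, stddev: Math.sqrt(variance) }
}

const baselineMs = [102, 98, 101, 99, 100]
const optimizedMs = [81, 79, 80, 82, 78]
const b = stats(baselineMs)
const o = stats(optimizedMs)

// Crude sanity check: the gap should dwarf the combined run-to-run noise.
const gap = b.mean - o.mean
console.log(gap > 2 * (b.stddev + o.stddev) ? "likely real" : "inconclusive")
```

For rigorous comparisons, use a proper test (e.g. Mann-Whitney U or a t-test) or a benchmarking harness that reports confidence intervals.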
</validation> <rules>
ALWAYS:
  • Measure before optimizing (baseline)
  • Profile to find actual bottleneck
  • Use realistic workload (not toy data)
  • Measure multiple runs (account for variance)
  • Document baseline and improvements
  • Check for regressions in other metrics
  • Consider readability vs performance tradeoff
  • Verify statistical significance
NEVER:
  • Optimize without measuring first
  • Guess at bottleneck without profiling
  • Benchmark with unrealistic data
  • Trust single-run measurements
  • Skip documentation of results
  • Sacrifice correctness for speed
  • Optimize without clear performance goal
  • Ignore algorithmic improvements
</rules> <references>
Methodology:
  • benchmarking.md — rigorous benchmarking methodology
Related skills:
  • codebase-recon — evidence-based investigation (foundation)
  • debugging — structured bug investigation
  • typescript-dev — correctness before performance
</references>