performance-engineer


Performance Engineer


Purpose


Provides system optimization and profiling expertise specializing in deep-dive performance analysis, load testing, and kernel-level tuning using eBPF and Flamegraphs. Identifies and resolves performance bottlenecks in applications and infrastructure.

When to Use


  • Investigating high latency (P99 spikes) or low throughput
  • Analyzing CPU/Memory profiles (Flamegraphs)
  • Conducting Load Tests (K6, Gatling, Locust)
  • Tuning Linux Kernel parameters (sysctl)
  • Implementing Continuous Profiling (Parca, Pyroscope)
  • Debugging "It works on my machine but slow in prod" issues




2. Decision Framework


Profiling Strategy


What is the bottleneck?
├─ **CPU High?**
│  ├─ User Space? → **Language Profiler** (pprof, async-profiler)
│  └─ Kernel Space? → **perf / eBPF** (System calls, Context switches)
├─ **Memory High?**
│  ├─ Leak? → **Heap Dump Analysis** (Eclipse MAT, heaptrack)
│  └─ Fragmentation? → **Allocator tuning** (jemalloc, tcmalloc)
├─ **I/O Wait?**
│  ├─ Disk? → **iostat / biotop**
│  └─ Network? → **tcpdump / Wireshark**
└─ **Latency (Wait Time)?**
   └─ Distributed? → **Tracing** (OpenTelemetry, Jaeger)
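For the user-space CPU branch, the same measure-first idea can be sketched with Python's built-in profiler, `cProfile` (Python's rough equivalent of pprof). The `hot_path` function is an illustrative stand-in workload:

```python
import cProfile
import io
import pstats

def hot_path(n):
    # Deliberately quadratic stand-in workload so it dominates the profile
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_path(300)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # top 5 entries by cumulative time
print(stream.getvalue())
```

The same workflow applies with pprof (Go) or async-profiler (JVM): attach, sample, then sort by cumulative or self time to find the hot path.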

Load Testing Tools


| Tool | Language | Best For |
|---|---|---|
| K6 | JS | Developer-friendly, CI/CD integration. |
| Gatling | Scala/Java | High concurrency, complex scenarios. |
| Locust | Python | Rapid prototyping, code-based tests. |
| Wrk2 | C | Raw HTTP throughput benchmarking (simple). |
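Mechanically, all of these tools do the same thing: run concurrent workers and record per-request latency. A minimal stdlib sketch against a throwaway local server makes the mechanics concrete (this is not a substitute for K6 or Gatling, which add ramp-up schedules, reporting, and distributed workers):

```python
import http.server
import statistics
import threading
import time
import urllib.request

class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, *args):  # silence per-request logging
        pass

# Throwaway local server as the target (stand-in for the real system under test)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

latencies = []
lock = threading.Lock()

def worker(n_requests):
    for _ in range(n_requests):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        with lock:
            latencies.append(time.perf_counter() - start)

# 5 concurrent "virtual users", 20 requests each
threads = [threading.Thread(target=worker, args=(20,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.shutdown()

latencies.sort()
p50 = statistics.median(latencies)
p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"{len(latencies)} requests  p50={p50 * 1000:.1f}ms  p95={p95 * 1000:.1f}ms")
```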

Optimization Hierarchy


  1. Algorithm: O(n^2) → O(n log n). Biggest wins.
  2. Architecture: Caching, Async processing.
  3. Code/Language: Memory allocation, loop unrolling.
  4. System/Kernel: TCP stack tuning, CPU affinity.
Red Flags → Escalate to `database-optimizer`:
  • "Slow performance" turns out to be a single SQL query missing an index
  • Database locks/deadlocks causing application stalls
  • Disk I/O saturation on the DB server


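The ordering is easy to verify empirically. A toy illustration of step 1, using duplicate detection as a hypothetical workload: the quadratic version compares every pair, the O(n log n) version sorts once and scans neighbors.

```python
import timeit

def has_duplicate_quadratic(items):
    # O(n^2): compare every pair
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_sorted(items):
    # O(n log n): sort once, then scan adjacent elements
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

data = list(range(2_000))  # worst case for both: no duplicates
slow = timeit.timeit(lambda: has_duplicate_quadratic(data), number=3)
fast = timeit.timeit(lambda: has_duplicate_sorted(data), number=3)
print(f"O(n^2): {slow:.3f}s   O(n log n): {fast:.3f}s")
```

No amount of kernel tuning (level 4) recovers the gap between these two curves, which is why the algorithm sits at the top of the hierarchy.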


3. Core Workflows


Workflow 1: CPU Profiling with Flamegraphs


Goal: Identify which function is consuming 80% CPU.
Steps:
  1. Capture Profile (Linux perf)

     ```bash
     # Record stack traces at 99Hz for 30 seconds
     perf record -F 99 -a -g -- sleep 30
     ```

  2. Generate Flamegraph

     ```bash
     perf script > out.perf
     ./stackcollapse-perf.pl out.perf > out.folded
     ./flamegraph.pl out.folded > profile.svg
     ```

  3. Analysis
     • Open `profile.svg` in a browser.
     • Look for wide towers (functions taking time).
     • Example: `json_parse` is 40% of the width → optimize JSON handling.


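The intermediate `out.folded` file is plain text, one `frame;frame;... count` line per unique stack, so quick questions like "which frames are widest?" can be answered without rendering the SVG. A sketch with made-up frame names and counts:

```python
from collections import Counter

# Each line of a folded-stack file: "frame1;frame2;... <sample_count>"
# (frame names and counts below are invented for illustration)
folded = """\
main;server_loop;handle_request;json_parse 40
main;server_loop;handle_request;db_query 25
main;server_loop;handle_request;json_parse;utf8_decode 15
main;gc 20
"""

width = Counter()  # samples in which each frame appears (self + children)
total = 0
for line in folded.strip().splitlines():
    stack, count = line.rsplit(" ", 1)
    total += int(count)
    for frame in set(stack.split(";")):  # set(): count a frame once per stack
        width[frame] += int(count)

for frame, samples in width.most_common(3):
    print(f"{frame}: {samples / total:.0%} of samples")
```

Here `json_parse` appears in 55% of samples, which is exactly the "wide tower" the flamegraph would show.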


Workflow 3: Interaction to Next Paint (INP)


Goal: Improve Frontend responsiveness (Core Web Vital).
Steps:
  1. Measure
     • Use the Chrome DevTools Performance tab.
     • Look for "Long Tasks" (red blocks > 50ms).
  2. Identify
     • Is it hydration? Event handlers?
     • Example: A click handler forcing a synchronous layout recalculation.
  3. Optimize
     • Yield to the Main Thread: `await new Promise(r => setTimeout(r, 0))` or `scheduler.postTask()`.
     • Web Workers: Move heavy logic off-thread.




Workflow 5: Interaction to Next Paint (INP) Optimization


Goal: Fix "Laggy Click" (INP > 200ms) on a React button.
Steps:
  1. Identify Interaction
     • Use React DevTools Profiler (Interaction Tracing).
     • Find the `click` handler duration.
  2. Break Up Long Tasks

     ```javascript
     async function handleClick() {
       // 1. UI Update (Immediate)
       setLoading(true);

       // 2. Yield to main thread to let browser paint
       await new Promise(r => setTimeout(r, 0));

       // 3. Heavy Logic
       await heavyCalculation();
       setLoading(false);
     }
     ```

  3. Verify
     • Use the Web Vitals extension and check that INP drops below 200ms.




5. Anti-Patterns & Gotchas


❌ Anti-Pattern 1: Premature Optimization


What it looks like:
  • Replacing a readable `map()` with a complex `for` loop because "it's faster", without measuring.
Why it fails:
  • Wasted dev time.
  • Code becomes unreadable.
  • Usually negligible impact compared to I/O.
Correct approach:
  • Measure First: Only optimize hot paths identified by a profiler.
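Measuring first is cheap: a `timeit` one-off settles the map-vs-loop question in seconds, before anyone rewrites anything. The functions below are stand-ins; the point is the measurement, not the winner.

```python
import timeit

def with_map(values):
    return list(map(str, values))

def with_loop(values):
    out = []
    for v in values:
        out.append(str(v))
    return out

values = list(range(10_000))
t_map = timeit.timeit(lambda: with_map(values), number=100)
t_loop = timeit.timeit(lambda: with_loop(values), number=100)
delta = abs(t_map - t_loop) / max(t_map, t_loop)
print(f"map: {t_map:.3f}s  loop: {t_loop:.3f}s  relative difference: {delta:.0%}")
```

Whatever the delta turns out to be, it only matters if a profiler has already shown this code on a hot path.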

❌ Anti-Pattern 2: Testing "localhost" vs Production


What it looks like:
  • "It handles 10k req/s on my MacBook."
Why it fails:
  • Network latency (0ms on localhost).
  • Database dataset size (tiny on local).
  • Cloud limits (CPU credits, I/O bursts).
Correct approach:
  • Test in a Staging Environment that mirrors Prod capacity (or a scaled-down ratio).

❌ Anti-Pattern 3: Ignoring Tail Latency (Averages)


What it looks like:
  • "Average latency is 200ms, we are fine."
Why it fails:
  • P99 could be 10 seconds. 1% of users are suffering.
  • In microservices, tail latencies multiply.
Correct approach:
  • Always measure P50, P95, and P99. Optimize for P99.


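The gap between the mean and the tail is easy to demonstrate. Below, a synthetic sample where 2% of requests take 10 s: the mean still looks acceptable while P99 does not (nearest-rank percentile; the numbers are illustrative):

```python
import statistics

# Synthetic sample: 98 fast requests, 2 ten-second outliers
latencies_ms = [100] * 98 + [10_000] * 2

def percentile(sorted_data, p):
    # Nearest-rank percentile on a pre-sorted sample
    rank = max(1, round(p / 100 * len(sorted_data)))
    return sorted_data[rank - 1]

ordered = sorted(latencies_ms)
mean = statistics.mean(latencies_ms)
print(f"mean={mean:.0f}ms p50={percentile(ordered, 50)}ms "
      f"p95={percentile(ordered, 95)}ms p99={percentile(ordered, 99)}ms")

# Tail latencies compound: if a request touches 15 services, each slow
# 1% of the time, the share of requests hitting at least one slow hop is:
print(f"{1 - 0.99 ** 15:.0%} of requests touch a slow service")
```

Here the mean is 298 ms while P99 is 10 s, and the 15-service fan-out turns a 1%-per-service tail into roughly 14% of end-to-end requests.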


Examples


Example 1: CPU Performance Optimization Using Flamegraphs


Scenario: Production API experiencing 80% CPU utilization causing latency spikes.
Investigation Approach:
  1. Profile Collection: Used perf to capture CPU stack traces
  2. Flamegraph Generation: Created visualization of CPU usage
  3. Analysis: Identified hot functions consuming most CPU
  4. Optimization: Targeted the top 3 functions
Key Findings:
| Function | CPU % | Optimization Action |
|---|---|---|
| json_serialize | 35% | Switch to binary format |
| crypto_hash | 25% | Batch hashing operations |
| regex_match | 20% | Pre-compile patterns |
Results:
  • CPU utilization: 80% → 35%
  • P99 latency: 1.2s → 150ms
  • Throughput: 500 RPS → 2,000 RPS
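The `regex_match` fix from the table, sketched in Python (the log format is invented; note that Python's `re` caches compiled patterns internally, so the gain here is smaller than in languages without such a cache):

```python
import re
import timeit

LOG_LINE = "2024-05-01 12:00:00 GET /api/items 200 35ms"  # invented format

def duration_inline(lines):
    # Pattern re-looked-up on every call
    return [re.search(r"(\d+)ms$", line).group(1) for line in lines]

DURATION = re.compile(r"(\d+)ms$")
def duration_precompiled(lines):
    # Pattern compiled once at import time
    return [DURATION.search(line).group(1) for line in lines]

lines = [LOG_LINE] * 10_000
t_inline = timeit.timeit(lambda: duration_inline(lines), number=10)
t_compiled = timeit.timeit(lambda: duration_precompiled(lines), number=10)
print(f"inline: {t_inline:.3f}s  pre-compiled: {t_compiled:.3f}s")
```

The same principle applies to the other two rows: move per-request setup cost (serializer construction, hash context init) out of the hot loop.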

Example 2: Distributed Tracing for Microservices Latency


Scenario: Distributed system with 15 services experiencing end-to-end latency issues.
Investigation Approach:
  1. Trace Collection: Deployed OpenTelemetry collectors
  2. Latency Analysis: Identified service with highest latency contribution
  3. Dependency Analysis: Mapped service dependencies and data flows
  4. Root Cause: Database connection pool exhaustion
Trace Analysis:
Service A (50ms) → Service B (200ms) → Service C (500ms) → Database (1s)
                                                           ↑ Connection pool exhaustion
Resolution:
  • Increased connection pool size
  • Implemented query optimization
  • Added read replicas for heavy queries
Results:
  • End-to-end P99: 2.5s → 300ms
  • Database CPU: 95% → 60%
  • Error rate: 5% → 0.1%
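The failure mode reproduces in miniature: a connection pool is effectively a semaphore, and once concurrent demand exceeds its size, excess requests queue and per-request latency grows with queue depth. A toy sketch (`time.sleep` stands in for the DB round-trip; names are illustrative):

```python
import threading
import time

class TinyPool:
    """Toy connection pool: a semaphore guarding N 'connections'."""
    def __init__(self, size):
        self._sem = threading.Semaphore(size)

    def query(self, duration):
        start = time.perf_counter()
        with self._sem:           # callers block here when the pool is exhausted
            time.sleep(duration)  # stand-in for the actual DB round-trip
        return time.perf_counter() - start

def worst_latency(pool_size, clients=8, query_time=0.05):
    pool = TinyPool(pool_size)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(pool.query(query_time)))
        for _ in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max(results)

# 8 concurrent clients: a pool of 2 forces ~4 waves of queuing
print(f"pool=2: {worst_latency(2):.2f}s  pool=8: {worst_latency(8):.2f}s")
```

Increasing pool size helps only while the database behind it has headroom, which is why the fix above paired it with query optimization and read replicas.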

Example 3: Load Testing for Capacity Planning


Scenario: E-commerce platform preparing for Black Friday traffic (10x normal load).
Load Testing Approach:
  1. Test Design: Created realistic user journey scenarios
  2. Test Execution: Gradual ramp-up to target load
  3. Bottleneck Identification: Found breaking points
  4. Capacity Planning: Determined required resources
Load Test Results:
| Virtual Users | RPS | P95 Latency | Error Rate |
|---|---|---|---|
| 1,000 | 500 | 150ms | 0.1% |
| 5,000 | 2,400 | 280ms | 0.3% |
| 10,000 | 4,800 | 550ms | 1.2% |
| 15,000 | 6,200 | 1.2s | 5.8% |
Capacity Recommendations:
  • Scale to 12,000 concurrent users
  • Add 3 more application servers
  • Increase database read replicas to 5
  • Implement rate limiting at 10,000 RPS
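Turning a results table like this into a capacity number is mechanical once SLOs are fixed. A sketch assuming SLOs of P95 ≤ 500 ms and error rate ≤ 1% (the thresholds are illustrative, not the ones used above):

```python
# Rows from the load-test table: (virtual_users, rps, p95_ms, error_rate)
results = [
    (1_000,    500,    150, 0.001),
    (5_000,  2_400,    280, 0.003),
    (10_000, 4_800,    550, 0.012),
    (15_000, 6_200,  1_200, 0.058),
]

# Assumed SLOs (illustrative)
SLO_P95_MS, SLO_ERROR_RATE = 500, 0.01

within_slo = [r for r in results if r[2] <= SLO_P95_MS and r[3] <= SLO_ERROR_RATE]
users, rps, p95, err = max(within_slo, key=lambda r: r[1])
print(f"Highest tested load within SLO: {users:,} users at {rps:,} RPS (P95 {p95}ms)")
```

Real capacity planning then adds headroom on top of that number and sets the rate limit below the observed breaking point.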

Best Practices


Profiling and Analysis


  • Measure First: Always profile before optimizing
  • Comprehensive Coverage: Analyze CPU, memory, I/O, and network
  • Production Safe: Use low-overhead profiling in production
  • Regular Baselines: Establish performance baselines for comparison

Load Testing


  • Realistic Scenarios: Model actual user behavior and workflows
  • Progressive Ramp-up: Start low, increase gradually
  • Bottleneck Identification: Find limiting factors systematically
  • Repeatability: Maintain consistent test environments

Performance Optimization


  • Algorithm First: Optimize algorithms before micro-optimizations
  • Caching Strategy: Implement appropriate caching layers
  • Database Optimization: Indexes, queries, connection pooling
  • Resource Management: Efficient allocation and pooling

Monitoring and Observability


  • Comprehensive Metrics: CPU, memory, disk, network, application
  • Distributed Tracing: End-to-end visibility in microservices
  • Alerting: Proactive identification of performance degradation
  • Dashboarding: Real-time visibility into system health

Quality Checklist


Profiling:
  • Symbols: Debug symbols available for accurate stack traces.
  • Overhead: Profiler overhead verified (< 1-2% for production).
  • Scope: Both CPU and Wall-clock time analyzed.
  • Context: Profile includes full request lifecycle.
Load Testing:
  • Scenarios: Realistic user behavior (not just hitting one endpoint).
  • Warmup: System warmed up before measurement (JIT/Caches).
  • Bottleneck: Identified the limiting factor (CPU, DB, Bandwidth).
  • Repeatable: Tests can be run consistently.
Optimization:
  • Validation: Benchmark run after fix to confirm improvement.
  • Regression: Ensured optimization didn't break functionality.
  • Documentation: Documented why the optimization was done.
  • Monitoring: Added metrics to track optimization impact.