performance_engineering

Performance Engineering

Purpose

Performance engineering encompasses load testing, profiling, and optimization to deliver reliable, scalable systems. This skill provides frameworks for choosing the right performance testing approach (load, stress, soak, spike), profiling techniques to identify bottlenecks (CPU, memory, I/O), and optimization strategies for backend APIs, databases, and frontend applications.

Use this skill to validate system capacity before launch, detect performance regressions in CI/CD pipelines, identify and resolve bottlenecks through profiling, and optimize application responsiveness across the stack.

When to Use This Skill

Common Triggers:

"Validate API can handle expected traffic"
"Find maximum capacity and breaking points"
"Identify why the application is slow"
"Detect memory leaks or resource exhaustion"
"Optimize Core Web Vitals for SEO"
"Set up performance testing in CI/CD"
"Reduce cloud infrastructure costs"

Use Cases:

Pre-launch capacity planning and load validation
Post-refactor performance regression testing
Investigating slow response times or high latency
Detecting memory leaks in long-running services
Optimizing database query performance
Validating auto-scaling configuration
Establishing performance SLOs and budgets

Performance Testing Types

Load Testing

Validate system behavior under expected traffic levels.

When to use: Pre-launch capacity planning, regression testing after refactors, validating auto-scaling.

Stress Testing

Find system capacity limits and failure modes.

When to use: Capacity planning, understanding failure behavior, infrastructure sizing decisions.

Soak Testing

Identify memory leaks, resource exhaustion, and degradation over time.

When to use: Detecting memory leaks, validating connection pool cleanup, testing long-running batch jobs.

Spike Testing

Validate system response to sudden traffic spikes.

When to use: Validating auto-scaling, testing event-driven systems (product launches), ensuring rate limiting works.

Quick Decision Framework

Which test type to use?

What am I trying to learn?
├─ Can my system handle expected traffic? → LOAD TEST
├─ What's the maximum capacity? → STRESS TEST
├─ Will it stay stable over time? → SOAK TEST
└─ Can it handle traffic spikes? → SPIKE TEST
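
The four test types differ mainly in how load is shaped over time. The sketch below makes that concrete in Python. The helper names `stages_for` and `peak_users` are hypothetical, and every duration and multiplier is an illustrative assumption rather than a recommendation from any tool; tune them to your own traffic.

```python
from typing import List, Tuple

# Each stage is (duration_seconds, target_virtual_users).
Stage = Tuple[int, int]


def stages_for(test_type: str, expected_users: int) -> List[Stage]:
    """Return an illustrative load shape for one of the four test types.

    All durations and multipliers are assumptions chosen for illustration.
    """
    if test_type == "load":
        # Ramp to expected traffic, hold, ramp down.
        return [(60, expected_users), (600, expected_users), (60, 0)]
    if test_type == "stress":
        # Keep stepping past expected traffic to find the breaking point.
        return [(120, expected_users * m) for m in (1, 2, 3, 4)] + [(60, 0)]
    if test_type == "soak":
        # Expected traffic held for hours to expose leaks and slow degradation.
        return [(300, expected_users), (4 * 3600, expected_users), (300, 0)]
    if test_type == "spike":
        # Low baseline, sudden jump, sudden drop.
        baseline = max(1, expected_users // 10)
        return [
            (60, baseline),
            (10, expected_users * 5),
            (120, expected_users * 5),
            (10, baseline),
            (60, 0),
        ]
    raise ValueError(f"unknown test type: {test_type!r}")


def peak_users(stages: List[Stage]) -> int:
    """Highest virtual-user target reached by a load shape."""
    return max(target for _, target in stages)
```

The tuples map directly onto the `duration`/`target` pairs a load tool expects, so the same shape can be translated into a tool's stage configuration by hand.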

For detailed testing patterns, load scenarios, and interpreting results, see

references/testing-types.md

Load Testing Quick Starts

k6 (JavaScript)

Installation:

bash

brew install k6  # macOS
sudo apt-get install k6  # Debian/Ubuntu, after adding the k6 apt repository

Basic Load Test:

javascript

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 20 },
    { duration: '1m', target: 20 },
    { duration: '30s', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/products');
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
  sleep(1);
}

Run:

k6 run script.js

For stress, soak, and spike testing examples, see

examples/k6/

Locust (Python)

Installation:

bash

pip install locust

Basic Load Test:

python

from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)
    host = "https://api.example.com"

    @task(3)
    def view_products(self):
        self.client.get("/products")

    @task(1)
    def view_product_detail(self):
        self.client.get("/products/123")

Run:

locust -f locustfile.py --headless -u 100 -r 10 --run-time 10m

For REST API testing and data-driven testing, see

examples/locust/

Profiling Quick Starts

When to Profile

Symptom	Profiling Type	Tool
High CPU (>70%)	CPU Profiling	py-spy, pprof, DevTools
Memory growing	Memory Profiling	memory_profiler, pprof heap
Slow response, low CPU	I/O Profiling	Query logs, pprof block

Python Profiling

py-spy (Production-Safe):

bash

pip install py-spy

# Profile running process
py-spy record -o profile.svg --pid <PID> --duration 30

# Top-like view
py-spy top --pid <PID>

Memory Profiling:

python

from memory_profiler import profile

@profile
def my_function():
    a = [1] * (10 ** 6)
    return a

if __name__ == "__main__":
    my_function()  # the decorated function must actually run to be measured

Run: python -m memory_profiler script.py
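
memory_profiler reports per-line usage for a single run. To check whether memory keeps growing across many requests, one option is to compare snapshots over time. The sketch below uses the standard library's `tracemalloc` instead of memory_profiler (a deliberate swap so that it runs without extra packages). `leaky_cache`, `handle_request`, and `measure_growth` are hypothetical names for illustration.

```python
import tracemalloc

# Hypothetical leak: a module-level list that is never cleared.
leaky_cache = []


def handle_request(payload_size: int) -> None:
    """Simulated request handler that retains its payload forever."""
    leaky_cache.append(bytearray(payload_size))


def measure_growth(requests: int, payload_size: int) -> int:
    """Return the net bytes still allocated after the simulated requests."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(requests):
        handle_request(payload_size)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()

    # Positive size_diff entries are allocations that were not released.
    diff = after.compare_to(before, "lineno")
    return sum(stat.size_diff for stat in diff)


if __name__ == "__main__":
    print(f"net growth: {measure_growth(100, 10_000)} bytes")
```

Growth that scales with the number of requests, rather than levelling off, is the signal a soak test looks for.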

Go Profiling

pprof (Built-in):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the profiling endpoints on localhost only, off the main request path.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    startApp() // placeholder for the application's own entry point
}

Capture profile:

bash

# CPU profile (30 seconds)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Interactive analysis
(pprof) top
(pprof) web

TypeScript/JavaScript Profiling

Chrome DevTools (Browser/Node.js):

Node.js:

bash

node --inspect app.js
# Open chrome://inspect
# Performance tab → Record

clinic.js (Node.js):

bash

npm install -g clinic
clinic doctor -- node app.js

For detailed profiling workflows and analysis, see

references/profiling-guide.md

and

examples/profiling/

Optimization Strategies

Caching

When to cache:

Data queried frequently (>100 req/min)
Data that tolerates staleness (being at least 1 minute out of date is acceptable)

Redis example:

python

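import json

import redis

r = redis.Redis()

def get_cached_data(key, fn, ttl=300):
    """Cache-aside lookup: return the cached value, or compute and store it."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    data = fn()  # cache miss: compute the value, then store it with a TTL
    r.setex(key, ttl, json.dumps(data))
    return data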
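
To see how the TTL in this cache-aside pattern behaves without a Redis server, here is a minimal sketch with an in-memory stand-in. `TTLCache` is a hypothetical class that mimics only the `get` and `setex` calls used above, and the injectable `clock` parameter is an assumption added so that expiry can be exercised without sleeping.

```python
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for the two Redis calls used above (get / setex)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it so the caller recomputes.
            del self._store[key]
            return None
        return value

    def setex(self, key: str, ttl: int, value: str) -> None:
        self._store[key] = (self._clock() + ttl, value)


def get_cached_data(cache: TTLCache, key: str, fn: Callable[[], Any], ttl: int = 300) -> Any:
    """Cache-aside: return the cached value, or compute, store, and return it."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    data = fn()
    cache.setex(key, ttl, json.dumps(data))
    return data
```

Within the TTL window the expensive function runs once; after expiry the next call recomputes and refreshes the entry, which is the staleness trade-off described under "When to cache".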

Database Query Optimization

N+1 prevention:

python

from sqlalchemy.orm import joinedload

# Bad: N+1 queries
users = User.query.all()
for user in users:
    print(user.orders)  # Separate query per user

# Good: Eager loading
users = User.query.options(joinedload(User.orders)).all()

Indexing:

sql

CREATE INDEX idx_users_email ON users(email);
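
The cost of the N+1 pattern is easiest to see by counting statements. The sketch below uses the standard library's `sqlite3` rather than an ORM, so it illustrates the idea and is not the SQLAlchemy mechanism itself. The schema, the sample rows, and all function names are hypothetical.

```python
import sqlite3
from typing import Callable, Dict, List


def build_db() -> sqlite3.Connection:
    """Create a tiny in-memory schema with 3 users and 2 orders each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        """
    )
    for user_id in (1, 2, 3):
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, f"u{user_id}@example.com"))
        for n in (1, 2):
            conn.execute(
                "INSERT INTO orders (user_id, total) VALUES (?, ?)", (user_id, 10.0 * n)
            )
    return conn


def orders_n_plus_one(conn: sqlite3.Connection) -> Dict[int, List[float]]:
    """One query for the users, then one more query per user."""
    result: Dict[int, List[float]] = {}
    for (user_id,) in conn.execute("SELECT id FROM users").fetchall():
        rows = conn.execute(
            "SELECT total FROM orders WHERE user_id = ? ORDER BY id", (user_id,)
        )
        result[user_id] = [total for (total,) in rows]
    return result


def orders_joined(conn: sqlite3.Connection) -> Dict[int, List[float]]:
    """A single JOIN returns the same data in one statement."""
    result: Dict[int, List[float]] = {}
    rows = conn.execute(
        "SELECT u.id, o.total FROM users u "
        "JOIN orders o ON o.user_id = u.id ORDER BY u.id, o.id"
    )
    for user_id, total in rows:
        result.setdefault(user_id, []).append(total)
    return result


def count_selects(conn: sqlite3.Connection, func: Callable) -> int:
    """Count the SELECT statements issued while func runs."""
    statements: List[str] = []
    conn.set_trace_callback(statements.append)
    func(conn)
    conn.set_trace_callback(None)
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))
```

With N users the first approach issues N + 1 statements while the join issues one, so the gap widens linearly as the table grows.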

API Performance

Cursor-based pagination:

typescript

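app.get('/api/products', async (req, res) => {
  // Query-string values arrive as strings, so parse and bound them first.
  const cursor = Number.parseInt(String(req.query.cursor ?? '0'), 10) || 0;
  // The cap of 100 is an assumed upper bound; pick one that suits your API.
  const limit = Math.min(Number.parseInt(String(req.query.limit ?? '20'), 10) || 20, 100);

  const products = await db.query(
    'SELECT * FROM products WHERE id > ? ORDER BY id LIMIT ?',
    [cursor, limit]
  );

  res.json({
    data: products,
    // null tells the client there are no further pages
    next_cursor: products.length === limit ? products[products.length - 1].id : null,
  });
});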
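
The same cursor (keyset) pagination idea can be exercised end to end with the standard library's `sqlite3`. This is a sketch under assumptions: the `products` table layout, the cap of 100 rows, and the names `fetch_page` and `iterate_all` are all hypothetical.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, str]


def fetch_page(
    conn: sqlite3.Connection, cursor: int = 0, limit: int = 20
) -> Tuple[List[Row], Optional[int]]:
    """Return one page of products with id > cursor, plus the next cursor."""
    limit = max(1, min(limit, 100))  # assumed upper bound to protect the database
    rows = conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (cursor, limit),
    ).fetchall()
    # A short page means the end was reached, so there is no next cursor.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor


def iterate_all(conn: sqlite3.Connection, limit: int = 20) -> Iterator[Row]:
    """Walk every page by following next_cursor until it is None."""
    cursor: Optional[int] = 0
    while cursor is not None:
        rows, cursor = fetch_page(conn, cursor, limit)
        yield from rows
```

Because each page is located with `id > cursor` on an indexed column, the query cost stays flat for deep pages, unlike `OFFSET`, which has to skip every preceding row.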

Frontend Performance (Core Web Vitals)

Key metrics:

LCP (Largest Contentful Paint): < 2.5s
INP (Interaction to Next Paint): < 200ms
CLS (Cumulative Layout Shift): < 0.1
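
These targets can be turned into a small rating helper for dashboards or CI reports. In the sketch below the "good" bounds are the targets listed above; the "poor" bounds (4.0 s, 500 ms, 0.25) are the commonly published Core Web Vitals cut-offs and are included here as an assumption. `rate_metric` is a hypothetical name.

```python
from typing import Dict, Tuple

# (good_upper_bound, poor_lower_bound) per metric.
# "Good" bounds match the targets listed above; "poor" bounds are assumed
# from the commonly published Core Web Vitals cut-offs.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "LCP": (2.5, 4.0),   # seconds
    "INP": (200, 500),   # milliseconds
    "CLS": (0.1, 0.25),  # unitless layout-shift score
}


def rate_metric(name: str, value: float) -> str:
    """Classify a measured value as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[name]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"
```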

Optimization techniques:

Code splitting (lazy loading)
Image optimization (WebP, responsive, lazy loading)
Preload critical resources
Minimize render-blocking resources

For detailed optimization strategies, see

references/optimization-strategies.md

and

references/frontend-performance.md

Performance SLOs

Recommended SLOs by Service Type

Service Type	p95 Latency	p99 Latency	Availability
User-Facing API	< 200ms	< 500ms	99.9%
Internal API	< 100ms	< 300ms	99.5%
Database Query	< 50ms	< 100ms	99.99%
Background Job	< 5s	< 10s	99%
Real-time API	< 50ms	< 100ms	99.95%
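
Checking a service against one row of this table comes down to computing percentiles over measured latencies. The sketch below assumes the nearest-rank percentile definition (monitoring tools may interpolate differently), and the names `percentile` and `check_slo` are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slo(
    latencies_ms: Sequence[float], p95_limit_ms: float, p99_limit_ms: float
) -> Dict[str, bool]:
    """Report whether p95 and p99 latency are under their limits."""
    return {
        "p95": percentile(latencies_ms, 95) < p95_limit_ms,
        "p99": percentile(latencies_ms, 99) < p99_limit_ms,
    }
```

For a user-facing API the call would be `check_slo(latencies, 200, 500)`. A run can pass p95 and still fail p99 when a small number of requests are very slow, which is why the table lists both.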

SLO Selection Process

Measure baseline performance
Identify user expectations
Set achievable targets (10-20% better than baseline)
Iterate as system matures
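
As a worked example of the third step: with a measured baseline p95 of 250 ms, a 10-20% improvement puts the target between 200 ms and 225 ms. A minimal sketch of that arithmetic, with the hypothetical helper name `slo_target_range`:

```python
from typing import Tuple


def slo_target_range(
    baseline_ms: float, low_pct: float = 10, high_pct: float = 20
) -> Tuple[float, float]:
    """Return (ambitious, conservative) latency targets relative to a baseline.

    The ambitious target improves on the baseline by high_pct percent,
    the conservative one by low_pct percent.
    """
    ambitious = baseline_ms * (100 - high_pct) / 100
    conservative = baseline_ms * (100 - low_pct) / 100
    return ambitious, conservative
```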

For detailed SLO framework and performance budgets, see

references/slo-framework.md

CI/CD Integration

Performance Testing in Pipelines

GitHub Actions example:

yaml

name: performance_engineering

on:
  pull_request:
    branches: [main]

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install k6
        run: |
          curl https://github.com/grafana/k6/releases/download/v0.48.0/k6-v0.48.0-linux-amd64.tar.gz -L | tar xvz
          sudo mv k6-v0.48.0-linux-amd64/k6 /usr/local/bin/

      - name: Run load test
        run: k6 run tests/load/api-test.js

Performance budgets:

javascript

// k6 test with thresholds (fail build if violated)
export const options = {
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};
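
Fixed thresholds catch absolute violations. To catch relative regressions against a previous run, a pipeline step can also compare the current metrics with a stored baseline. The sketch below assumes both runs have already been reduced to plain `{metric_name: value}` dictionaries. That format, the 10% default tolerance, and the name `find_regressions` are all hypothetical and not tied to any tool's output.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance_pct: float = 10.0,
) -> List[str]:
    """Return messages for metrics that are worse than baseline by more than tolerance_pct.

    Assumes every metric is "lower is better" (latency, error rate).
    """
    problems: List[str] = []
    for name, base_value in baseline.items():
        if name not in current:
            problems.append(f"{name}: missing from current run")
            continue
        allowed = base_value * (1 + tolerance_pct / 100)
        if current[name] > allowed:
            problems.append(
                f"{name}: {current[name]:.1f} exceeds allowed {allowed:.1f} "
                f"(baseline {base_value:.1f})"
            )
    return problems
```

In a CI job, a non-empty result would be printed and turned into a non-zero exit code so that the build fails.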

Profiling Workflow

Standard process:

Observe symptoms (high CPU, memory growth, slow response)
Hypothesize bottleneck (CPU? Memory? I/O?)
Choose profiling type based on hypothesis
Run profiler under realistic load
Analyze profile (flamegraph, call tree)
Identify hot spots (typically the roughly 20% of functions that consume about 80% of resources)
Optimize bottlenecks
Re-profile to validate improvement
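
The hot-spot step can be made mechanical once a profiler has produced per-function sample counts. This sketch assumes the counts are available as a plain dictionary (how you export them depends on the profiler). `hot_spots` is a hypothetical helper that returns the smallest set of top functions covering a given share of samples.

```python
from typing import Dict, List, Tuple


def hot_spots(samples: Dict[str, int], coverage: float = 0.8) -> List[Tuple[str, float]]:
    """Return the top functions whose combined samples reach `coverage` of the total.

    Each entry is (function_name, share_of_total_samples).
    """
    total = sum(samples.values())
    if total == 0:
        return []
    selected: List[Tuple[str, float]] = []
    running = 0
    for name, count in sorted(samples.items(), key=lambda kv: kv[1], reverse=True):
        selected.append((name, count / total))
        running += count
        if running / total >= coverage:
            break
    return selected
```

Optimizing only the functions returned here and then re-profiling keeps the effort focused on the biggest bottlenecks first.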

Best practices:

Profile under realistic load (not idle systems)
Use sampling profilers (py-spy, pprof) in production (low overhead)
Focus on hot paths (optimize biggest bottlenecks first)
Validate optimizations with before/after comparisons

Tool Recommendations

Load Testing

Primary: k6 (JavaScript-based, Grafana-backed)

Modern architecture, cloud-native
JavaScript DSL (ES6+)
Grafana/Prometheus integration
Multi-protocol (HTTP/1.1, HTTP/2, WebSocket, gRPC)

When to use: Modern APIs, microservices, CI/CD integration.

Alternative: Locust (Python-based)

Python-native (write tests in Python)
Web UI for real-time monitoring
Flexible for complex user scenarios

When to use: Python-heavy teams, complex user flows.

Profiling

Python:

py-spy (sampling, production-safe)
cProfile (deterministic, detailed)
memory_profiler (memory leak detection)
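
Of these, cProfile ships with Python, so a deterministic profile needs no installation. A minimal sketch follows; `slow_sum` is a hypothetical workload and `profile_call` a hypothetical wrapper.

```python
import cProfile
import io
import pstats
from typing import Any, Callable


def slow_sum(n: int) -> int:
    """Hypothetical workload: a deliberately naive summation loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func: Callable[..., Any], *args: Any) -> str:
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_call(slow_sum, 200_000))
```

Because cProfile instruments every call, its overhead is higher than a sampling profiler's. It suits local investigation better than production use.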

Go:

pprof (built-in, CPU/heap/goroutine/block profiling)

TypeScript/JavaScript:

Chrome DevTools (browser/Node.js)
clinic.js (Node.js performance suite)

For detailed tool comparisons, see

references/testing-types.md

and

references/profiling-guide.md

Reference Documentation

Detailed Guides:

references/testing-types.md - Load, stress, soak, spike testing patterns
references/profiling-guide.md - CPU, memory, I/O profiling across languages
references/optimization-strategies.md - Caching, database, API optimization
references/frontend-performance.md - Core Web Vitals, bundle optimization
references/slo-framework.md - Setting SLOs, performance budgets
references/benchmarking.md - Benchmarking best practices

Examples:

examples/k6/ - Load, stress, soak, spike tests
examples/locust/ - Python-based load testing
examples/profiling/ - Profiling examples (Python, Go, TypeScript)
examples/optimization/ - Caching, query, API optimization

Related Skills

🔄 Workflow

Source: k6 Methodology & The Art of Capacity Planning

Phase 1: Planning & SLOs

Goal: What is the purpose of the test? (Smoke, Load, Stress, Soak?)
SLOs: Define the success criteria (e.g., p95 latency < 200ms, error rate < 1%).
Environment: How closely does the test environment resemble production? (Determine the scaling factor.)

Phase 2: Scripting & Execution

User Journey: Simulate real user behavior (Login -> Browse -> Buy).
Data Driven: Feed the test with dynamic data from a CSV file rather than static data (to get past the cache).
Ramp-up: Increase traffic gradually rather than all at once (so the system can warm up).
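
The data-driven and ramp-up points can be sketched with the Python standard library. The column names `user` and `product` and the helpers `load_rows`, `cycle_rows`, and `ramp_up` are hypothetical. In a real test the rows would come from a CSV file on disk and be handed to whichever load tool you use.

```python
import csv
import io
import itertools
from typing import Dict, Iterator, List, Tuple


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def cycle_rows(rows: List[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Cycle through the rows forever so successive requests use varied input."""
    return itertools.cycle(rows)


def ramp_up(target_users: int, steps: int, step_seconds: int) -> List[Tuple[int, int]]:
    """Split the climb to target_users into equal steps of (duration_seconds, users)."""
    return [(step_seconds, target_users * i // steps) for i in range(1, steps + 1)]
```

Varying the request data keeps the test from measuring only cache hits, and stepping the user count gives connection pools and caches time to warm up before peak load.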

Phase 3: Analysis & Optimization

Correlation: What state were the CPU/memory/DB metrics in when the errors occurred?
Bottleneck: Where is the bottleneck? (App code, DB, network, or the load injector itself?)
Report: Prepare a report that includes both a technical summary and an executive summary.

Checkpoints

Phase	Verification
1	Is the test data (database seed) large enough in volume?
2	Did the load generator (test machine) hit a CPU bottleneck? (Risk of false negatives.)
3	Are third-party APIs (Stripe, Twilio) mocked? (Risk of charges and bans.)