performance_engineering


Performance Engineering

Purpose

Performance engineering encompasses load testing, profiling, and optimization to deliver reliable, scalable systems. This skill provides frameworks for choosing the right performance testing approach (load, stress, soak, spike), profiling techniques to identify bottlenecks (CPU, memory, I/O), and optimization strategies for backend APIs, databases, and frontend applications.
Use this skill to validate system capacity before launch, detect performance regressions in CI/CD pipelines, identify and resolve bottlenecks through profiling, and optimize application responsiveness across the stack.

When to Use This Skill

Common Triggers:
  • "Validate API can handle expected traffic"
  • "Find maximum capacity and breaking points"
  • "Identify why the application is slow"
  • "Detect memory leaks or resource exhaustion"
  • "Optimize Core Web Vitals for SEO"
  • "Set up performance testing in CI/CD"
  • "Reduce cloud infrastructure costs"
Use Cases:
  • Pre-launch capacity planning and load validation
  • Post-refactor performance regression testing
  • Investigating slow response times or high latency
  • Detecting memory leaks in long-running services
  • Optimizing database query performance
  • Validating auto-scaling configuration
  • Establishing performance SLOs and budgets

Performance Testing Types

Load Testing

Validate system behavior under expected traffic levels.
When to use: Pre-launch capacity planning, regression testing after refactors, validating auto-scaling.

Stress Testing

Find system capacity limits and failure modes.
When to use: Capacity planning, understanding failure behavior, infrastructure sizing decisions.

Soak Testing

Identify memory leaks, resource exhaustion, and degradation over time.
When to use: Detecting memory leaks, validating connection pool cleanup, testing long-running batch jobs.

Spike Testing

Validate system response to sudden traffic spikes.
When to use: Validating auto-scaling, testing event-driven systems (product launches), ensuring rate limiting works.

Quick Decision Framework

Which test type to use?
What am I trying to learn?
├─ Can my system handle expected traffic? → LOAD TEST
├─ What's the maximum capacity? → STRESS TEST
├─ Will it stay stable over time? → SOAK TEST
└─ Can it handle traffic spikes? → SPIKE TEST
For detailed testing patterns, load scenarios, and guidance on interpreting results, see references/testing-types.md.
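The four test types differ mainly in the shape of the load over time. A minimal sketch of those shapes, written as (duration, target users) stages in the spirit of a load tool's ramp configuration. The function name `stages_for` and every duration and multiplier below are illustrative assumptions, not recommendations:

```python
from typing import Dict, List, Tuple

Stage = Tuple[int, int]  # (duration in seconds, target virtual users)


def stages_for(test_type: str, expected_users: int) -> List[Stage]:
    """Return an illustrative ramp profile for the given test type."""
    n = expected_users
    profiles: Dict[str, List[Stage]] = {
        # Ramp to expected traffic, hold, ramp down.
        "load": [(60, n), (600, n), (60, 0)],
        # Keep stepping past expected traffic to find the breaking point.
        "stress": [(120, n), (120, 2 * n), (120, 3 * n), (120, 4 * n), (60, 0)],
        # Hold expected traffic for hours so slow leaks become visible.
        "soak": [(300, n), (4 * 3600, n), (300, 0)],
        # Jump far above a low baseline within seconds, then drop back.
        "spike": [(60, n // 10), (10, 5 * n), (120, 5 * n), (10, n // 10), (60, 0)],
    }
    if test_type not in profiles:
        raise ValueError(f"unknown test type: {test_type!r}")
    return profiles[test_type]


def peak_users(stages: List[Stage]) -> int:
    """Highest target reached across all stages."""
    return max(target for _, target in stages)


def total_seconds(stages: List[Stage]) -> int:
    """Total planned duration of the profile."""
    return sum(duration for duration, _ in stages)
```

With 100 expected users, the stress profile steps from 100 up to 400 users, while the soak profile holds 100 users for four hours.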

Load Testing Quick Starts

k6 (JavaScript)

Installation:
bash
brew install k6  # macOS (Homebrew)
sudo apt-get install k6  # Debian/Ubuntu, after adding the k6 apt repository
Basic Load Test:
javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 20 },
    { duration: '1m', target: 20 },
    { duration: '30s', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/products');
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
  sleep(1);
}
Run:
k6 run script.js
For stress, soak, and spike testing examples, see examples/k6/.

Locust (Python)

Installation:
bash
pip install locust
Basic Load Test:
python
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)
    host = "https://api.example.com"

    @task(3)
    def view_products(self):
        self.client.get("/products")

    @task(1)
    def view_product_detail(self):
        self.client.get("/products/123")
Run:
locust -f locustfile.py --headless -u 100 -r 10 --run-time 10m
For REST API testing and data-driven testing, see examples/locust/.

Profiling Quick Starts

When to Profile

| Symptom | Profiling Type | Tool |
| --- | --- | --- |
| High CPU (>70%) | CPU Profiling | py-spy, pprof, DevTools |
| Memory growing | Memory Profiling | memory_profiler, pprof heap |
| Slow response, low CPU | I/O Profiling | Query logs, pprof block |

Python Profiling

py-spy (Production-Safe):
bash
pip install py-spy

# Profile a running process
py-spy record -o profile.svg --pid <PID> --duration 30

# Top-like view
py-spy top --pid <PID>
Memory Profiling:
python
from memory_profiler import profile

@profile
def my_function():
    a = [1] * (10 ** 6)
    return a

# Run: python -m memory_profiler script.py

Go Profiling

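pprof (Built-in):
go
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the profiling endpoints on localhost only, alongside the application
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    startApp() // the application's own entry point, defined elsewhere
}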
Capture profile:
bash
# CPU profile (30 seconds)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Interactive analysis
(pprof) top
(pprof) web

TypeScript/JavaScript Profiling

Chrome DevTools (Browser/Node.js):
Node.js:
bash
node --inspect app.js
# Open chrome://inspect
# Performance tab → Record
clinic.js (Node.js):
bash
npm install -g clinic
clinic doctor -- node app.js
For detailed profiling workflows and analysis, see references/profiling-guide.md and examples/profiling/.

Optimization Strategies

Caching

When to cache:
  • Data queried frequently (>100 req/min)
  • Data that tolerates staleness (more than 1 minute is acceptable)
Redis example:
python
import json

import redis

r = redis.Redis()

def get_cached_data(key, fn, ttl=300):
    """Cache-aside: return the cached value, or compute, store, and return it."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    data = fn()
    r.setex(key, ttl, json.dumps(data))  # entry expires after ttl seconds
    return data
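The same cache-aside flow can be exercised without a Redis server, which makes the expiry behaviour easy to check in a unit test. A minimal sketch, assuming an in-process dictionary in place of Redis and an injectable clock. The class name `TTLCache` and its method are hypothetical, not part of any library:

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, json)
        self._clock = clock

    def get_or_load(self, key: str, loader: Callable[[], Any], ttl: float = 300.0) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return json.loads(entry[1])  # hit: the entry has not expired yet
        data = loader()  # miss or expired: recompute and store
        self._store[key] = (now + ttl, json.dumps(data))
        return data


if __name__ == "__main__":
    cache = TTLCache()
    print(cache.get_or_load("products", lambda: [{"id": 1}], ttl=60))
```

Passing a fake clock lets a test advance time past the TTL and confirm that the loader runs again only after expiry.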

Database Query Optimization

N+1 prevention:
python
from sqlalchemy.orm import joinedload

# User is the application's own model (Flask-SQLAlchemy style query API)

# Bad: N+1 queries
users = User.query.all()
for user in users:
    print(user.orders)  # Separate query per user

# Good: Eager loading
users = User.query.options(joinedload(User.orders)).all()
Indexing:
sql
CREATE INDEX idx_users_email ON users(email);
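The cost of the N+1 pattern is easiest to see by counting round trips. A minimal sketch using the standard library's `sqlite3` rather than an ORM. The schema, the sample rows, and the function names are assumptions made up for the illustration:

```python
import sqlite3
from typing import Dict, List, Tuple


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database with three users and three orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        CREATE INDEX idx_orders_user_id ON orders(user_id);
        """
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(i, f"u{i}@example.com") for i in range(1, 4)],
    )
    conn.executemany(
        "INSERT INTO orders (user_id, total) VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    return conn


def orders_n_plus_one(conn: sqlite3.Connection) -> Tuple[Dict[int, List[float]], int]:
    """One query for the users, then one more query per user."""
    queries = 1
    users = conn.execute("SELECT id FROM users ORDER BY id").fetchall()
    result: Dict[int, List[float]] = {}
    for (user_id,) in users:
        rows = conn.execute(
            "SELECT total FROM orders WHERE user_id = ? ORDER BY id", (user_id,)
        ).fetchall()
        queries += 1
        result[user_id] = [total for (total,) in rows]
    return result, queries


def orders_single_join(conn: sqlite3.Connection) -> Tuple[Dict[int, List[float]], int]:
    """Fetch users and their orders in a single joined query."""
    rows = conn.execute(
        "SELECT u.id, o.total FROM users u "
        "LEFT JOIN orders o ON o.user_id = u.id ORDER BY u.id, o.id"
    ).fetchall()
    result: Dict[int, List[float]] = {}
    for user_id, total in rows:
        result.setdefault(user_id, [])
        if total is not None:  # users without orders yield a NULL total
            result[user_id].append(total)
    return result, 1
```

Both functions return the same data, but the first issues one query per user on top of the initial one, so its query count grows linearly with the number of users.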

API Performance

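Cursor-based pagination:
typescript
app.get('/api/products', async (req, res) => {
  // Query parameters arrive as strings, so parse them before binding
  const cursor = Number(req.query.cursor) || 0;
  const limit = Math.min(Number(req.query.limit) || 20, 100); // cap of 100 is illustrative

  const products = await db.query(
    'SELECT * FROM products WHERE id > ? ORDER BY id LIMIT ?',
    [cursor, limit]
  );

  res.json({
    data: products,
    // A short page means there is nothing left to fetch
    next_cursor: products.length === limit ? products[products.length - 1].id : null,
  });
});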

Frontend Performance (Core Web Vitals)

Key metrics:
  • LCP (Largest Contentful Paint): < 2.5s
  • INP (Interaction to Next Paint): < 200ms
  • CLS (Cumulative Layout Shift): < 0.1
Optimization techniques:
  • Code splitting (lazy loading)
  • Image optimization (WebP, responsive, lazy loading)
  • Preload critical resources
  • Minimize render-blocking resources
For detailed optimization strategies, see references/optimization-strategies.md and references/frontend-performance.md.

Performance SLOs

Recommended SLOs by Service Type

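| Service Type | p95 Latency | p99 Latency | Availability |
| --- | --- | --- | --- |
| User-Facing API | < 200ms | < 500ms | 99.9% |
| Internal API | < 100ms | < 300ms | 99.5% |
| Database Query | < 50ms | < 100ms | 99.99% |
| Background Job | < 5s | < 10s | 99% |
| Real-time API | < 50ms | < 100ms | 99.95% |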

SLO Selection Process

  1. Measure baseline performance
  2. Identify user expectations
  3. Set achievable targets (10-20% better than baseline)
  4. Iterate as system matures
For detailed SLO framework and performance budgets, see references/slo-framework.md.
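The first and third steps of the selection process come down to simple arithmetic: compute percentiles from baseline samples, then tighten them by the chosen margin. A minimal sketch, assuming latency samples in milliseconds and a nearest-rank percentile. The function names and the 15% default are illustrative choices within the 10-20% range:

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def propose_targets(samples_ms: Sequence[float], improvement: float = 0.15) -> Dict[str, float]:
    """Suggest p95/p99 targets that are `improvement` better than the baseline."""
    return {
        f"p{pct}": percentile(samples_ms, pct) * (1 - improvement)
        for pct in (95, 99)
    }


if __name__ == "__main__":
    baseline = [float(ms) for ms in range(1, 101)]  # synthetic samples: 1..100 ms
    print(propose_targets(baseline))
```

Real baselines should come from production or load-test measurements over a representative period; the synthetic range here only keeps the example self-contained.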

CI/CD Integration

Performance Testing in Pipelines

GitHub Actions example:
yaml
name: performance_engineering

on:
  pull_request:
    branches: [main]

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install k6
        run: |
          curl https://github.com/grafana/k6/releases/download/v0.48.0/k6-v0.48.0-linux-amd64.tar.gz -L | tar xvz
          sudo mv k6-v0.48.0-linux-amd64/k6 /usr/local/bin/

      - name: Run load test
        run: k6 run tests/load/api-test.js
Performance budgets:
javascript
// k6 test with thresholds (fail build if violated)
export const options = {
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};
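Fixed thresholds catch hard budget violations, but a build can stay under budget while still getting noticeably slower. A minimal sketch of a gate that also compares the current run against a stored baseline. The function name, the 10% tolerance, and the way the two p95 values are obtained are all assumptions; extracting them from the load tool's output is left out:

```python
from typing import List


def check_regression(
    current_p95_ms: float,
    baseline_p95_ms: float,
    budget_ms: float = 500.0,
    tolerance: float = 0.10,
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures: List[str] = []
    if current_p95_ms >= budget_ms:
        failures.append(
            f"p95 {current_p95_ms:.0f}ms exceeds budget of {budget_ms:.0f}ms"
        )
    allowed = baseline_p95_ms * (1 + tolerance)
    if current_p95_ms > allowed:
        failures.append(
            f"p95 {current_p95_ms:.0f}ms regressed beyond {allowed:.0f}ms "
            f"(baseline {baseline_p95_ms:.0f}ms + {tolerance:.0%})"
        )
    return failures


if __name__ == "__main__":
    problems = check_regression(current_p95_ms=450.0, baseline_p95_ms=400.0)
    for problem in problems:
        print(problem)
    raise SystemExit(1 if problems else 0)  # non-zero exit fails the CI step
```

Some tolerance is needed because load-test results vary from run to run; too tight a margin produces flaky builds.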

Profiling Workflow

Standard process:
  1. Observe symptoms (high CPU, memory growth, slow response)
  2. Hypothesize bottleneck (CPU? Memory? I/O?)
  3. Choose profiling type based on hypothesis
  4. Run profiler under realistic load
  5. Analyze profile (flamegraph, call tree)
  6. Identify hot spots (often the top 20% of functions account for about 80% of resource use)
  7. Optimize bottlenecks
  8. Re-profile to validate improvement
Best practices:
  • Profile under realistic load (not idle systems)
  • Use sampling profilers (py-spy, pprof) in production (low overhead)
  • Focus on hot paths (optimize biggest bottlenecks first)
  • Validate optimizations with before/after comparisons
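Steps 4 to 6 of the process can be tried locally with the standard library. A minimal sketch using `cProfile`, a deterministic profiler better suited to development than production, with `pstats` to rank functions by cumulative time. The workload functions are hypothetical stand-ins for real application code:

```python
import cProfile
import io
import pstats
from typing import Callable


def slow_sum(n: int) -> int:
    """Deliberately heavy loop that should dominate the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_path() -> int:
    """Cheap function that should barely register."""
    return sum(range(100))


def workload() -> None:
    slow_sum(200_000)
    fast_path()


def profile_top(func: Callable[[], None], limit: int = 5) -> str:
    """Profile func and return a text report of the costliest functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_top(workload))
```

Running the same report before and after a change gives the before/after comparison that the last best practice calls for.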

Tool Recommendations

Load Testing

Primary: k6 (JavaScript-based, Grafana-backed)
  • Modern architecture, cloud-native
  • JavaScript DSL (ES6+)
  • Grafana/Prometheus integration
  • Multi-protocol (HTTP/1.1, HTTP/2, WebSocket, gRPC)
When to use: Modern APIs, microservices, CI/CD integration.
Alternative: Locust (Python-based)
  • Python-native (write tests in Python)
  • Web UI for real-time monitoring
  • Flexible for complex user scenarios
When to use: Python-heavy teams, complex user flows.

Profiling

Python:
  • py-spy (sampling, production-safe)
  • cProfile (deterministic, detailed)
  • memory_profiler (memory leak detection)
Go:
  • pprof (built-in, CPU/heap/goroutine/block profiling)
TypeScript/JavaScript:
  • Chrome DevTools (browser/Node.js)
  • clinic.js (Node.js performance suite)
For detailed tool comparisons, see references/testing-types.md and references/profiling-guide.md.

Reference Documentation

Detailed Guides:
  • references/testing-types.md
    - Load, stress, soak, spike testing patterns
  • references/profiling-guide.md
    - CPU, memory, I/O profiling across languages
  • references/optimization-strategies.md
    - Caching, database, API optimization
  • references/frontend-performance.md
    - Core Web Vitals, bundle optimization
  • references/slo-framework.md
    - Setting SLOs, performance budgets
  • references/benchmarking.md
    - Benchmarking best practices
Examples:
  • examples/k6/
    - Load, stress, soak, spike tests
  • examples/locust/
    - Python-based load testing
  • examples/profiling/
    - Profiling examples (Python, Go, TypeScript)
  • examples/optimization/
    - Caching, query, API optimization

Related Skills

For comprehensive testing strategies, see the testing-strategies skill.
For CI/CD integration patterns, see the building-ci-pipelines skill.
For infrastructure sizing based on load tests, see the infrastructure-as-code skill.
Performance Engineering v1.1 - Enhanced

🔄 Workflow

Phase 1: Planning & SLOs

  • Goal: What is the purpose of the test? (Smoke, load, stress, or soak?)
  • SLOs: Define the success criteria (e.g., p95 latency < 200ms, error rate < 1%).
  • Environment: How closely does the test environment resemble production? (Determine the scaling factor.)

Phase 2: Scripting & Execution

  • User Journey: Simulate real user behavior (Login -> Browse -> Buy).
  • Data Driven: Feed the test with dynamic data from a CSV file rather than static data (so requests are not all served from cache).
  • Ramp-up: Increase traffic gradually rather than all at once (so the system can warm up).
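The data-driven point can be sketched with the standard library alone: load the CSV once, then hand out rows in rotation so consecutive requests use different inputs. The column names and the helper name `row_feeder` are assumptions for illustration. In a real test the source would be an opened file rather than an in-memory string:

```python
import csv
import io
import itertools
from typing import Dict, Iterator, TextIO


def row_feeder(source: TextIO) -> Iterator[Dict[str, str]]:
    """Yield CSV rows as dicts in an endless cycle."""
    rows = list(csv.DictReader(source))
    if not rows:
        raise ValueError("CSV contains no data rows")
    return itertools.cycle(rows)


if __name__ == "__main__":
    # Hypothetical test data; a real run would read a much larger file.
    sample = io.StringIO("username,product_id\nalice,101\nbob,202\n")
    feeder = row_feeder(sample)
    for _ in range(3):
        print(next(feeder))
```

The more distinct rows the file holds, the less likely it is that the test only measures cache hits.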

Phase 3: Analysis & Optimization

  • Correlation: What state were the CPU, memory, and DB metrics in when the errors occurred?
  • Bottleneck: Where is the bottleneck? (App code, DB, network, or the load injector itself?)
  • Report: Prepare a report that includes both a technical summary and an executive summary.

Checkpoints

| Phase | Verification |
| --- | --- |
| 1 | Is the test data (database seed) of sufficient volume? |
| 2 | Did the load generator (test machine) hit a CPU bottleneck? (Risk of false negatives.) |
| 3 | Are 3rd-party APIs (Stripe, Twilio) mocked? (Risk of cost and bans.) |