m10-performance

CRITICAL: Use for performance optimization. Triggers: performance, optimization, benchmark, profiling, flamegraph, criterion, slow, fast, allocation, cache, SIMD, make it faster, 性能优化 (performance optimization), 基准测试 (benchmarking)

NPX Install

npx skill4agent add actionbook/rust-skills m10-performance

SKILL.md Content

Performance Optimization

Layer 2: Design Choices

Core Question

What's the bottleneck, and is optimization worth it?
Before optimizing:
  • Have you measured? (Don't guess)
  • What's the acceptable performance?
  • Will optimization add complexity?

Performance Decision → Implementation

| Goal | Design Choice | Implementation |
|------|---------------|----------------|
| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
| Improve cache | Contiguous data | `Vec`, `SmallVec` |
| Parallelize | Data parallelism | `rayon`, threads |
| Avoid copies | Zero-copy | References, `Cow<T>` |
| Reduce indirection | Inline data | `smallvec`, arrays |
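
As an illustration of the "Avoid copies" row, here is a minimal sketch using `Cow<str>` from the standard library. The helper name `normalize_tabs` is hypothetical and chosen only for this example; the point is that the common case returns a borrow and only the rare case allocates.

```rust
use std::borrow::Cow;

/// Replaces tabs with single spaces, allocating only when the input
/// actually contains a tab. `normalize_tabs` is a hypothetical helper.
fn normalize_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        // Slow path: a new String is needed to hold the modified text.
        Cow::Owned(input.replace('\t', " "))
    } else {
        // Fast path: hand back the original slice, no allocation or copy.
        Cow::Borrowed(input)
    }
}
```

Callers can treat the result as a `&str` through deref, so the borrowed and owned cases look the same at the call site. Whether this pays off depends on how often the slow path is hit, which is worth measuring.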

Thinking Prompt

Before optimizing:
  1. Have you measured?
    • Profile first → flamegraph, perf
    • Benchmark → criterion, cargo bench
    • Identify actual hotspots
  2. What's the priority?
    • Algorithm (10x-1000x improvement)
    • Data structure (2x-10x)
    • Allocation (2x-5x)
    • Cache (1.5x-3x)
  3. What's the trade-off?
    • Complexity vs speed
    • Memory vs CPU
    • Latency vs throughput
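
For step 1, a rough first measurement can be taken with nothing but the standard library. The sketch below is an assumption-laden stand-in, not a replacement for criterion: it has no warm-up, no outlier handling, and no statistics. `time_it` and `sum_of_squares` are hypothetical names. Build with `--release` before trusting any number it produces.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the last result plus total elapsed time.
/// `black_box` discourages the optimizer from discarding the computed value.
fn time_it<T>(iters: u32, mut f: impl FnMut() -> T) -> (T, Duration) {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    let mut last = black_box(f());
    for _ in 1..iters {
        last = black_box(f());
    }
    (last, start.elapsed())
}

/// Hypothetical workload: sum of squares over a slice.
fn sum_of_squares(data: &[u64]) -> u64 {
    data.iter().map(|x| x * x).sum()
}
```

A call such as `time_it(1_000, || sum_of_squares(&data))` gives a total duration to divide by the iteration count. If the result matters for a decision, moving the same closure into a criterion benchmark is the safer next step.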

Trace Up ↑

To domain constraints (Layer 3):
"How fast does this need to be?"
    ↑ Ask: What's the performance SLA?
    ↑ Check: domain-* (latency requirements)
    ↑ Check: Business requirements (acceptable response time)

| Question | Trace To | Ask |
|----------|----------|-----|
| Latency requirements | domain-* | What's the acceptable response time? |
| Throughput needs | domain-* | How many requests per second? |
| Memory constraints | domain-* | What's the memory budget? |

Trace Down ↓

To implementation (Layer 1):
"Need to reduce allocations"
    ↓ m01-ownership: Use references, avoid clone
    ↓ m02-resource: Pre-allocate with_capacity

"Need to parallelize"
    ↓ m07-concurrency: Choose rayon or threads
    ↓ m07-concurrency: Consider async for I/O-bound

"Need cache efficiency"
    ↓ Data layout: Prefer Vec over HashMap when possible
    ↓ Access patterns: Sequential over random access
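
The "Need to parallelize" path with plain threads might look like the sketch below. It uses `std::thread::scope` so workers can borrow the input slice directly. `parallel_sum` and its `workers` parameter are hypothetical; with rayon the same idea is usually a one-line parallel iterator, which is not shown here.

```rust
use std::thread;

/// Sums `data` by splitting it into at most `workers` contiguous chunks,
/// with one scoped thread per chunk.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    if data.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk_len = (data.len() + workers - 1) / workers;

    thread::scope(|s| {
        // Collect the handles first so all threads start before any join.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

Each worker reads a contiguous chunk, which also fits the "sequential over random access" advice. For small inputs the thread start-up cost can easily exceed the work, so measure both versions.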

Quick Reference

| Tool | Purpose |
|------|---------|
| `cargo bench` | Micro-benchmarks |
| `criterion` | Statistical benchmarks |
| `perf` / `flamegraph` | CPU profiling |
| `heaptrack` | Allocation tracking |
| `valgrind` / `cachegrind` | Cache analysis |

Optimization Priority

1. Algorithm choice     (10x - 1000x)
2. Data structure       (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization   (1.5x - 3x)
5. SIMD/Parallelism     (2x - 8x)
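
To show why algorithm choice sits at the top of the list, here is a small sketch of the same task done two ways. The duplicate-detection task and both function names are assumptions picked for illustration; the multipliers in the list above are rough guides, not guarantees.

```rust
use std::collections::HashSet;

/// O(n^2): compares each element against everything after it.
fn has_duplicate_quadratic(items: &[u32]) -> bool {
    for (i, a) in items.iter().enumerate() {
        if items[i + 1..].contains(a) {
            return true;
        }
    }
    false
}

/// Expected O(n): one pass with a hash set.
fn has_duplicate_linear(items: &[u32]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // `insert` returns false when the value was already present.
    items.iter().any(|item| !seen.insert(*item))
}
```

For very small inputs the quadratic version can still win because it allocates nothing, which is one reason to measure before switching.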

Common Techniques

| Technique | When | How |
|-----------|------|-----|
| Pre-allocation | Known size | `Vec::with_capacity(n)` |
| Avoid cloning | Hot paths | Use references or `Cow<T>` |
| Batch operations | Many small ops | Collect then process |
| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
| Inline buffers | Fixed-size data | Arrays over Vec |
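
A minimal sketch of the pre-allocation row, plus buffer reuse as a related trick, using only the standard library. Both functions are hypothetical examples; the second assumes a batch job that needs a temporary string per input line.

```rust
/// Builds the squares of 0..n with a single up-front allocation.
fn squares(n: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        // No reallocation here: capacity was reserved for n elements.
        out.push(i * i);
    }
    out
}

/// Reuses one scratch buffer across all lines instead of allocating per line.
/// `clear` drops the contents but keeps the allocation.
fn longest_uppercased_len(lines: &[&str]) -> usize {
    let mut scratch = String::new();
    let mut longest = 0;
    for line in lines {
        scratch.clear();
        scratch.extend(line.chars().flat_map(char::to_uppercase));
        longest = longest.max(scratch.len());
    }
    longest
}
```

The scratch buffer grows to the largest line seen and then stops allocating, so the per-line cost becomes a copy rather than an allocation.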

Common Mistakes

| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| Optimize without profiling | Wrong target | Profile first |
| Benchmark in debug mode | Meaningless | Always `--release` |
| Use LinkedList | Cache unfriendly | `Vec` or `VecDeque` |
| Hidden `.clone()` | Unnecessary allocs | Use references |
| Premature optimization | Wasted effort | Make it work first |
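
The hidden `.clone()` row often comes from function signatures rather than from the call site. The sketch below contrasts two hypothetical signatures for the same computation; the names are invented for this example.

```rust
/// Takes ownership, so a caller that still needs the data must `.clone()` it.
fn total_len_owned(words: Vec<String>) -> usize {
    words.iter().map(|w| w.len()).sum()
}

/// Borrows a slice: no clone needed, and the caller keeps its Vec.
fn total_len_borrowed(words: &[String]) -> usize {
    words.iter().map(String::len).sum()
}
```

With the first signature a caller writes `total_len_owned(words.clone())` to keep using `words`, copying every string. With the second, `total_len_borrowed(&words)` does the same work without any allocation.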

Anti-Patterns

| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| Clone to avoid lifetimes | Performance cost | Proper ownership |
| Box everything | Indirection cost | Stack when possible |
| HashMap for small sets | Overhead | Vec with linear search |
| String concat in loop | O(n^2) | `String::with_capacity` or `format!` |
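
Two of these rows can be sketched briefly. String building turns quadratic when each iteration creates a fresh `String` from the accumulated value; appending into one pre-sized buffer avoids that. The example uses `write!` into the buffer, a close relative of `format!` that skips the temporary string. `join_numbers`, `LEVELS`, and `level_of` are hypothetical names, and the capacity estimate is a deliberate overestimate.

```rust
use std::fmt::Write;

/// Joins numbers as "1,2,3" into one pre-sized buffer.
fn join_numbers(nums: &[u32]) -> String {
    // Upper bound: at most 10 digits plus one comma per u32.
    let mut out = String::with_capacity(nums.len() * 11);
    for (i, n) in nums.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // Appends in place; no intermediate String is created.
        write!(out, "{n}").expect("writing to a String cannot fail");
    }
    out
}

/// A small fixed set stored as an array and searched linearly,
/// instead of building a HashMap for three entries.
const LEVELS: [(&str, u8); 3] = [("error", 0), ("warn", 1), ("info", 2)];

fn level_of(name: &str) -> Option<u8> {
    LEVELS.iter().find(|(k, _)| *k == name).map(|(_, v)| *v)
}
```

Where the crossover between linear search and hashing lies depends on key type and set size, so treat "small" as something to confirm with a benchmark.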

Related Skills

| When | See |
|------|-----|
| Reducing clones | m01-ownership |
| Concurrency options | m07-concurrency |
| Smart pointer choice | m02-resource |
| Domain requirements | domain-* |