Rust Performance Best Practices
Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.
When to Apply
Reference these guidelines when:
- Investigating slow Rust programs or high latency
- Optimizing build times or binary size
- Reviewing allocation-heavy code
- Debugging lock contention or thread scaling issues
- Setting up release profiles for production
- Working with async runtimes (Tokio, async-std)
When NOT to Apply
Skip these optimizations when:
- Code isn't in a hot path (profile first!)
- Readability would suffer significantly
- You haven't measured a performance problem
- The optimization requires unsafe code you can't verify
- Premature optimization would delay shipping
The Optimization Workflow
CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.
```text
OPTIMIZATION WORKFLOW

1. MEASURE FIRST
   └── Profile before changing anything
   └── Use cargo flamegraph, perf, or heaptrack
   └── Identify actual bottlenecks (don't guess!)

2. CHECK BUILD SETTINGS
   └── Release mode? (10-100x vs debug)
   └── LTO enabled? (5-20% improvement)
   └── Target CPU? (10-30% for SIMD)

3. FIX ALGORITHMIC ISSUES
   └── O(n²) → O(n log n) matters more than micro-opts
   └── Check data structure choices
   └── Avoid unnecessary work

4. REDUCE ALLOCATIONS
   └── Pre-size collections (with_capacity)
   └── Reuse buffers (clear + reuse)
   └── Avoid cloning (borrow instead)

5. OPTIMIZE HOT LOOPS
   └── Iterators over indices
   └── Reduce lock scope
   └── Batch I/O operations

6. MEASURE AGAIN
   └── Verify improvement with benchmarks
   └── Check for regressions elsewhere
   └── Document the optimization
```
Quick Profiling Commands
```bash
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp   # view dhat.out.<pid> with DHAT's dh_view.html

# Benchmark
cargo bench                  # All benchmarks
cargo bench hot_function     # Specific benchmark

# Allocation call counts (heaptrack above also gives call stacks)
ltrace -c ./target/release/myapp 2>&1 | grep -E 'malloc|free'

# Assembly inspection
cargo asm my_crate::hot_function --rust

# Syscall counts (summary prints at exit)
strace -c ./target/release/myapp 2>&1 | tail -30
```
Common Scenarios → Rules
"My Rust program is slow"
```text
Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements
```
"My binary is too large"
1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
"High memory usage"
1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate
"Lock contention / thread scaling"
1. Profile first: look for Mutex::lock / futex time in the flamegraph
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats
"Slow file I/O"
1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate
Rule Categories
| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug→release) | build- |
| 2 | Benchmarking | Enables measurement | |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | data- |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Async/Await | Latency/throughput for async code | |
| 9 | Unsafe | 5-30% in tight loops (experts only) | |
1. Build Profiles (CRITICAL)
These apply to ALL Rust code. Check these first.
| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level=3 for speed, 'z' for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| | 5-15% | codegen-units=1 for max optimization |
| build-panic-abort | Binary size | panic='abort' removes unwinding |
| | 10-30% | target-cpu=native for SIMD |
| | 5-20% | Profile-guided optimization |
| | 5-10% | Disable for release builds |
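A Cargo.toml sketch pulling the speed-oriented settings above together (the exact values are illustrative; target-cpu is passed via RUSTFLAGS because it is not a profile key):

```toml
[profile.release]
opt-level = 3        # or "z" when binary size matters more than speed
lto = "fat"          # cross-crate inlining; "thin" builds faster with most of the benefit
codegen-units = 1    # better codegen at the cost of parallel compilation
panic = "abort"      # removes unwinding machinery and shrinks the binary
strip = true         # drop symbols from the shipped binary
debug = 0            # no debug info (re-enable locally when you need backtraces)
```

Build with `RUSTFLAGS="-C target-cpu=native" cargo build --release` to unlock SIMD for the host CPU; skip that flag for binaries distributed to older machines.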
2. Benchmarking (REQUIRED)
You can't optimize what you don't measure.
| Rule | Purpose |
|---|---|
| | Benchmark with criterion |
| | Bench profile enables optimizations |
| | Prevent dead code elimination (black_box) |
| | I/O variance destroys measurements |
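A minimal criterion sketch showing the setup these rules assume (hot_function and its input are illustrative; the bench target also needs `harness = false` under `[[bench]]` in Cargo.toml):

```rust
// benches/hot.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for the function you actually want to measure.
fn hot_function(xs: &[u64]) -> u64 {
    xs.iter().map(|x| x.wrapping_mul(31)).sum()
}

fn bench_hot(c: &mut Criterion) {
    let input: Vec<u64> = (0..10_000).collect();
    c.bench_function("hot_function", |b| {
        // black_box keeps the optimizer from deleting the "unused" result
        b.iter(|| hot_function(black_box(&input)))
    });
}

criterion_group!(benches, bench_hot);
criterion_main!(benches);
```

Run it with `cargo bench hot_function`.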
3. Allocation
Heap allocations are expensive and can fall through to a syscall. Reduce them.
| Rule | Impact | Pattern |
|---|---|---|
| alloc-vec-with-capacity | 2-10x | Vec::with_capacity(n), not growth from Vec::new() |
| alloc-string-with-capacity | 2-5x | String::with_capacity(n) |
| alloc-hashmap-with-capacity | 2-5x | HashMap::with_capacity(n) |
| alloc-reuse-buffers | 2-10x | clear() and reuse, don't reallocate (up to 50x in tight loops) |
| alloc-use-slices-in-apis | Flexibility | &[T] / &str, not &Vec<T> / &String in parameters |
| alloc-avoid-clone | 2-10x | Borrow instead of clone() (benefits scale with data size) |
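A short sketch of the pre-size and reuse rows (join_lines and log_rows are illustrative names):

```rust
use std::fmt::Write as _;

// Pre-size when the final length is known or estimable: one allocation up front.
fn join_lines(lines: &[&str]) -> String {
    let total: usize = lines.iter().map(|l| l.len() + 1).sum();
    let mut out = String::with_capacity(total);
    for line in lines {
        out.push_str(line);
        out.push('\n');
    }
    out
}

// Reuse one buffer across iterations: clear() resets the length but keeps the capacity.
fn log_rows(rows: &[u64], sink: &mut String) {
    let mut buf = String::with_capacity(32);
    for row in rows {
        buf.clear();
        let _ = write!(buf, "row={row};"); // no new allocation once capacity suffices
        sink.push_str(&buf);               // borrow the buffer, don't clone it
    }
}
```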
4. Data Structures
The right data structure beats micro-optimization.
| Rule | When |
|---|---|
| | Almost always (Vec wins) |
| data-choose-vecdeque-for-queue | FIFO queues |
| | HashMap = O(1) lookups, BTreeMap = sorted iteration |
| | Insert-or-update patterns (entry API) |
| | FFI newtypes |
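The insert-or-update row in code, a sketch with an illustrative word_counts helper:

```rust
use std::collections::HashMap;

// entry() does the insert-or-update with a single hash lookup,
// instead of a contains_key/get check followed by insert.
fn word_counts<'a>(words: &[&'a str]) -> HashMap<&'a str, u32> {
    let mut counts = HashMap::with_capacity(words.len());
    for &word in words {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}
```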
5. Iteration
Iterators are as fast as loops and safer.
| Rule | Impact | Pattern |
|---|---|---|
| iter-avoid-collect-then-loop | 2-3x | Chain iterators, don't collect |
| | 2-3x | Chain adapters, not intermediate Vecs |
| | Short-circuit | any()/find() to stop early, not full scans |
| | In-place | retain()/iter_mut(), not filter + collect into a new Vec |
| | O(log n) | binary_search() on sorted data |
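A sketch contrasting collect-then-loop with a single chained pipeline (sum_even_squares is an illustrative name):

```rust
// BAD: materializes an intermediate Vec just to loop over it
fn sum_even_squares_slow(xs: &[u64]) -> u64 {
    let evens: Vec<u64> = xs.iter().copied().filter(|x| x % 2 == 0).collect();
    let mut sum = 0;
    for x in evens {
        sum += x * x;
    }
    sum
}

// GOOD: chain the adapters; no intermediate allocation, same result
fn sum_even_squares(xs: &[u64]) -> u64 {
    xs.iter()
        .copied()
        .filter(|x| x % 2 == 0)
        .map(|x| x * x)
        .sum()
}
```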
6. Synchronization
Locks are expensive. Minimize contention.
| Rule | Impact | When |
|---|---|---|
| | Avoids copying | Share large (>64B) data across threads via Arc |
| sync-use-rwlock | 2-8x for reads | >80% reads, few writes; consider parking_lot |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
| | 1.5-5x | Prefer parking_lot over std sync primitives |
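A sketch of the atomics and lock-scope rows (REQUESTS and snapshot are illustrative names):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Simple counter: an atomic needs no lock at all.
static REQUESTS: AtomicU64 = AtomicU64::new(0);

fn record_request() {
    REQUESTS.fetch_add(1, Ordering::Relaxed);
}

// Keep the lock scope short: copy out what you need, drop the guard, then do the slow part.
fn snapshot(shared: &Mutex<Vec<u64>>) -> String {
    let sum: u64 = {
        let data = shared.lock().unwrap(); // guard is dropped at the end of this block
        data.iter().sum()
    };
    format!("total = {sum}") // formatting runs without the lock held
}
```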
7. I/O
Every syscall costs. Buffer them.
| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap in BufReader |
| io-use-bufwriter | 18x | Wrap in BufWriter |
| io-flush-bufwriter | CRITICAL | Must flush() or lose data! |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line() |
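A sketch combining all four rows (copy_non_comments and the '#' filter are illustrative):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};

fn copy_non_comments(src: &str, dst: &str) -> io::Result<()> {
    let mut reader = BufReader::new(File::open(src)?);    // batches read syscalls
    let mut writer = BufWriter::new(File::create(dst)?);  // batches write syscalls

    let mut line = String::new();                         // one buffer, reused per line
    while reader.read_line(&mut line)? != 0 {
        if !line.starts_with('#') {
            writer.write_all(line.as_bytes())?;
        }
        line.clear();                                     // keep the capacity for the next line
    }
    writer.flush()                                        // flush explicitly; errors on drop are ignored
}
```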
8. Async/Await (HIGH)
Critical for Tokio and async-std applications.
| Rule | Impact | Pattern |
|---|---|---|
| | Prevents hang | Use spawn_blocking for CPU-bound work |
| | Latency | Yield periodically in long computations |
| | Correctness | Don't hold locks across .await points |
| | Throughput | Use async I/O, not std::fs in async contexts |
| | Backpressure | Prefer bounded channels for flow control |
Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.
```rust
// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data); // CPU-bound, blocks the executor!
    Ok(hash)
}

// GOOD: Offload to the blocking thread pool
async fn process(data: Vec<u8>) -> Result<Hash> {
    let hash = tokio::task::spawn_blocking(move || expensive_hash(&data)).await?;
    Ok(hash)
}
```
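And the backpressure row: a minimal sketch using Tokio's bounded mpsc channel (produce and the capacity of 128 are illustrative):

```rust
use tokio::sync::mpsc;

// send().await parks the producer when the buffer is full,
// instead of letting an unbounded queue grow without limit.
async fn produce(tx: mpsc::Sender<u64>) {
    for i in 0..10_000u64 {
        if tx.send(i).await.is_err() {
            break; // receiver dropped; stop producing
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(128); // bounded: capacity 128
    tokio::spawn(produce(tx));
    while let Some(item) = rx.recv().await {
        let _ = item; // handle the item
    }
}
```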
9. Unsafe (Expert Only)
Only after profiling proves these matter.
| Rule | Impact | Risk |
|---|---|---|
| get_unchecked | 5-30% | UB if bounds wrong |
| | 20-100x alloc | UB if read before write |
| | Correctness | Prefer safe alternatives |
| repr(transparent) | Zero-cost | Required for FFI newtypes |
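A sketch of the bounds-check trade-off (sum_pairs is an illustrative name); reach for the unsafe version only after the profile and the assembly both justify it:

```rust
// Safe and usually just as fast: iterators let LLVM elide bounds checks.
fn sum(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Only after profiling: get_unchecked with the invariant written down.
fn sum_pairs(xs: &[f32]) -> f32 {
    let mut acc = 0.0;
    for i in (0..xs.len().saturating_sub(1)).step_by(2) {
        // SAFETY: the loop bound guarantees i + 1 < xs.len().
        acc += unsafe { xs.get_unchecked(i) + xs.get_unchecked(i + 1) };
    }
    acc
}
```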
Decision Trees
When to use with_capacity?
```text
Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine
```
Mutex vs RwLock vs Atomics?
```text
Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)
```
Consider: parking_lot > std for all of these
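A minimal sketch of the read-heavy branch with std's RwLock (Settings and its field are illustrative; parking_lot's RwLock has the same shape minus the unwrap):

```rust
use std::sync::RwLock;

struct Settings {
    max_conns: usize,
}

// Read-mostly data: many readers proceed in parallel, writers are rare and brief.
static SETTINGS: RwLock<Settings> = RwLock::new(Settings { max_conns: 64 });

fn handle_request() -> usize {
    SETTINGS.read().unwrap().max_conns // shared read lock
}

fn reload(max_conns: usize) {
    SETTINGS.write().unwrap().max_conns = max_conns; // exclusive, kept short
}
```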
When is unsafe get_unchecked worth it?
```text
Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
```
Reading Rules
Each rule file contains:
- Quantified impact with real benchmark numbers
- Visual explanations of how the optimization works
- Incorrect examples showing common mistakes
- Correct examples with best practices
- When NOT to apply - trade-offs and edge cases
- Common mistakes to avoid
- Profiling commands to identify the issue
- References to official docs
Full Compiled Document
All of the rules above are also available compiled into a single reference document.