cpu-cache-opt
CPU Cache Optimization
Purpose
Guide agents through cache-aware programming: diagnosing cache misses with perf, data layout transformations (AoS→SoA), false sharing detection and fixes, prefetching, and cache-friendly algorithm design.
Triggers
- "My program has high cache miss rates — how do I fix it?"
- "What is false sharing and how do I detect it?"
- "Should I use AoS or SoA data layout?"
- "How do I measure cache performance with perf?"
- "How do I use __builtin_prefetch?"
- "My multithreaded program is slower than single-threaded due to cache"
Workflow
1. Measure cache performance
```bash
# Basic cache counters
perf stat -e cache-references,cache-misses,cycles,instructions ./prog

# L1/L2/L3 miss breakdown
perf stat -e \
  L1-dcache-load-misses,L1-dcache-loads,L2-dcache-load-misses,LLC-load-misses,LLC-loads \
  ./prog

# Cache miss rate = L1-dcache-load-misses / L1-dcache-loads
# > 5% is concerning; > 20% is severe

# False sharing detection (Intel event names; check `perf list` on your CPU)
perf stat -e machine_clears.memory_ordering,mem_load_l3_hit_retired.xsnp_hitm ./prog
```
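As a concrete target for the commands above, here is a minimal, hypothetical workload that makes the miss-rate difference visible: the stride-1 loop uses every byte of each cache line it loads, while the stride-16 loop touches only one `int` per 64-byte line.

```c
#include <stdlib.h>

// Hypothetical perf workload: run each stride variant under `perf stat`
// and compare L1-dcache-load-misses / L1-dcache-loads.
long sum_stride(const int *a, size_t n, size_t stride) {
    long total = 0;
    for (size_t i = 0; i < n; i += stride)
        total += a[i];
    return total;
}

int *make_buffer(size_t n) {
    int *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++)
        a[i] = 1;   // every visited element contributes 1 to the sum
    return a;
}
```

Build with `gcc -O2` and a large `n` (e.g. tens of millions) so the buffer overflows L3 and the counters reflect memory traffic.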
2. Cache line basics
- Cache line size: 64 bytes on x86-64 and most ARM platforms
- L1 cache: 32–64 KB, ~4 cycles latency
- L2 cache: 256 KB–1 MB, ~12 cycles latency
- L3 cache: 6–64 MB, ~40 cycles latency
- Main memory: ~200–300 cycles latency

```c
// Check cache line size (POSIX; needs <unistd.h>)
long cache_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

// Align data to a cache line (C++11 alignas)
struct alignas(64) HotData {
    int counter;
    // ... 60 bytes of data that fit in one line
};

// C equivalent (GCC/Clang attribute)
typedef struct {
    int x;
} __attribute__((aligned(64))) AlignedData;
```
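A short runtime check (a sketch, assuming a 64-byte line) that aligned allocations really land on cache-line boundaries; `aligned_alloc` is C11 and requires the size to be a multiple of the alignment, so the helper rounds up:

```c
#include <stdint.h>
#include <stdlib.h>

// True if p sits on a 64-byte cache-line boundary
int is_line_aligned(const void *p) {
    return ((uintptr_t)p % 64) == 0;
}

// Round size up to a multiple of 64, as aligned_alloc (C11) requires
void *alloc_line_aligned(size_t size) {
    size_t rounded = (size + 63) & ~(size_t)63;
    return aligned_alloc(64, rounded);
}
```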
3. AoS vs SoA data layout
```c
// AoS (Array of Structures) — default layout
struct Particle {
    float x, y, z;      // position (12 bytes)
    float vx, vy, vz;   // velocity (12 bytes)
    float mass;         // (4 bytes)
    int flags;          // (4 bytes)
};
struct Particle particles[N];  // Bad for loops that only need position

// Problem: accessing particles[i].x loads x,y,z,vx,vy,vz,mass,flags
// But we only need x,y,z → 75% of loaded data is wasted

// SoA (Structure of Arrays) — cache-friendly for SIMD + sequential access
struct ParticlesSoA {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
    int *flags;
};
// Accessing x[i] for i=0..N loads 16 consecutive x values per cache line → 0% waste
// Also auto-vectorizes better
```
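Both layouts compute identical results; the win is purely in memory traffic. A minimal sketch (type and function names are illustrative, not from the original):

```c
#include <stddef.h>

// AoS: reading x drags the other 28 bytes of each element into cache
typedef struct { float x, y, z, vx, vy, vz, mass; int flags; } ParticleAoS;

float sum_x_aos(const ParticleAoS *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

// SoA: every byte loaded from the x array is useful
float sum_x_soa(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```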
4. Common cache-unfriendly patterns
```c
// BAD: random access (linked list traversal)
Node *node = head;
while (node) {
    process(node->data);
    node = node->next;  // pointer chasing = cache miss per node
}
// BETTER: pool-allocate nodes contiguously,
// or rewrite as a contiguous array with indices

// BAD: stride > cache line in matrix traversal
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum += matrix[j][i];  // column-major access on row-major array

// GOOD: row-major access
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum += matrix[i][j];

// BAD: large struct with hot + cold fields
struct Record {
    int id;          // hot: accessed every iteration
    char name[128];  // cold: accessed rarely
    int value;       // hot
    char desc[256];  // cold
};

// GOOD: separate hot and cold data
struct RecordHot { int id; int value; };
struct RecordCold { char name[128]; char desc[256]; };
struct RecordHot hot_data[N];
struct RecordCold cold_data[N];
```
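The two matrix loop orders return the same value; only the memory access pattern changes. A self-contained sketch using C99 variably-modified array parameters:

```c
#include <stddef.h>

// Row-major order: the inner index j walks adjacent memory (cache-friendly)
long sum_rows_first(int n, int m, int a[n][m]) {
    long s = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            s += a[i][j];
    return s;
}

// Column-major order over the same array: each step jumps m ints
long sum_cols_first(int n, int m, int a[n][m]) {
    long s = 0;
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            s += a[i][j];
    return s;
}
```

At sizes that exceed L1/L2, the row-major version wins because every cache line it loads is fully consumed before eviction.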
5. False sharing
False sharing occurs when two threads write to different variables that share a cache line, causing constant cache-line invalidations.
```c
// BAD: counters likely on same cache line (4 bytes each, line = 64 bytes)
int counter_a;  // thread A's counter
int counter_b;  // thread B's counter
// Both on the same cache line → every write invalidates the other thread's copy

// GOOD: pad to separate cache lines
struct alignas(64) PaddedCounter {
    int value;
    char padding[60];  // ensure the next counter is on a different cache line
};
PaddedCounter counters[NUM_THREADS];
// Thread i: counters[i].value++

// C++17 standard approach (<new>)
struct alignas(std::hardware_destructive_interference_size) PaddedCounter2 {
    int value;
};
```
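A quick sanity check in C11 (a sketch; the array name is illustrative) that padded counters really occupy distinct cache lines: `alignas(64)` on a member forces the struct's size up to a multiple of 64, so adjacent array elements can never share a line.

```c
#include <stdalign.h>
#include <stdint.h>

// alignas(64) rounds sizeof(PaddedCounter) up to 64
typedef struct {
    alignas(64) int value;
} PaddedCounter;

PaddedCounter counters[4];

// Distance in bytes between two consecutive counters
uintptr_t counter_spacing(void) {
    return (uintptr_t)&counters[1] - (uintptr_t)&counters[0];
}
```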
6. Prefetching
Manual prefetch hints to hide memory latency:
```c
#include <immintrin.h>  // or <xmmintrin.h> for _mm_prefetch

// __builtin_prefetch(addr, rw, locality)
// rw: 0 = read, 1 = write; locality: 0 = non-temporal, 3 = high temporal
__builtin_prefetch(ptr, 0, 3);  // prefetch for read, high locality
__builtin_prefetch(ptr, 1, 3);  // prefetch for write, high locality

// SSE prefetch (x86)
_mm_prefetch((char*)ptr, _MM_HINT_T0);   // L1
_mm_prefetch((char*)ptr, _MM_HINT_T1);   // L2
_mm_prefetch((char*)ptr, _MM_HINT_T2);   // L3
_mm_prefetch((char*)ptr, _MM_HINT_NTA);  // non-temporal (streaming)

// Typical pattern: prefetch N iterations ahead
#define PREFETCH_DIST 8
for (int i = 0; i < N; i++) {
    if (i + PREFETCH_DIST < N)
        __builtin_prefetch(&data[i + PREFETCH_DIST], 0, 3);
    process(data[i]);
}
```
Prefetching rules:
- Prefetch too early = cache evicted before use
- Prefetch too late = no benefit
- Prefetch distance = memory latency / time per iteration (typically 8–32 elements)
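Putting the rules together, a runnable version of the pattern; `__builtin_prefetch` is a GCC/Clang builtin and degrades to a no-op on targets without a prefetch instruction, so correctness never depends on it.

```c
#include <stddef.h>

#define PREFETCH_DIST 8  // roughly memory latency / work per iteration

// Sum with a hint to start loading data[i + 8] while data[i] is processed
long sum_with_prefetch(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&data[i + PREFETCH_DIST], 0, 3);
        total += data[i];
    }
    return total;
}
```

Profile before and after: on data that is already cache-resident, or on CPUs with strong hardware prefetchers, the explicit hint often changes nothing.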
7. Cache-friendly algorithm design
```c
// Loop blocking / tiling for matrix operations:
// process cache-fitting blocks instead of full rows/columns
#define BLOCK 64  // tuned to L1 cache size

void matrix_mult_blocked(float *C, float *A, float *B, int N) {
    for (int i = 0; i < N; i += BLOCK)
        for (int k = 0; k < N; k += BLOCK)
            for (int j = 0; j < N; j += BLOCK)
                // Inner block fits in L1 cache
                for (int ii = i; ii < i + BLOCK && ii < N; ii++)
                    for (int kk = k; kk < k + BLOCK && kk < N; kk++)
                        for (int jj = j; jj < j + BLOCK && jj < N; jj++)
                            C[ii*N+jj] += A[ii*N+kk] * B[kk*N+jj];
}
// Note: assumes C is zero-initialized before the call
```
For perf cache event reference and false sharing detection patterns, see references/cache-counters.md.
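Blocking only reorders the additions, so the result must match a naive triple loop (up to floating-point rounding). A correctness sketch with a tiny hypothetical block size:

```c
#include <string.h>

#define BLK 4  // tiny block for the check; tune to L1 size in real code

void matmul_naive(float *C, const float *A, const float *B, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

void matmul_blocked(float *C, const float *A, const float *B, int n) {
    memset(C, 0, (size_t)n * n * sizeof *C);  // blocked version accumulates
    for (int i = 0; i < n; i += BLK)
        for (int k = 0; k < n; k += BLK)
            for (int j = 0; j < n; j += BLK)
                for (int ii = i; ii < i + BLK && ii < n; ii++)
                    for (int kk = k; kk < k + BLK && kk < n; kk++)
                        for (int jj = j; jj < j + BLK && jj < n; jj++)
                            C[ii*n + jj] += A[ii*n + kk] * B[kk*n + jj];
}
```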
Related skills
- Use skills/profilers/linux-perf for perf stat and perf record cache measurements
- Use skills/profilers/valgrind — cachegrind simulates cache behaviour
- Use skills/low-level-programming/simd-intrinsics — SoA layout pairs with SIMD vectorization
- Use skills/low-level-programming/memory-model for false sharing in concurrent contexts