simd-intrinsics
SIMD Intrinsics
Purpose
Guide agents through SIMD: reading auto-vectorization output, writing SSE2/AVX2/NEON intrinsics, runtime CPU feature detection, and choosing between compiler auto-vectorization and manual intrinsics.
Triggers
- "How do I check if my loop is being auto-vectorized?"
- "How do I write SSE2/AVX2 intrinsics?"
- "Auto-vectorization failed — how do I fix it?"
- "How do I check for CPU features at runtime?"
- "Should I use intrinsics or let the compiler vectorize?"
- "How do I write NEON intrinsics for ARM?"
Workflow
1. Check auto-vectorization
```bash
# GCC: show vectorization info
gcc -O2 -march=native -fopt-info-vec src/hot.c -o hot

# Verbose: show missed + successful
gcc -O2 -march=native -fopt-info-vec-missed -fopt-info-vec-optimized src/hot.c

# Clang: vectorization remarks
clang -O2 -march=native \
  -Rpass=loop-vectorize \
  -Rpass-missed=loop-vectorize \
  -Rpass-analysis=loop-vectorize \
  src/hot.c -o hot
```
Example missed message:

```
hot.c:15:5: remark: loop not vectorized: value that could not be identified as
reduction is used outside the loop [-Rpass-missed=loop-vectorize]
```
Common auto-vectorization blockers:
| Blocker | Fix |
|---------|-----|
| Loop-carried dependency | Restructure to remove dependency |
| Data-dependent exit (early return) | Move exit after loop |
| Non-contiguous memory | Use gather/scatter or restructure |
| Aliasing (pointer may alias) | Add `__restrict__` or `restrict` |
| Unknown trip count | Add `__builtin_expect` or hint |
| Function call in loop body | Inline the function |
```c
// Help the compiler by adding restrict
void add_arrays(float * __restrict__ dst,
                const float * __restrict__ a,
                const float * __restrict__ b,
                size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i]; // Now vectorizable
}
```
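The "data-dependent exit" row deserves its own example: splitting the loop so the `break` happens in a separate scan leaves a reduction with a known trip count. A minimal sketch (function names are hypothetical; float reductions additionally need `-ffast-math` or `-fassociative-math` to vectorize):

```c
#include <stddef.h>

// Blocked: the data-dependent break prevents vectorization.
float sum_until_negative_v1(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (a[i] < 0.0f) break; // early exit inside the loop body
        s += a[i];
    }
    return s;
}

// Restructured: a scan finds the bound first, so the summation
// loop has a known trip count and no early exit.
float sum_until_negative_v2(const float *a, size_t n) {
    size_t limit = n;
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0f) { limit = i; break; }
    float s = 0.0f;
    for (size_t i = 0; i < limit; i++) // vectorizable reduction
        s += a[i];
    return s;
}
```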
2. Runtime CPU feature detection
```c
// Linux: use __builtin_cpu_supports (GCC/Clang)
if (__builtin_cpu_supports("avx2")) {
    process_avx2(data, len);
} else if (__builtin_cpu_supports("sse4.2")) {
    process_sse42(data, len);
} else {
    process_scalar(data, len);
}

// Check specific features:
__builtin_cpu_supports("sse2")
__builtin_cpu_supports("sse4.1")
__builtin_cpu_supports("sse4.2")
__builtin_cpu_supports("avx")
__builtin_cpu_supports("avx2")
__builtin_cpu_supports("avx512f")
__builtin_cpu_supports("bmi")
__builtin_cpu_supports("bmi2")
__builtin_cpu_supports("fma")
```
```c
// Portable: use CPUID directly (x86, GCC/Clang)
#include <cpuid.h>

static int has_avx2(void) {
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(7, 0, eax, ebx, ecx, edx); // CPUID leaf 7, subleaf 0
    return (ebx >> 5) & 1;                   // EBX bit 5 = AVX2
}
```
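Checking features on every call is wasteful; a common pattern (a sketch with hypothetical kernel names, both kernels being trivial stand-ins here) is to resolve the best variant once and cache it in a function pointer. GCC/Clang also offer `__attribute__((ifunc(...)))` to do this resolution at load time.

```c
#include <stddef.h>

// Hypothetical kernels: stand-ins for real SIMD variants.
static void process_avx2_impl(float *d, size_t n)   { for (size_t i = 0; i < n; i++) d[i] *= 2.0f; }
static void process_scalar_impl(float *d, size_t n) { for (size_t i = 0; i < n; i++) d[i] *= 2.0f; }

typedef void (*process_fn)(float *, size_t);

// Pick the best implementation for the running CPU, once.
static process_fn resolve_process(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))
        return process_avx2_impl;
#endif
    return process_scalar_impl;
}

void process(float *d, size_t n) {
    static process_fn fn = NULL; // cached after the first call
    if (!fn) fn = resolve_process();
    fn(d, n);
}
```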
3. SSE2 / SSE4.2 intrinsics (x86)
```c
#include <immintrin.h> // All x86 intrinsics

// SSE2: 128-bit vectors
// __m128  = 4 floats
// __m128d = 2 doubles
// __m128i = integers (8x16, 4x32, 2x64, 16x8)
void sum_floats_sse2(float *dst, const float *a, const float *b, int n) {
    int i = 0;
    for (; i <= n - 4; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); // unaligned load
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(dst + i, vc);      // unaligned store
    }
    // Handle remainder
    for (; i < n; i++) dst[i] = a[i] + b[i];
}
```
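The loop above is purely element-wise; reductions like an array sum or dot product also need a horizontal step to collapse the four lanes at the end. A sketch assuming SSE2 only (baseline on x86-64):

```c
#include <immintrin.h>
#include <stddef.h>

float sum_sse2(const float *a, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) // 4-wide partial sums
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    // Horizontal reduction: fold lanes 2,3 onto 0,1, then lane 1 onto 0.
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 0x55));
    float s = _mm_cvtss_f32(acc);
    for (; i < n; i++) s += a[i]; // scalar tail
    return s;
}
```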
4. AVX2 intrinsics (x86)
```c
#ifdef __AVX2__
#include <immintrin.h>

// __m256 = 8 floats, __m256d = 4 doubles, __m256i = integers
void sum_floats_avx2(float *dst, const float *a, const float *b, int n) {
    int i = 0;
    for (; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(dst + i, vc);
    }
    // SSE2 tail (4 elements)
    for (; i <= n - 4; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    // Scalar tail
    for (; i < n; i++) dst[i] = a[i] + b[i];
}

// Fused multiply-add (FMA): one instruction for a*b + c
void fma_avx2(float *dst, const float *a, const float *b, const float *c, int n) {
    for (int i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(va, vb, vc)); // dst = a*b + c
    }
    // (tail handling omitted for brevity)
}
#endif
```

Compile with:
```bash
gcc -O2 -mavx2 -mfma src/simd.c
```
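Tying steps 2 and 4 together, a guarded AVX2 reduction can be sketched as follows. The GCC/Clang `target("avx2")` function attribute lets the AVX2 path live in a file compiled without `-mavx2`; function names are illustrative:

```c
#include <immintrin.h>
#include <stddef.h>

// AVX2 path: 8-wide partial sums, then a 128-bit fold.
__attribute__((target("avx2")))
static float sum_avx2(const float *a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    // Add the high 128 bits onto the low 128, then reduce 4 lanes.
    __m128 s4 = _mm_add_ps(_mm256_castps256_ps128(acc),
                           _mm256_extractf128_ps(acc, 1));
    s4 = _mm_add_ps(s4, _mm_movehl_ps(s4, s4));
    s4 = _mm_add_ss(s4, _mm_shuffle_ps(s4, s4, 0x55));
    float s = _mm_cvtss_f32(s4);
    for (; i < n; i++) s += a[i]; // scalar tail
    return s;
}

static float sum_scalar(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

float array_sum(const float *a, size_t n) {
    if (__builtin_cpu_supports("avx2")) // runtime check from step 2
        return sum_avx2(a, n);
    return sum_scalar(a, n);
}
```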
5. NEON intrinsics (ARM/AArch64)
```c
#include <arm_neon.h>

// float32x4_t = 4 floats (128-bit)
// uint8x16_t  = 16 bytes
// int32x4_t   = 4 int32
// (ARM SVE uses scalable types such as svfloat32_t instead of wider fixed vectors)
void sum_floats_neon(float *dst, const float *a, const float *b, int n) {
    int i = 0;
    for (; i <= n - 4; i += 4) {
        float32x4_t va = vld1q_f32(a + i);  // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vaddq_f32(va, vb); // add
        vst1q_f32(dst + i, vc);             // store 4 floats
    }
    for (; i < n; i++) dst[i] = a[i] + b[i];
}

// AArch64 FMA
void fma_neon(float *dst, const float *a, const float *b, const float *c, int n) {
    for (int i = 0; i <= n - 4; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        vst1q_f32(dst + i, vfmaq_f32(vc, va, vb)); // vc + va*vb
    }
}
```

Compile with:
```bash
gcc -O2 -march=armv8-a+simd src/simd.c
```
6. Choose auto-vectorization vs intrinsics
```text
Can the compiler auto-vectorize?
→ Try first: add __restrict__, remove complex control flow, align data
→ Check with -fopt-info-vec or -Rpass=loop-vectorize
→ If vectorized: verify correctness and performance

Still need intrinsics?
→ Prefer compiler builtins: __builtin_popcount, __builtin_ctz
→ Use SIMD intrinsics for: hand-tuned shuffles, gather/scatter, horizontal ops
→ Avoid intrinsics for: simple element-wise ops (let the compiler do it)
```
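For instance, the bit builtins cover popcount and find-first-bit portably: the compiler emits POPCNT/TZCNT when the target allows and a fallback otherwise. A minimal sketch (the zero guard matters because `__builtin_ctzll(0)` is undefined):

```c
#include <stdint.h>

static inline int popcount64(uint64_t x)       { return __builtin_popcountll(x); }
static inline int trailing_zeros64(uint64_t x) { return x ? __builtin_ctzll(x) : 64; }
```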
7. Alignment and performance
```c
// Aligned allocation (required for _mm256_load_ps, optional for _mm256_loadu_ps)
float *buf = (float *)aligned_alloc(32, n * sizeof(float));
// 32-byte alignment for AVX2, 64 for AVX-512

// Hint alignment to the compiler (GCC/Clang)
float *abuf = (float *)__builtin_assume_aligned(buf, 32);

// Use aligned loads when data is aligned
__m256 v1 = _mm256_load_ps(aligned_ptr);    // requires 32-byte alignment
__m256 v2 = _mm256_loadu_ps(unaligned_ptr); // any alignment, slightly slower on old CPUs
```

For Intel Intrinsics Guide reference and NEON lookup tables, see references/intel-intrinsics-guide.md.
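One portability caveat for `aligned_alloc` above: C11 requires the requested size to be an integral multiple of the alignment, and `n * sizeof(float)` usually isn't. A small wrapper (a sketch; alignment must be a power of two) rounds up:

```c
#include <stdlib.h>
#include <stdint.h>

// Round the byte count up to a multiple of the (power-of-two) alignment,
// as C11 aligned_alloc requires.
static void *simd_alloc(size_t alignment, size_t bytes) {
    size_t rounded = (bytes + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
}
```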
Related skills
- skills/compilers/gcc - Use for `-march`, `-msse4.2`, `-mavx2` flags
- skills/compilers/clang - Use for vectorization remarks and auto-vectorization control
- skills/profilers/linux-perf - Use to measure SIMD impact with perf stat counters
- skills/low-level-programming/assembly-x86 - Use for reading SIMD assembly output