assembly-arm
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseARM / AArch64 Assembly
ARM / AArch64 汇编
Purpose
用途
Guide agents through AArch64 (64-bit) and ARM (32-bit Thumb) assembly: registers, calling conventions, inline asm, and NEON/SVE SIMD patterns.
指导Agent掌握AArch64(64位)和ARM(32位Thumb)汇编:寄存器、调用约定、内联汇编以及NEON/SVE SIMD模式。
Triggers
触发场景
- "How do I read ARM64 assembly output?"
- "What are the AArch64 registers and calling convention?"
- "How do I write inline asm for ARM?"
- "What is the difference between AArch64 and ARM Thumb?"
- "How do I use NEON intrinsics?"
- "如何读取ARM64汇编输出?"
- "AArch64寄存器和调用约定是什么?"
- "如何为ARM编写内联汇编?"
- "AArch64和ARM Thumb有什么区别?"
- "如何使用NEON intrinsics?"
Workflow
工作流程
1. Generate ARM assembly
1. 生成ARM汇编
bash
undefinedbash
undefinedAArch64 (native or cross-compile)
AArch64(原生或交叉编译)
aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s
aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s
32-bit ARM Thumb
32位ARM Thumb
arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s
arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s
From objdump
从objdump生成
aarch64-linux-gnu-objdump -d -S prog
aarch64-linux-gnu-objdump -d -S prog
From GDB on target
在目标设备上通过GDB生成
(gdb) disassemble /s main
undefined(gdb) disassemble /s main
undefined2. AArch64 registers (AAPCS64)
2. AArch64寄存器(AAPCS64)
| Register | Alias | Role |
|---|---|---|
| — | Arguments 1–8 and return values |
| | Indirect result location (struct return) |
| — | Caller-saved temporaries |
| | Intra-procedure-call temporaries (used by linker) |
| | Platform register (reserved on some OS) |
| — | Callee-saved |
| | Frame pointer (callee-saved) |
| | Link register (return address) |
| — | Stack pointer (must be 16-byte aligned at call) |
| — | Program counter (not directly accessible) |
| | Zero register (reads as 0, writes discarded) |
| | FP/SIMD args and return |
| — | Callee-saved SIMD (lower 64 bits only) |
| — | Caller-saved temporaries |
Width variants: (64-bit), (32-bit, zero-extends to 64), (16), (8).
x0w0h0b0| 寄存器 | 别名 | 作用 |
|---|---|---|
| — | 参数1-8及返回值 |
| | 间接结果位置(结构体返回) |
| — | 调用者保存的临时寄存器 |
| | 过程内调用临时寄存器(链接器使用) |
| | 平台寄存器(部分系统中保留) |
| — | 被调用者保存的寄存器 |
| | 帧指针(被调用者保存) |
| | 链接寄存器(返回地址) |
| — | 栈指针(调用时必须16字节对齐) |
| — | 程序计数器(无法直接访问) |
| | 零寄存器(读取为0,写入会被丢弃) |
| | 浮点/SIMD参数及返回值 |
| — | 被调用者保存的SIMD寄存器(仅低64位) |
| — | 调用者保存的临时SIMD寄存器 |
宽度变体:(64位)、(32位,零扩展至64位)、(16位)、(8位)。
x0w0h0b03. AAPCS64 calling convention
3. AAPCS64调用约定
Integer/pointer args: –
Float/SIMD args: –
Return: (int), + (128-bit), (float/SIMD)
Callee-saved: –, (fp), (lr), – (lower 64 bits)
Caller-saved: everything else
x0x7v0v7x0x0x1v0x19x28x29x30v8v15Stack must be 16-byte aligned at any or instruction.
blblr整数/指针参数: –
浮点/SIMD参数: –
返回值: (整数)、+(128位)、(浮点/SIMD)
被调用者保存的寄存器: –、(fp)、(lr)、–(低64位)
调用者保存的寄存器: 其余所有寄存器
x0x7v0v7x0x0x1v0x19x28x29x30v8v15在任何或指令处,栈必须保持16字节对齐。
blblr4. Common AArch64 instructions
4. 常见AArch64指令
| Instruction | Effect |
|---|---|
| Copy register |
| Load immediate |
| Move zero-extended with shift |
| Move with keep (partial update) |
| Load 64-bit from address in x1 |
| Load from x1+8 |
| Store x0 to x1+8 |
| Load pair (two regs at once) |
| Store pair, pre-decrement sp |
| x0 = x1 + x2 |
| x0 = x1 + 8 |
| x0 = x1 - x2 |
| x0 = x1 * x2 |
| Signed divide |
| Unsigned divide |
| Set flags for x0 - x1 |
| Branch if x0 == 0 |
| Branch if x0 != 0 |
| Branch with link (call) |
| Branch with link to address in x0 |
| Return (branch to x30) |
| Return to address in x0 |
| PC-relative page address |
| Low 12 bits of symbol offset |
| 指令 | 作用 |
|---|---|
| 复制寄存器 |
| 加载立即数 |
| 带移位的零扩展移动 |
| 保留部分内容的移动(部分更新) |
| 从x1指向的地址加载64位数据 |
| 从x1+8的地址加载数据 |
| 将x0的值存储到x1+8的地址 |
| 成对加载(同时加载两个寄存器) |
| 成对存储,预递减栈指针 |
| x0 = x1 + x2 |
| x0 = x1 + 8 |
| x0 = x1 - x2 |
| x0 = x1 * x2 |
| 有符号除法 |
| 无符号除法 |
| 根据x0 - x1设置标志位 |
| 若x0 == 0则分支跳转 |
| 若x0 != 0则分支跳转 |
| 带链接的分支(函数调用) |
| 跳转到x0指向的地址并保存返回链接 |
| 返回(跳转到x30) |
| 跳转到x0指向的地址返回 |
| PC相对页面地址加载 |
| 符号偏移的低12位 |
5. Typical function prologue/epilogue
5. 典型函数序言/尾声
asm
// Non-leaf function
stp x29, x30, [sp, #-32]! // save fp, lr; allocate 32 bytes
mov x29, sp // set frame pointer
stp x19, x20, [sp, #16] // save callee-saved registers
// ... body ...
ldp x19, x20, [sp, #16] // restore
ldp x29, x30, [sp], #32 // restore fp, lr; deallocate
ret
// Leaf function (no calls, no callee-saved regs needed)
// Can use red zone (no rsp adjustment) — but AArch64 has no red zone
sub sp, sp, #16 // allocate locals
// ... body ...
add sp, sp, #16
retasm
// 非叶子函数
stp x29, x30, [sp, #-32]! // 保存fp、lr;分配32字节栈空间
mov x29, sp // 设置帧指针
stp x19, x20, [sp, #16] // 保存被调用者保存的寄存器
// ... 函数体 ...
ldp x19, x20, [sp, #16] // 恢复寄存器
ldp x29, x30, [sp], #32 // 恢复fp、lr;释放栈空间
ret
// 叶子函数(无函数调用,无需保存被调用者寄存器)
// 可使用红区(无需调整rsp)——但AArch64无红区
sub sp, sp, #16 // 分配局部变量空间
// ... 函数体 ...
add sp, sp, #16
ret6. Inline assembly (GCC/Clang)
6. 内联汇编(GCC/Clang)
c
// Barrier
__asm__ volatile ("dmb ish" ::: "memory");
// Load acquire
static inline int load_acquire(volatile int *p) {
int val;
__asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
return val;
}
// Store release
static inline void store_release(volatile int *p, int val) {
__asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}
// Read system counter
static inline uint64_t read_cntvct(void) {
uint64_t val;
__asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
return val;
}AArch64-specific constraints:
- — memory operand suitable for exclusive/acquire/release instructions
"Q" - — any general-purpose register
"r" - — any FP/SIMD register
"w"
c
// 内存屏障
__asm__ volatile ("dmb ish" ::: "memory");
// 加载获取操作
static inline int load_acquire(volatile int *p) {
int val;
__asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
return val;
}
// 存储释放操作
static inline void store_release(volatile int *p, int val) {
__asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}
// 读取系统计数器
static inline uint64_t read_cntvct(void) {
uint64_t val;
__asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
return val;
}AArch64特定约束:
- — 适用于独占/获取/释放指令的内存操作数
"Q" - — 任意通用寄存器
"r" - — 任意浮点/SIMD寄存器
"w"
7. NEON SIMD intrinsics
7. NEON SIMD 内在函数
c
#include <arm_neon.h>
// Add 4 floats at once
float32x4_t a = vld1q_f32(arr_a); // load 4 floats
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);
// Horizontal sum
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);Naming convention:
v<op><q>_<type>- suffix: 128-bit (quad) vector
q - : float32,
_f32: int32,_s32: uint8, etc._u8
For a register reference, see references/reference.md.
c
#include <arm_neon.h>
// 同时对4个浮点数求和
float32x4_t a = vld1q_f32(arr_a); // 加载4个浮点数
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);
// 水平求和
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);命名约定:
v<op><q>_<type>- 后缀:128位(四元)向量
q - :32位浮点数,
_f32:32位有符号整数,_s32:8位无符号整数等_u8
如需寄存器参考,请查看 references/reference.md。
Related skills
相关技能
- Use for x86-64 assembly
skills/low-level-programming/assembly-x86 - Use for cross-compilation toolchain
skills/compilers/cross-gcc - Use for debugging ARM code with gdbserver
skills/debuggers/gdb
- 若需x86-64汇编技能,请使用
skills/low-level-programming/assembly-x86 - 若需交叉编译工具链技能,请使用
skills/compilers/cross-gcc - 若需使用gdbserver调试ARM代码,请使用
skills/debuggers/gdb