assembly-arm

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

ARM / AArch64 Assembly

ARM / AArch64 汇编

Purpose

用途

Guide agents through AArch64 (64-bit) and ARM (32-bit Thumb) assembly: registers, calling conventions, inline asm, and NEON/SVE SIMD patterns.

指导Agent掌握AArch64（64位）和ARM（32位Thumb）汇编：寄存器、调用约定、内联汇编以及NEON/SVE SIMD模式。

Triggers

触发场景

"How do I read ARM64 assembly output?"
"What are the AArch64 registers and calling convention?"
"How do I write inline asm for ARM?"
"What is the difference between AArch64 and ARM Thumb?"
"How do I use NEON intrinsics?"

"如何读取ARM64汇编输出？"
"AArch64寄存器和调用约定是什么？"
"如何为ARM编写内联汇编？"
"AArch64和ARM Thumb有什么区别？"
"如何使用NEON intrinsics？"

Workflow

工作流程

1. Generate ARM assembly

1. 生成ARM汇编

bash

undefined

bash

undefined

AArch64 (native or cross-compile)

AArch64（原生或交叉编译）

aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s

32-bit ARM Thumb

32位ARM Thumb

arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s

From objdump

从objdump生成

aarch64-linux-gnu-objdump -d -S prog

From GDB on target

在目标设备上通过GDB生成

(gdb) disassemble /s main

undefined

(gdb) disassemble /s main

undefined

2. AArch64 registers (AAPCS64)

2. AArch64寄存器（AAPCS64）

Register	Alias	Role
`x0` – `x7`	—	Arguments 1–8 and return values
`x8`	`xr`	Indirect result location (struct return)
`x9` – `x15`	—	Caller-saved temporaries
`x16` – `x17`	`ip0` , `ip1`	Intra-procedure-call temporaries (used by linker)
`x18`	`pr`	Platform register (reserved on some OS)
`x19` – `x28`	—	Callee-saved
`x29`	`fp`	Frame pointer (callee-saved)
`x30`	`lr`	Link register (return address)
`sp`	—	Stack pointer (must be 16-byte aligned at call)
`pc`	—	Program counter (not directly accessible)
`xzr`	`wzr`	Zero register (reads as 0, writes discarded)
`v0` – `v7`	`q0` – `q7`	FP/SIMD args and return
`v8` – `v15`	—	Callee-saved SIMD (lower 64 bits only)
`v16` – `v31`	—	Caller-saved temporaries

Width variants:

x0

(64-bit),

w0

(32-bit, zero-extends to 64),

h0

(16),

b0

(8).

寄存器	别名	作用
`x0` – `x7`	—	参数1-8及返回值
`x8`	`xr`	间接结果位置（结构体返回）
`x9` – `x15`	—	调用者保存的临时寄存器
`x16` – `x17`	`ip0` , `ip1`	过程内调用临时寄存器（链接器使用）
`x18`	`pr`	平台寄存器（部分系统中保留）
`x19` – `x28`	—	被调用者保存的寄存器
`x29`	`fp`	帧指针（被调用者保存）
`x30`	`lr`	链接寄存器（返回地址）
`sp`	—	栈指针（调用时必须16字节对齐）
`pc`	—	程序计数器（无法直接访问）
`xzr`	`wzr`	零寄存器（读取为0，写入会被丢弃）
`v0` – `v7`	`q0` – `q7`	浮点/SIMD参数及返回值
`v8` – `v15`	—	被调用者保存的SIMD寄存器（仅低64位）
`v16` – `v31`	—	调用者保存的临时SIMD寄存器

宽度变体：

x0

（64位）、

w0

（32位，零扩展至64位）、

h0

（16位）、

b0

（8位）。

3. AAPCS64 calling convention

3. AAPCS64调用约定

Integer/pointer args:

x0

–

x7

Float/SIMD args:

v0

–

v7

Return:

x0

(int),

x0

x1

(128-bit),

v0

(float/SIMD) Callee-saved:

x19

–

x28

x29

(fp),

x30

(lr),

v8

–

v15

(lower 64 bits) Caller-saved: everything else

Stack must be 16-byte aligned at any

bl

blr

instruction.

整数/指针参数：

x0

–

x7

浮点/SIMD参数：

v0

–

v7

返回值：

x0

（整数）、

x0

x1

（128位）、

v0

（浮点/SIMD） 被调用者保存的寄存器：

x19

–

x28

、

x29

（fp）、

x30

（lr）、

v8

–

v15

（低64位） 调用者保存的寄存器： 其余所有寄存器

在任何

bl

或

blr

指令处，栈必须保持16字节对齐。

4. Common AArch64 instructions

4. 常见AArch64指令

Instruction	Effect
`mov x0, x1`	Copy register
`mov x0, #42`	Load immediate
`movz x0, #0x1234, lsl #16`	Move zero-extended with shift
`movk x0, #0xabcd`	Move with keep (partial update)
`ldr x0, [x1]`	Load 64-bit from address in x1
`ldr x0, [x1, #8]`	Load from x1+8
`str x0, [x1, #8]`	Store x0 to x1+8
`ldp x0, x1, [sp, #16]`	Load pair (two regs at once)
`stp x29, x30, [sp, #-16]!`	Store pair, pre-decrement sp
`add x0, x1, x2`	x0 = x1 + x2
`add x0, x1, #8`	x0 = x1 + 8
`sub x0, x1, x2`	x0 = x1 - x2
`mul x0, x1, x2`	x0 = x1 * x2
`sdiv x0, x1, x2`	Signed divide
`udiv x0, x1, x2`	Unsigned divide
`cmp x0, x1`	Set flags for x0 - x1
`cbz x0, label`	Branch if x0 == 0
`cbnz x0, label`	Branch if x0 != 0
`bl func`	Branch with link (call)
`blr x0`	Branch with link to address in x0
`ret`	Return (branch to x30)
`ret x0`	Return to address in x0
`adrp x0, symbol`	PC-relative page address
`add x0, x0, :lo12:symbol`	Low 12 bits of symbol offset

指令	作用
`mov x0, x1`	复制寄存器
`mov x0, #42`	加载立即数
`movz x0, #0x1234, lsl #16`	带移位的零扩展移动
`movk x0, #0xabcd`	保留部分内容的移动（部分更新）
`ldr x0, [x1]`	从x1指向的地址加载64位数据
`ldr x0, [x1, #8]`	从x1+8的地址加载数据
`str x0, [x1, #8]`	将x0的值存储到x1+8的地址
`ldp x0, x1, [sp, #16]`	成对加载（同时加载两个寄存器）
`stp x29, x30, [sp, #-16]!`	成对存储，预递减栈指针
`add x0, x1, x2`	x0 = x1 + x2
`add x0, x1, #8`	x0 = x1 + 8
`sub x0, x1, x2`	x0 = x1 - x2
`mul x0, x1, x2`	x0 = x1 * x2
`sdiv x0, x1, x2`	有符号除法
`udiv x0, x1, x2`	无符号除法
`cmp x0, x1`	根据x0 - x1设置标志位
`cbz x0, label`	若x0 == 0则分支跳转
`cbnz x0, label`	若x0 != 0则分支跳转
`bl func`	带链接的分支（函数调用）
`blr x0`	跳转到x0指向的地址并保存返回链接
`ret`	返回（跳转到x30）
`ret x0`	跳转到x0指向的地址返回
`adrp x0, symbol`	PC相对页面地址加载
`add x0, x0, :lo12:symbol`	符号偏移的低12位

5. Typical function prologue/epilogue

5. 典型函数序言/尾声

asm

// Non-leaf function
stp  x29, x30, [sp, #-32]!   // save fp, lr; allocate 32 bytes
mov  x29, sp                  // set frame pointer
stp  x19, x20, [sp, #16]     // save callee-saved registers
// ... body ...
ldp  x19, x20, [sp, #16]     // restore
ldp  x29, x30, [sp], #32     // restore fp, lr; deallocate
ret

// Leaf function (no calls, no callee-saved regs needed)
// Can use red zone (no rsp adjustment) — but AArch64 has no red zone
sub  sp, sp, #16             // allocate locals
// ... body ...
add  sp, sp, #16
ret

asm

// 非叶子函数
stp  x29, x30, [sp, #-32]!   // 保存fp、lr；分配32字节栈空间
mov  x29, sp                  // 设置帧指针
stp  x19, x20, [sp, #16]     // 保存被调用者保存的寄存器
// ... 函数体 ...
ldp  x19, x20, [sp, #16]     // 恢复寄存器
ldp  x29, x30, [sp], #32     // 恢复fp、lr；释放栈空间
ret

// 叶子函数（无函数调用，无需保存被调用者寄存器）
// 可使用红区（无需调整rsp）——但AArch64无红区
sub  sp, sp, #16             // 分配局部变量空间
// ... 函数体 ...
add  sp, sp, #16
ret

6. Inline assembly (GCC/Clang)

6. 内联汇编（GCC/Clang）

// Barrier
__asm__ volatile ("dmb ish" ::: "memory");

// Load acquire
static inline int load_acquire(volatile int *p) {
    int val;
    __asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
    return val;
}

// Store release
static inline void store_release(volatile int *p, int val) {
    __asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}

// Read system counter
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    __asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

AArch64-specific constraints:

```
"Q"
```
— memory operand suitable for exclusive/acquire/release instructions
```
"r"
```
— any general-purpose register
```
"w"
```
— any FP/SIMD register

// 内存屏障
__asm__ volatile ("dmb ish" ::: "memory");

// 加载获取操作
static inline int load_acquire(volatile int *p) {
    int val;
    __asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
    return val;
}

// 存储释放操作
static inline void store_release(volatile int *p, int val) {
    __asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}

// 读取系统计数器
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    __asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

AArch64特定约束：

```
"Q"
```
— 适用于独占/获取/释放指令的内存操作数
```
"r"
```
— 任意通用寄存器
```
"w"
```
— 任意浮点/SIMD寄存器

7. NEON SIMD intrinsics

7. NEON SIMD 内在函数

#include <arm_neon.h>

// Add 4 floats at once
float32x4_t a = vld1q_f32(arr_a);   // load 4 floats
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);

// Horizontal sum
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);

Naming convention:

v<op><q>_<type>

```
q
```
suffix: 128-bit (quad) vector
```
_f32
```
: float32,
```
_s32
```
: int32,
```
_u8
```
: uint8, etc.

For a register reference, see references/reference.md.

#include <arm_neon.h>

// 同时对4个浮点数求和
float32x4_t a = vld1q_f32(arr_a);   // 加载4个浮点数
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);

// 水平求和
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);

命名约定：

v<op><q>_<type>

```
q
```
后缀：128位（四元）向量
```
_f32
```
：32位浮点数，
```
_s32
```
：32位有符号整数，
```
_u8
```
：8位无符号整数等

如需寄存器参考，请查看 references/reference.md。