assembly-arm

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

ARM / AArch64 Assembly

ARM / AArch64 汇编

Purpose

用途

Guide agents through AArch64 (64-bit) and ARM (32-bit Thumb) assembly: registers, calling conventions, inline asm, and NEON/SVE SIMD patterns.
指导Agent掌握AArch64(64位)和ARM(32位Thumb)汇编:寄存器、调用约定、内联汇编以及NEON/SVE SIMD模式。

Triggers

触发场景

  • "How do I read ARM64 assembly output?"
  • "What are the AArch64 registers and calling convention?"
  • "How do I write inline asm for ARM?"
  • "What is the difference between AArch64 and ARM Thumb?"
  • "How do I use NEON intrinsics?"
  • "如何读取ARM64汇编输出?"
  • "AArch64寄存器和调用约定是什么?"
  • "如何为ARM编写内联汇编?"
  • "AArch64和ARM Thumb有什么区别?"
  • "如何使用NEON intrinsics?"

Workflow

工作流程

1. Generate ARM assembly

1. 生成ARM汇编

bash
undefined
bash
undefined

AArch64 (native or cross-compile)

AArch64(原生或交叉编译)

aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s
aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s

32-bit ARM Thumb

32位ARM Thumb

arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s
arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s

From objdump

从objdump生成

aarch64-linux-gnu-objdump -d -S prog
aarch64-linux-gnu-objdump -d -S prog

From GDB on target

在目标设备上通过GDB生成

(gdb) disassemble /s main
undefined
(gdb) disassemble /s main
undefined

2. AArch64 registers (AAPCS64)

2. AArch64寄存器(AAPCS64)

RegisterAliasRole
x0
x7
Arguments 1–8 and return values
x8
xr
Indirect result location (struct return)
x9
x15
Caller-saved temporaries
x16
x17
ip0
,
ip1
Intra-procedure-call temporaries (used by linker)
x18
pr
Platform register (reserved on some OS)
x19
x28
Callee-saved
x29
fp
Frame pointer (callee-saved)
x30
lr
Link register (return address)
sp
Stack pointer (must be 16-byte aligned at call)
pc
Program counter (not directly accessible)
xzr
wzr
Zero register (reads as 0, writes discarded)
v0
v7
q0
q7
FP/SIMD args and return
v8
v15
Callee-saved SIMD (lower 64 bits only)
v16
v31
Caller-saved temporaries
Width variants:
x0
(64-bit),
w0
(32-bit, zero-extends to 64),
h0
(16),
b0
(8).
寄存器别名作用
x0
x7
参数1-8及返回值
x8
xr
间接结果位置(结构体返回)
x9
x15
调用者保存的临时寄存器
x16
x17
ip0
,
ip1
过程内调用临时寄存器(链接器使用)
x18
pr
平台寄存器(部分系统中保留)
x19
x28
被调用者保存的寄存器
x29
fp
帧指针(被调用者保存)
x30
lr
链接寄存器(返回地址)
sp
栈指针(调用时必须16字节对齐)
pc
程序计数器(无法直接访问)
xzr
wzr
零寄存器(读取为0,写入会被丢弃)
v0
v7
q0
q7
浮点/SIMD参数及返回值
v8
v15
被调用者保存的SIMD寄存器(仅低64位)
v16
v31
调用者保存的临时SIMD寄存器
宽度变体:
x0
(64位)、
w0
(32位,零扩展至64位)、
h0
(16位)、
b0
(8位)。

3. AAPCS64 calling convention

3. AAPCS64调用约定

Integer/pointer args:
x0
x7
Float/SIMD args:
v0
v7
Return:
x0
(int),
x0
+
x1
(128-bit),
v0
(float/SIMD) Callee-saved:
x19
x28
,
x29
(fp),
x30
(lr),
v8
v15
(lower 64 bits) Caller-saved: everything else
Stack must be 16-byte aligned at any
bl
or
blr
instruction.
整数/指针参数:
x0
x7
浮点/SIMD参数:
v0
v7
返回值:
x0
(整数)、
x0
+
x1
(128位)、
v0
(浮点/SIMD) 被调用者保存的寄存器:
x19
x28
x29
(fp)、
x30
(lr)、
v8
v15
(低64位) 调用者保存的寄存器: 其余所有寄存器
在任何
bl
blr
指令处,栈必须保持16字节对齐。

4. Common AArch64 instructions

4. 常见AArch64指令

InstructionEffect
mov x0, x1
Copy register
mov x0, #42
Load immediate
movz x0, #0x1234, lsl #16
Move zero-extended with shift
movk x0, #0xabcd
Move with keep (partial update)
ldr x0, [x1]
Load 64-bit from address in x1
ldr x0, [x1, #8]
Load from x1+8
str x0, [x1, #8]
Store x0 to x1+8
ldp x0, x1, [sp, #16]
Load pair (two regs at once)
stp x29, x30, [sp, #-16]!
Store pair, pre-decrement sp
add x0, x1, x2
x0 = x1 + x2
add x0, x1, #8
x0 = x1 + 8
sub x0, x1, x2
x0 = x1 - x2
mul x0, x1, x2
x0 = x1 * x2
sdiv x0, x1, x2
Signed divide
udiv x0, x1, x2
Unsigned divide
cmp x0, x1
Set flags for x0 - x1
cbz x0, label
Branch if x0 == 0
cbnz x0, label
Branch if x0 != 0
bl func
Branch with link (call)
blr x0
Branch with link to address in x0
ret
Return (branch to x30)
ret x0
Return to address in x0
adrp x0, symbol
PC-relative page address
add x0, x0, :lo12:symbol
Low 12 bits of symbol offset
指令作用
mov x0, x1
复制寄存器
mov x0, #42
加载立即数
movz x0, #0x1234, lsl #16
带移位的零扩展移动
movk x0, #0xabcd
保留部分内容的移动(部分更新)
ldr x0, [x1]
从x1指向的地址加载64位数据
ldr x0, [x1, #8]
从x1+8的地址加载数据
str x0, [x1, #8]
将x0的值存储到x1+8的地址
ldp x0, x1, [sp, #16]
成对加载(同时加载两个寄存器)
stp x29, x30, [sp, #-16]!
成对存储,预递减栈指针
add x0, x1, x2
x0 = x1 + x2
add x0, x1, #8
x0 = x1 + 8
sub x0, x1, x2
x0 = x1 - x2
mul x0, x1, x2
x0 = x1 * x2
sdiv x0, x1, x2
有符号除法
udiv x0, x1, x2
无符号除法
cmp x0, x1
根据x0 - x1设置标志位
cbz x0, label
若x0 == 0则分支跳转
cbnz x0, label
若x0 != 0则分支跳转
bl func
带链接的分支(函数调用)
blr x0
跳转到x0指向的地址并保存返回链接
ret
返回(跳转到x30)
ret x0
跳转到x0指向的地址返回
adrp x0, symbol
PC相对页面地址加载
add x0, x0, :lo12:symbol
符号偏移的低12位

5. Typical function prologue/epilogue

5. 典型函数序言/尾声

asm
// Non-leaf function
stp  x29, x30, [sp, #-32]!   // save fp, lr; allocate 32 bytes
mov  x29, sp                  // set frame pointer
stp  x19, x20, [sp, #16]     // save callee-saved registers
// ... body ...
ldp  x19, x20, [sp, #16]     // restore
ldp  x29, x30, [sp], #32     // restore fp, lr; deallocate
ret

// Leaf function (no calls, no callee-saved regs needed)
// Can use red zone (no rsp adjustment) — but AArch64 has no red zone
sub  sp, sp, #16             // allocate locals
// ... body ...
add  sp, sp, #16
ret
asm
// 非叶子函数
stp  x29, x30, [sp, #-32]!   // 保存fp、lr;分配32字节栈空间
mov  x29, sp                  // 设置帧指针
stp  x19, x20, [sp, #16]     // 保存被调用者保存的寄存器
// ... 函数体 ...
ldp  x19, x20, [sp, #16]     // 恢复寄存器
ldp  x29, x30, [sp], #32     // 恢复fp、lr;释放栈空间
ret

// 叶子函数(无函数调用,无需保存被调用者寄存器)
// 可使用红区(无需调整rsp)——但AArch64无红区
sub  sp, sp, #16             // 分配局部变量空间
// ... 函数体 ...
add  sp, sp, #16
ret

6. Inline assembly (GCC/Clang)

6. 内联汇编(GCC/Clang)

c
// Barrier
__asm__ volatile ("dmb ish" ::: "memory");

// Load acquire
static inline int load_acquire(volatile int *p) {
    int val;
    __asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
    return val;
}

// Store release
static inline void store_release(volatile int *p, int val) {
    __asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}

// Read system counter
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    __asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}
AArch64-specific constraints:
  • "Q"
    — memory operand suitable for exclusive/acquire/release instructions
  • "r"
    — any general-purpose register
  • "w"
    — any FP/SIMD register
c
// 内存屏障
__asm__ volatile ("dmb ish" ::: "memory");

// 加载获取操作
static inline int load_acquire(volatile int *p) {
    int val;
    __asm__ volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p));
    return val;
}

// 存储释放操作
static inline void store_release(volatile int *p, int val) {
    __asm__ volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val));
}

// 读取系统计数器
static inline uint64_t read_cntvct(void) {
    uint64_t val;
    __asm__ volatile ("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}
AArch64特定约束:
  • "Q"
    — 适用于独占/获取/释放指令的内存操作数
  • "r"
    — 任意通用寄存器
  • "w"
    — 任意浮点/SIMD寄存器

7. NEON SIMD intrinsics

7. NEON SIMD 内在函数

c
#include <arm_neon.h>

// Add 4 floats at once
float32x4_t a = vld1q_f32(arr_a);   // load 4 floats
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);

// Horizontal sum
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);
Naming convention:
v<op><q>_<type>
  • q
    suffix: 128-bit (quad) vector
  • _f32
    : float32,
    _s32
    : int32,
    _u8
    : uint8, etc.
For a register reference, see references/reference.md.
c
#include <arm_neon.h>

// 同时对4个浮点数求和
float32x4_t a = vld1q_f32(arr_a);   // 加载4个浮点数
float32x4_t b = vld1q_f32(arr_b);
float32x4_t c = vaddq_f32(a, b);
vst1q_f32(result, c);

// 水平求和
float32x4_t sum = vpaddq_f32(c, c);
sum = vpaddq_f32(sum, sum);
float total = vgetq_lane_f32(sum, 0);
命名约定:
v<op><q>_<type>
  • q
    后缀:128位(四元)向量
  • _f32
    :32位浮点数,
    _s32
    :32位有符号整数,
    _u8
    :8位无符号整数等
如需寄存器参考,请查看 references/reference.md

Related skills

相关技能

  • Use
    skills/low-level-programming/assembly-x86
    for x86-64 assembly
  • Use
    skills/compilers/cross-gcc
    for cross-compilation toolchain
  • Use
    skills/debuggers/gdb
    for debugging ARM code with gdbserver
  • 若需x86-64汇编技能,请使用
    skills/low-level-programming/assembly-x86
  • 若需交叉编译工具链技能,请使用
    skills/compilers/cross-gcc
  • 若需使用gdbserver调试ARM代码,请使用
    skills/debuggers/gdb