interpreters
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseInterpreters and Bytecode VMs
解释器与字节码VM
Purpose
用途
Guide agents through implementing efficient bytecode interpreters and simple JITs in C/C++: dispatch strategies, VM architecture choices, and performance patterns.
指导开发者使用C/C++实现高效的字节码解释器与简单JIT:涵盖分发策略、VM架构选择以及性能优化模式。
Triggers
触发场景
- "How do I implement a fast bytecode dispatch loop?"
- "What is the difference between switch dispatch and computed goto?"
- "How do I implement a register-based vs stack-based VM?"
- "How do I add basic JIT compilation to my interpreter?"
- "Why is my interpreter slow?"
- "如何实现快速的字节码分发循环?"
- "switch分发与计算式goto有什么区别?"
- "如何实现基于寄存器与基于栈的VM?"
- "如何为我的解释器添加基础JIT编译功能?"
- "为什么我的解释器运行缓慢?"
Workflow
工作流程
1. VM architecture choice
1. VM架构选择
| Style | Description | Examples |
|---|---|---|
| Stack-based | Operands on a value stack; compact bytecode | JVM, CPython, WebAssembly |
| Register-based | Operands in virtual registers; fewer instructions | Lua 5+, Dalvik |
| Direct threading | Each "instruction" is a function call | Some Forth implementations |
| Continuation-passing | Interpreter functions return continuations | Academic |
Stack-based: easier to implement, compile to; code generation is simpler. More instructions per expression.
Register-based: fewer dispatch iterations; needs register allocation in the compiler; better cache behaviour for complex expressions.
| 架构类型 | 说明 | 示例 |
|---|---|---|
| 基于栈 | 操作数存储在值栈中;字节码紧凑 | JVM, CPython, WebAssembly |
| 基于寄存器 | 操作数存储在虚拟寄存器中;指令数量更少 | Lua 5+, Dalvik |
| 直接线程化 | 每个"指令"都是一次函数调用 | Some Forth implementations |
| 续体传递 | 解释器函数返回续体 | Academic |
基于栈: 更易于实现与编译;代码生成更简单。每个表达式需要更多指令。
基于寄存器: 分发迭代次数更少;编译器中需要寄存器分配;复杂表达式的缓存表现更优。
2. Dispatch loop strategies
2. 分发循环策略
Switch dispatch (simplest, baseline)
Switch分发(最简单的基准方案)
c
while (1) {
uint8_t op = *ip++;
switch (op) {
case OP_LOAD: push(constants[*ip++]); break;
case OP_ADD: { Value b = pop(); Value a = pop(); push(a + b); } break;
case OP_HALT: return;
// ...
}
}Problem: compiles to a single indirect branch from a jump table. Modern CPUs can mispredict it heavily because the same indirect branch is used for all opcodes.
switchc
while (1) {
uint8_t op = *ip++;
switch (op) {
case OP_LOAD: push(constants[*ip++]); break;
case OP_ADD: { Value b = pop(); Value a = pop(); push(a + b); } break;
case OP_HALT: return;
// ...
}
}问题:会被编译为跳转表中的单次间接分支。现代CPU很容易出现严重的预测错误,因为所有操作码都使用同一个间接分支。
switchComputed goto (GCC/Clang extension — fastest portable approach)
计算式goto(GCC/Clang扩展——最快的可移植方案)
c
// Table of label addresses
static const void *dispatch_table[] = {
[OP_LOAD] = &&op_load,
[OP_ADD] = &&op_add,
[OP_HALT] = &&op_halt,
// ...
};
#define DISPATCH() goto *dispatch_table[*ip++]
DISPATCH(); // start
op_load:
push(constants[*ip++]);
DISPATCH();
op_add: {
Value b = pop(); Value a = pop(); push(a + b);
DISPATCH();
}
op_halt:
return;Each opcode ends with its own indirect branch. The CPU can train the branch predictor per-opcode, dramatically improving prediction rates.
Note: is a GCC/Clang extension, not standard C. Use to guard and fall back to switch for other compilers.
&&label#ifdef __GNUC__c
// Table of label addresses
static const void *dispatch_table[] = {
[OP_LOAD] = &&op_load,
[OP_ADD] = &&op_add,
[OP_HALT] = &&op_halt,
// ...
};
#define DISPATCH() goto *dispatch_table[*ip++]
DISPATCH(); // start
op_load:
push(constants[*ip++]);
DISPATCH();
op_add: {
Value b = pop(); Value a = pop(); push(a + b);
DISPATCH();
}
op_halt:
return;每个操作码都以自身的间接分支结束。CPU可以针对每个操作码训练分支预测器,大幅提升预测准确率。
注意: 是GCC/Clang扩展,不属于标准C。使用进行条件编译,在其他编译器下回退到switch方案。
&&label#ifdef __GNUC__Direct threaded code (most aggressive)
直接线程化代码(最激进的方案)
Each bytecode word is a function pointer or label address; the VM is the fetch-decode-execute loop itself.
c
typedef void (*Handler)(VM *vm);
// Bytecode is an array of handlers
Handler bytecode[] = { op_load, op_push_1, op_add, op_halt };
for (int i = 0; ; i++) {
bytecode[i](vm);
}每个字节码字都是函数指针或标签地址;VM本身就是取指-译码-执行循环。
c
typedef void (*Handler)(VM *vm);
// Bytecode is an array of handlers
Handler bytecode[] = { op_load, op_push_1, op_add, op_halt };
for (int i = 0; ; i++) {
bytecode[i](vm);
}3. Value representation
3. 值表示方式
Tagged pointer: Store type tag in low bits of pointer (pointer alignment guarantees ≥ 2 bits free):
c
typedef uintptr_t Value;
#define TAG_INT 0x0
#define TAG_FLOAT 0x1
#define TAG_PTR 0x2
#define TAG_MASK 0x3
#define INT_VAL(v) ((int64_t)(v) >> 2)
#define FLOAT_VAL(v) (*(float*)((v) & ~TAG_MASK))
#define IS_INT(v) (((v) & TAG_MASK) == TAG_INT)NaN boxing (64-bit): Store non-double values in NaN bit patterns:
c
// IEEE 754 quiet NaN: exponent all 1s, mantissa != 0
// Use high mantissa bits as type tag, low 48 bits as payload
// Allows pointer/int/bool/nil to fit in a double-sized slotUsed by V8 (formerly), LuaJIT, JavaScriptCore.
标记指针: 将类型标记存储在指针的低位(指针对齐保证至少有2位空闲):
c
typedef uintptr_t Value;
#define TAG_INT 0x0
#define TAG_FLOAT 0x1
#define TAG_PTR 0x2
#define TAG_MASK 0x3
#define INT_VAL(v) ((int64_t)(v) >> 2)
#define FLOAT_VAL(v) (*(float*)((v) & ~TAG_MASK))
#define IS_INT(v) (((v) & TAG_MASK) == TAG_INT)NaN装箱(64位): 将非双精度值存储在NaN位模式中:
c
// IEEE 754 quiet NaN: exponent all 1s, mantissa != 0
// Use high mantissa bits as type tag, low 48 bits as payload
// Allows pointer/int/bool/nil to fit in a double-sized slot被V8(旧版本)、LuaJIT、JavaScriptCore采用。
4. Stack management
4. 栈管理
c
#define STACK_SIZE 4096
Value stack[STACK_SIZE];
Value *sp = stack; // stack pointer
#define PUSH(v) (*sp++ = (v))
#define POP() (*--sp)
#define TOP() (sp[-1])
#define PEEK(n) (sp[-(n)-1])
// Check for overflow
#define PUSH_SAFE(v) do { \
if (sp >= stack + STACK_SIZE) { vm_error("stack overflow"); } \
PUSH(v); \
} while(0)c
#define STACK_SIZE 4096
Value stack[STACK_SIZE];
Value *sp = stack; // stack pointer
#define PUSH(v) (*sp++ = (v))
#define POP() (*--sp)
#define TOP() (sp[-1])
#define PEEK(n) (sp[-(n)-1])
// Check for overflow
#define PUSH_SAFE(v) do { \
if (sp >= stack + STACK_SIZE) { vm_error("stack overflow"); } \
PUSH(v); \
} while(0)5. Inline caching (IC)
5. 内联缓存(IC)
Inline caching speeds up property lookups and method dispatch by caching the last observed type at each call site.
c
struct CallSite {
Type cached_type; // Last observed receiver type
void *cached_method; // Cached function pointer
int miss_count; // Number of misses
};
void invoke_method(VM *vm, CallSite *cs, Value receiver, ...) {
Type t = GET_TYPE(receiver);
if (t == cs->cached_type) {
// Cache hit: direct call, no lookup
cs->cached_method(vm, receiver, ...);
} else {
// Cache miss: look up, update cache
void *method = lookup_method(t, name);
cs->cached_type = t;
cs->cached_method = method;
cs->miss_count++;
method(vm, receiver, ...);
}
}Polymorphic IC (PIC): cache up to N (typ. 4) type-method pairs.
内联缓存通过在每个调用点缓存最近观察到的类型,来加速属性查找与方法分发。
c
struct CallSite {
Type cached_type; // Last observed receiver type
void *cached_method; // Cached function pointer
int miss_count; // Number of misses
};
void invoke_method(VM *vm, CallSite *cs, Value receiver, ...) {
Type t = GET_TYPE(receiver);
if (t == cs->cached_type) {
// Cache hit: direct call, no lookup
cs->cached_method(vm, receiver, ...);
} else {
// Cache miss: look up, update cache
void *method = lookup_method(t, name);
cs->cached_type = t;
cs->cached_method = method;
cs->miss_count++;
method(vm, receiver, ...);
}
}多态IC(PIC):最多缓存N个(通常为4个)类型-方法对。
6. Simple JIT (mmap + machine code)
6. 简单JIT(mmap + 机器码)
For x86-64: allocate executable memory, write machine code bytes, call it.
c
#include <sys/mman.h>
#include <string.h>
typedef int (*JitFn)(int a, int b);
JitFn jit_compile_add(void) {
// x86-64: add rdi, rsi; mov rax, rdi; ret
static const uint8_t code[] = {
0x48, 0x01, 0xF7, // add rdi, rsi
0x48, 0x89, 0xF8, // mov rax, rdi
0xC3 // ret
};
void *mem = mmap(NULL, sizeof(code),
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mem == MAP_FAILED) return NULL;
memcpy(mem, code, sizeof(code));
// On some systems (macOS Apple Silicon): must use mprotect
// mprotect(mem, sizeof(code), PROT_READ | PROT_EXEC);
return (JitFn)mem;
}On macOS Apple Silicon (M-series): use or flag.
pthread_jit_write_protect_np()MAP_JIT针对x86-64:分配可执行内存,写入机器码字节,然后调用它。
c
#include <sys/mman.h>
#include <string.h>
typedef int (*JitFn)(int a, int b);
JitFn jit_compile_add(void) {
// x86-64: add rdi, rsi; mov rax, rdi; ret
static const uint8_t code[] = {
0x48, 0x01, 0xF7, // add rdi, rsi
0x48, 0x89, 0xF8, // mov rax, rdi
0xC3 // ret
};
void *mem = mmap(NULL, sizeof(code),
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mem == MAP_FAILED) return NULL;
memcpy(mem, code, sizeof(code));
// On some systems (macOS Apple Silicon): must use mprotect
// mprotect(mem, sizeof(code), PROT_READ | PROT_EXEC);
return (JitFn)mem;
}在macOS Apple Silicon(M系列)上:使用或标志。
pthread_jit_write_protect_np()MAP_JIT7. Performance tips
7. 性能优化技巧
- Dispatch: Use computed goto over switch on GCC/Clang
- Values: Use NaN boxing or tagged pointers; avoid boxing/unboxing in hot paths
- Stack: Keep stack pointer in a callee-saved register (— GCC global register variable)
register Value *sp asm("r15") - Locals access: Keep frequently accessed locals in VM registers (struct fields), not stack
- Profiling: Use or sampling to find dispatch overhead vs actual work
perf - Specialisation: Generate specialised handler variants for common type combinations (int+int add vs generic add)
- Trace recording: Trace JITs (LuaJIT approach) compile hot traces instead of full functions
For a benchmark of dispatch strategies, see references/benchmarks.md.
- 分发: 在GCC/Clang上优先使用计算式goto而非switch
- 值表示: 使用NaN装箱或标记指针;避免在热点路径中进行装箱/拆箱操作
- 栈: 将栈指针保存在被调用者保存的寄存器中(— GCC全局寄存器变量)
register Value *sp asm("r15") - 本地变量访问: 将频繁访问的本地变量保存在VM寄存器(结构体字段)中,而非栈上
- 性能分析: 使用或采样工具找出分发开销与实际工作的占比
perf - 特化: 为常见类型组合生成专用的处理器变体(如int+int加法与通用加法)
- 追踪记录: 追踪式JIT(LuaJIT方案)编译热点追踪而非完整函数
关于分发策略的基准测试,参见references/benchmarks.md。
Related skills
相关技能
- Use to profile the interpreter dispatch loop
skills/profilers/linux-perf - Use to understand JIT output
skills/low-level-programming/assembly-x86 - Use to fuzz the bytecode parser/loader
skills/runtimes/fuzzing - Use for LLVM IR-based JIT (MCJIT / ORC JIT)
skills/compilers/llvm
- 使用分析解释器分发循环
skills/profilers/linux-perf - 使用理解JIT输出
skills/low-level-programming/assembly-x86 - 使用对字节码解析器/加载器进行模糊测试
skills/runtimes/fuzzing - 使用实现基于LLVM IR的JIT(MCJIT / ORC JIT)
skills/compilers/llvm