interpreters

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Interpreters and Bytecode VMs

解释器与字节码VM

Purpose

用途

Guide agents through implementing efficient bytecode interpreters and simple JITs in C/C++: dispatch strategies, VM architecture choices, and performance patterns.

指导开发者使用C/C++实现高效的字节码解释器与简单JIT：涵盖分发策略、VM架构选择以及性能优化模式。

Triggers

触发场景

"How do I implement a fast bytecode dispatch loop?"
"What is the difference between switch dispatch and computed goto?"
"How do I implement a register-based vs stack-based VM?"
"How do I add basic JIT compilation to my interpreter?"
"Why is my interpreter slow?"

"如何实现快速的字节码分发循环？"
"switch分发与计算式goto有什么区别？"
"如何实现基于寄存器与基于栈的VM？"
"如何为我的解释器添加基础JIT编译功能？"
"为什么我的解释器运行缓慢？"

Workflow

工作流程

1. VM architecture choice

1. VM架构选择

Style	Description	Examples
Stack-based	Operands on a value stack; compact bytecode	JVM, CPython, WebAssembly
Register-based	Operands in virtual registers; fewer instructions	Lua 5+, Dalvik
Direct threading	Each "instruction" is a function call	Some Forth implementations
Continuation-passing	Interpreter functions return continuations	Academic

Stack-based: easier to implement, compile to; code generation is simpler. More instructions per expression. Register-based: fewer dispatch iterations; needs register allocation in the compiler; better cache behaviour for complex expressions.

架构类型	说明	示例
基于栈	操作数存储在值栈中；字节码紧凑	JVM, CPython, WebAssembly
基于寄存器	操作数存储在虚拟寄存器中；指令数量更少	Lua 5+, Dalvik
直接线程化	每个"指令"都是一次函数调用	Some Forth implementations
续体传递	解释器函数返回续体	Academic

基于栈： 更易于实现与编译；代码生成更简单。每个表达式需要更多指令。 基于寄存器： 分发迭代次数更少；编译器中需要寄存器分配；复杂表达式的缓存表现更优。

2. Dispatch loop strategies

2. 分发循环策略

Switch dispatch (simplest, baseline)

Switch分发（最简单的基准方案）

while (1) {
    uint8_t op = *ip++;
    switch (op) {
        case OP_LOAD:  push(constants[*ip++]); break;
        case OP_ADD:   { Value b = pop(); Value a = pop(); push(a + b); } break;
        case OP_HALT:  return;
        // ...
    }
}

Problem:

switch

compiles to a single indirect branch from a jump table. Modern CPUs can mispredict it heavily because the same indirect branch is used for all opcodes.

while (1) {
    uint8_t op = *ip++;
    switch (op) {
        case OP_LOAD:  push(constants[*ip++]); break;
        case OP_ADD:   { Value b = pop(); Value a = pop(); push(a + b); } break;
        case OP_HALT:  return;
        // ...
    }
}

问题：

switch

会被编译为跳转表中的单次间接分支。现代CPU很容易出现严重的预测错误，因为所有操作码都使用同一个间接分支。

Computed goto (GCC/Clang extension — fastest portable approach)

计算式goto（GCC/Clang扩展——最快的可移植方案）

// Table of label addresses
static const void *dispatch_table[] = {
    [OP_LOAD]  = &&op_load,
    [OP_ADD]   = &&op_add,
    [OP_HALT]  = &&op_halt,
    // ...
};

#define DISPATCH() goto *dispatch_table[*ip++]

DISPATCH();  // start

op_load:
    push(constants[*ip++]);
    DISPATCH();

op_add: {
    Value b = pop(); Value a = pop(); push(a + b);
    DISPATCH();
}

op_halt:
    return;

Each opcode ends with its own indirect branch. The CPU can train the branch predictor per-opcode, dramatically improving prediction rates.

Note:

&&label

is a GCC/Clang extension, not standard C. Use

#ifdef __GNUC__

to guard and fall back to switch for other compilers.

// Table of label addresses
static const void *dispatch_table[] = {
    [OP_LOAD]  = &&op_load,
    [OP_ADD]   = &&op_add,
    [OP_HALT]  = &&op_halt,
    // ...
};

#define DISPATCH() goto *dispatch_table[*ip++]

DISPATCH();  // start

op_load:
    push(constants[*ip++]);
    DISPATCH();

op_add: {
    Value b = pop(); Value a = pop(); push(a + b);
    DISPATCH();
}

op_halt:
    return;

每个操作码都以自身的间接分支结束。CPU可以针对每个操作码训练分支预测器，大幅提升预测准确率。

注意：

&&label

是GCC/Clang扩展，不属于标准C。使用

#ifdef __GNUC__

进行条件编译，在其他编译器下回退到switch方案。

Direct threaded code (most aggressive)

直接线程化代码（最激进的方案）

Each bytecode word is a function pointer or label address; the VM is the fetch-decode-execute loop itself.

typedef void (*Handler)(VM *vm);

// Bytecode is an array of handlers
Handler bytecode[] = { op_load, op_push_1, op_add, op_halt };

for (int i = 0; ; i++) {
    bytecode[i](vm);
}

每个字节码字都是函数指针或标签地址；VM本身就是取指-译码-执行循环。

typedef void (*Handler)(VM *vm);

// Bytecode is an array of handlers
Handler bytecode[] = { op_load, op_push_1, op_add, op_halt };

for (int i = 0; ; i++) {
    bytecode[i](vm);
}

3. Value representation

3. 值表示方式

Tagged pointer: Store type tag in low bits of pointer (pointer alignment guarantees ≥ 2 bits free):

typedef uintptr_t Value;
#define TAG_INT    0x0
#define TAG_FLOAT  0x1
#define TAG_PTR    0x2
#define TAG_MASK   0x3

#define INT_VAL(v)   ((int64_t)(v) >> 2)
#define FLOAT_VAL(v) (*(float*)((v) & ~TAG_MASK))
#define IS_INT(v)    (((v) & TAG_MASK) == TAG_INT)

NaN boxing (64-bit): Store non-double values in NaN bit patterns:

// IEEE 754 quiet NaN: exponent all 1s, mantissa != 0
// Use high mantissa bits as type tag, low 48 bits as payload
// Allows pointer/int/bool/nil to fit in a double-sized slot

Used by V8 (formerly), LuaJIT, JavaScriptCore.

标记指针： 将类型标记存储在指针的低位（指针对齐保证至少有2位空闲）：

typedef uintptr_t Value;
#define TAG_INT    0x0
#define TAG_FLOAT  0x1
#define TAG_PTR    0x2
#define TAG_MASK   0x3

#define INT_VAL(v)   ((int64_t)(v) >> 2)
#define FLOAT_VAL(v) (*(float*)((v) & ~TAG_MASK))
#define IS_INT(v)    (((v) & TAG_MASK) == TAG_INT)

NaN装箱（64位）： 将非双精度值存储在NaN位模式中：

// IEEE 754 quiet NaN: exponent all 1s, mantissa != 0
// Use high mantissa bits as type tag, low 48 bits as payload
// Allows pointer/int/bool/nil to fit in a double-sized slot

被V8（旧版本）、LuaJIT、JavaScriptCore采用。

4. Stack management

4. 栈管理

#define STACK_SIZE 4096
Value stack[STACK_SIZE];
Value *sp = stack;  // stack pointer

#define PUSH(v) (*sp++ = (v))
#define POP()   (*--sp)
#define TOP()   (sp[-1])
#define PEEK(n) (sp[-(n)-1])

// Check for overflow
#define PUSH_SAFE(v) do { \
    if (sp >= stack + STACK_SIZE) { vm_error("stack overflow"); } \
    PUSH(v); \
} while(0)

#define STACK_SIZE 4096
Value stack[STACK_SIZE];
Value *sp = stack;  // stack pointer

#define PUSH(v) (*sp++ = (v))
#define POP()   (*--sp)
#define TOP()   (sp[-1])
#define PEEK(n) (sp[-(n)-1])

// Check for overflow
#define PUSH_SAFE(v) do { \
    if (sp >= stack + STACK_SIZE) { vm_error("stack overflow"); } \
    PUSH(v); \
} while(0)

5. Inline caching (IC)

5. 内联缓存（IC）

Inline caching speeds up property lookups and method dispatch by caching the last observed type at each call site.

struct CallSite {
    Type   cached_type;      // Last observed receiver type
    void  *cached_method;    // Cached function pointer
    int    miss_count;       // Number of misses
};

void invoke_method(VM *vm, CallSite *cs, Value receiver, ...) {
    Type t = GET_TYPE(receiver);
    if (t == cs->cached_type) {
        // Cache hit: direct call, no lookup
        cs->cached_method(vm, receiver, ...);
    } else {
        // Cache miss: look up, update cache
        void *method = lookup_method(t, name);
        cs->cached_type = t;
        cs->cached_method = method;
        cs->miss_count++;
        method(vm, receiver, ...);
    }
}

Polymorphic IC (PIC): cache up to N (typ. 4) type-method pairs.

内联缓存通过在每个调用点缓存最近观察到的类型，来加速属性查找与方法分发。

struct CallSite {
    Type   cached_type;      // Last observed receiver type
    void  *cached_method;    // Cached function pointer
    int    miss_count;       // Number of misses
};

void invoke_method(VM *vm, CallSite *cs, Value receiver, ...) {
    Type t = GET_TYPE(receiver);
    if (t == cs->cached_type) {
        // Cache hit: direct call, no lookup
        cs->cached_method(vm, receiver, ...);
    } else {
        // Cache miss: look up, update cache
        void *method = lookup_method(t, name);
        cs->cached_type = t;
        cs->cached_method = method;
        cs->miss_count++;
        method(vm, receiver, ...);
    }
}

多态IC（PIC）：最多缓存N个（通常为4个）类型-方法对。

6. Simple JIT (mmap + machine code)

6. 简单JIT（mmap + 机器码）

For x86-64: allocate executable memory, write machine code bytes, call it.

#include <sys/mman.h>
#include <string.h>

typedef int (*JitFn)(int a, int b);

JitFn jit_compile_add(void) {
    // x86-64: add rdi, rsi; mov rax, rdi; ret
    static const uint8_t code[] = {
        0x48, 0x01, 0xF7,   // add rdi, rsi
        0x48, 0x89, 0xF8,   // mov rax, rdi
        0xC3                 // ret
    };

    void *mem = mmap(NULL, sizeof(code),
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;

    memcpy(mem, code, sizeof(code));

    // On some systems (macOS Apple Silicon): must use mprotect
    // mprotect(mem, sizeof(code), PROT_READ | PROT_EXEC);

    return (JitFn)mem;
}

On macOS Apple Silicon (M-series): use

pthread_jit_write_protect_np()

MAP_JIT

flag.

针对x86-64：分配可执行内存，写入机器码字节，然后调用它。

#include <sys/mman.h>
#include <string.h>

typedef int (*JitFn)(int a, int b);

JitFn jit_compile_add(void) {
    // x86-64: add rdi, rsi; mov rax, rdi; ret
    static const uint8_t code[] = {
        0x48, 0x01, 0xF7,   // add rdi, rsi
        0x48, 0x89, 0xF8,   // mov rax, rdi
        0xC3                 // ret
    };

    void *mem = mmap(NULL, sizeof(code),
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;

    memcpy(mem, code, sizeof(code));

    // On some systems (macOS Apple Silicon): must use mprotect
    // mprotect(mem, sizeof(code), PROT_READ | PROT_EXEC);

    return (JitFn)mem;
}

在macOS Apple Silicon（M系列）上：使用

pthread_jit_write_protect_np()

或

MAP_JIT

标志。

7. Performance tips

7. 性能优化技巧

Dispatch: Use computed goto over switch on GCC/Clang
Values: Use NaN boxing or tagged pointers; avoid boxing/unboxing in hot paths
Stack: Keep stack pointer in a callee-saved register (
```
register Value *sp asm("r15")
```
— GCC global register variable)
Locals access: Keep frequently accessed locals in VM registers (struct fields), not stack
Profiling: Use
```
perf
```
or sampling to find dispatch overhead vs actual work
Specialisation: Generate specialised handler variants for common type combinations (int+int add vs generic add)
Trace recording: Trace JITs (LuaJIT approach) compile hot traces instead of full functions

For a benchmark of dispatch strategies, see references/benchmarks.md.

分发： 在GCC/Clang上优先使用计算式goto而非switch
值表示： 使用NaN装箱或标记指针；避免在热点路径中进行装箱/拆箱操作
栈：将栈指针保存在被调用者保存的寄存器中（
```
register Value *sp asm("r15")
```
— GCC全局寄存器变量）
本地变量访问： 将频繁访问的本地变量保存在VM寄存器（结构体字段）中，而非栈上
性能分析： 使用
```
perf
```
或采样工具找出分发开销与实际工作的占比
特化： 为常见类型组合生成专用的处理器变体（如int+int加法与通用加法）
追踪记录： 追踪式JIT（LuaJIT方案）编译热点追踪而非完整函数

关于分发策略的基准测试，参见references/benchmarks.md。