Loading...
Loading...
Write high-performance JIT compilers and interpreters in the style of Mike Pall, creator of LuaJIT. Emphasizes trace-based compilation, aggressive optimization, and understanding CPU microarchitecture. Use when building JITs, interpreters, or any code where every cycle counts.
npx skill4agent add copyleftdev/sk1llz pall-jit-mastery

"Measure, don't guess."
"The fastest code is code that doesn't run."
"Understand your hardware or it will humble you."
// LuaJIT's key insight: trace hot paths, not methods.
// A trace is a linear sequence of operations -- no internal control flow;
// divergence is handled by guards and side exits (see SnapShot below).
typedef struct Trace {
    uint32_t *mcode;    // Generated machine code
    IRIns *ir;          // IR instructions (see IRIns: 8-byte, index-linked)
    uint16_t nins;      // Number of IR instructions
    uint16_t nk;        // Number of constants
    SnapShot *snap;     // Side exit snapshots
    uint16_t nsnap;     // Number of snapshots in 'snap'
    // Link to next trace (for loops) -- lets a completed trace jump
    // directly into another trace instead of back to the interpreter.
    struct Trace *link;
} Trace;
// Recording: capture operations as they execute in the interpreter.
// Each executed bytecode is translated into trace IR via emitir(); the
// recorded operand becomes the new IR ref for the destination slot.
void record_instruction(JitState *J, BCIns ins) {
    switch (bc_op(ins)) {
    case BC_ADDVN: {
        // Braces required: a declaration may not directly follow a case
        // label (a label must precede a statement in C before C23).
        // Record: result = slot[A] + constant[D]
        TRef tr = emitir(IR_ADD, J->slots[bc_a(ins)],
                         lj_ir_knum(J, bc_d(ins)));
        J->slots[bc_a(ins)] = tr;
        break;
    }
    // ... other bytecodes
    default:
        // Unhandled opcode: record nothing.
        // NOTE(review): a real recorder would likely abort the trace here.
        break;
    }
}
// Key: traces are LINEAR
// No control flow in the trace itself
// Side exits handle divergence

// LuaJIT IR: compact, cache-friendly, no pointers
// One IR instruction. Operands are 16-bit IR references (indices into the
// IR array), never pointers -- see IRREF_BIAS below.
typedef struct IRIns {
    uint16_t op;    // Operation + type
    uint16_t op1;   // First operand (IR ref or slot)
    uint16_t op2;   // Second operand
    uint16_t prev;  // Previous instruction with same opcode (for CSE chains)
} IRIns; // 4 x 2 bytes = 8 bytes total, fits in a cache line nicely
// IR references are indices, not pointers
// Enables: compact storage, easy serialization, cache efficiency
#define IRREF_BIAS 0x8000
#define irref_isk(r) ((r) < IRREF_BIAS) // Is constant?
// Constants are stored separately and get biased refs BELOW IRREF_BIAS
// (they grow downward from the bias, acting like "negative" indices);
// instructions are stored linearly with refs at/above IRREF_BIAS.
// Example IR sequence for: x = a + b * c
// K001 NUM 3.14 -- constant
// 0001 SLOAD #1 -- load slot 1 (a)
// 0002 SLOAD #2 -- load slot 2 (b)
// 0003 SLOAD #3 -- load slot 3 (c)
// 0004 MUL 0002 0003 -- b * c
// 0005 ADD 0001 0004 -- a + (b * c)

// Traces assume types and values
// Guards verify assumptions, exit if wrong
// Emit a type guard for 'tr' unless its type is already known to match.
// A failing guard at runtime takes a side exit; the snapshot recorded here
// tells the exit handler how to rebuild the interpreter state.
void emit_guard(JitState *J, IRType expected, TRef tr) {
    IRIns *ins = &J->cur.ir[tref_ref(tr)];
    if (ins->t == expected)
        return; // type already proven on this trace -- no guard needed
    // Emit the guard, then snapshot the state for the side exit.
    emitir(IR_GUARD, tr, expected);
    snapshot_add(J);
}
// Side exit: restore interpreter state, continue there.
// A snapshot maps interpreter stack slots back to the IR values that
// hold their contents at this point in the trace.
typedef struct SnapShot {
    uint16_t ref;     // First IR ref in snapshot
    uint8_t nslots;   // Number of slots to restore
    uint8_t topslot;  // Top slot number
    uint32_t *map;    // Slot -> IR ref mapping
} SnapShot;
// When guard fails:
// 1. Look up snapshot for this guard
// 2. Restore Lua stack from IR values
// 3. Jump back to interpreter
// 4. Maybe record a new trace from exit point

// LuaJIT generates assembly directly
// Every instruction chosen deliberately.
// x86-64 code emission helpers.
// Code is emitted BACKWARDS: mcp is pre-decremented on each store, so the
// LAST byte stored becomes the FIRST byte in memory. The final encoding
// must be [REX] [opcode] [ModRM], so the store order is ModRM, opcode,
// and the REX prefix last. (The original stored REX first, which placed
// it AFTER the opcode in memory -- an invalid instruction.)
static void emit_rr(ASMState *as, x86Op op, Reg r1, Reg r2) {
    // ModRM: mod=11 (register-direct), reg=r1, rm=r2
    *--as->mcp = 0xc0 | ((r1 & 7) << 3) | (r2 & 7);
    // Opcode byte
    *--as->mcp = op;
    // REX prefix (0x40 | R<<2 | B), only when r8-r15 is involved
    if (r1 >= 8 || r2 >= 8) {
        *--as->mcp = 0x40 | ((r1 >> 3) << 2) | (r2 >> 3);
    }
}
// Register allocation: linear scan, but smarter.
// Allocating backwards from the trace end means every USE of a value has
// already been seen by the time its definition is reached.
void ra_allocate(ASMState *as) {
    // Walk the IR from the newest instruction down to the stop point.
    // NOTE(review): if IRRef is unsigned and stopins can be 0, the loop
    // condition never goes false -- confirm stopins > 0 invariant.
    IRRef ref = as->curins;
    for (; ref >= as->stopins; ref--) {
        IRIns *ins = &as->ir[ref];
        Reg dest = ra_dest(as, ins);  // assign the destination register first
        ra_left(as, ins, dest);       // then the left source operand
        ra_right(as, ins);            // then the right source operand
    }
}
// Key insight: backwards allocation sees all uses
// Can make better spill decisions

// Cache-friendly data structures are critical
// BAD: Linked list of tagged nodes -- every step is a dependent load
// (pointer chase), so the prefetcher cannot help.
struct Node {
    struct Node *next;  // next pointer: the cache-hostile part
    int type;           // runtime type tag for the union below
    union {
        double num;
        struct String *str;
        // ...
    } value;
};
// GOOD: Separate arrays by type (SoA -- structure of arrays).
// Types and values each stream sequentially through the cache.
struct ValueArray {
    uint8_t *types;  // Type tags: sequential access
    TValue *values;  // Values: sequential access
    size_t count;    // Number of elements in both arrays
};
// Iteration patterns matter enormously
// This is ~10x faster than pointer chasing:
for (size_t i = 0; i < arr->count; i++) {
    if (arr->types[i] == TYPE_NUMBER) {
        sum += arr->values[i].n;  // contiguous loads: prefetcher-friendly
    }
}

// Polymorphic inline cache for property access
// Avoids hash lookup in common case: memoizes the (shape -> slot offset)
// mapping observed at one particular access site.
typedef struct InlineCache {
    uint32_t shape_id;  // Expected object shape (last shape seen here)
    uint16_t offset;    // Cached property offset under that shape
    uint16_t _pad;      // Explicit padding: keeps the struct at 8 bytes
} InlineCache;
// Property read through an inline cache.
// Hit: one compare + one direct slot load. Miss: full shape lookup,
// then the cache is re-primed for the shape just seen.
TValue get_property_cached(Object *obj, String *key, InlineCache *ic) {
    // Fast path: the object's shape matches what this site saw last time.
    if (likely(obj->shape_id == ic->shape_id))
        return obj->slots[ic->offset]; // Direct access!
    // Slow path: full lookup, then memoize for the next access.
    // NOTE(review): behavior when 'key' is absent depends on what
    // shape_lookup returns on a miss -- confirm.
    uint16_t slot = shape_lookup(obj->shape, key);
    ic->shape_id = obj->shape_id;
    ic->offset = slot;
    return obj->slots[slot];
}
// Monomorphic: one shape, one offset
// Polymorphic: small set of shapes
// Megamorphic: too many shapes, fall back to hash

// CPUs predict branches; help them be right
// BAD: Unpredictable branches in hot loop
for (int i = 0; i < n; i++) {
    if (data[i] > threshold) { // ~50% taken = the predictor can't learn it
        sum += data[i];
    }
}
// GOOD: Branchless version (replaces a possible mispredict with an AND)
for (int i = 0; i < n; i++) {
    int mask = -(data[i] > threshold); // 0 or -1 (all bits set)
    sum += data[i] & mask;
}
// GOOD: Sort first if possible
qsort(data, n, sizeof(int), compare);
for (int i = 0; i < n && data[i] <= threshold; i++) {
    // All branches now predictable (data is monotone)
}
// Loop unrolling: reduce branch overhead
// NOTE(review): this only processes multiples of 4; a scalar tail loop
// for the remaining n % 4 elements is still needed.
for (int i = 0; i + 4 <= n; i += 4) {
    sum += data[i];
    sum += data[i + 1];
    sum += data[i + 2];
    sum += data[i + 3];
}

// Generate specialized code for each type combination
// LuaJIT specializes aggressively
// Generic add (slow): must dispatch on the dynamic types on every call.
TValue generic_add(TValue a, TValue b) {
    if (tvisnum(a) && tvisnum(b)) {
        return numV(numV(a) + numV(b)); // number + number
    } else if (tvisstr(a) || tvisstr(b)) {
        return concat(tostring(a), tostring(b)); // string coercion path
    }
    // ... metamethod lookup
    // NOTE(review): as written, control falls off the end of a non-void
    // function here -- undefined behavior if the caller uses the result.
    // The elided metamethod path presumably returns a value; confirm.
}
// Specialized add for numbers (fast path).
// Generated once a trace has proven both operands are numbers; the type
// guard that established this lives earlier in the trace.
double specialized_add_nn(double a, double b) {
    double result = a + b; // lowers to a single add instruction
    return result;
}
// Type guards ensure specialization is valid
// Side exit if types don't match expected

CPU Pipeline Awareness
══════════════════════════════════════════════════════════════
Latency (cycles) Operation
────────────────────────────────────────────────────────────
1 Register-to-register ALU
3-4 L1 cache hit
~12 L2 cache hit
~40 L3 cache hit
~200 Main memory
~10-20 Branch mispredict penalty
~100+ Page fault
Key insight: Memory is the bottleneck
Computation is nearly free by comparison
Optimize for memory access patterns first