Copy-and-Patch Compilation

Stencil-driven code generation that harvests pre-compiled opcode snippets at build time and stitches them into native code at runtime via relocation patches. At runtime there is no IR, no register allocator, and no instruction selector. CPython 3.13+ ships the technique as an experimental, opt-in JIT.

§1 Provenance

§2 Technique / contribution

The system is split between build time (when you compile the language runtime) and runtime (when the JIT runs).

Build time:

  1. For every opcode in your bytecode, write a C function that implements its semantics. Use placeholders (“holes”) for compile-time-known values like jump targets, immediates, and continuation pointers.
  2. Compile each such function with Clang/LLVM at -O2 or -O3, producing an ELF/Mach-O object file with relocation entries.
  3. Run a Python script that parses the object file, extracts the body of each function as raw bytes (“stencil”), and records the relocation entries as “hole” positions in those bytes.
  4. Emit a giant C array of stencils, one entry per (opcode, signature) pair.
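
As a rough illustration of what step 4 might emit, here is a minimal sketch of a stencil table. The type names (Stencil, Hole, HoleKind, HoleValue), the single OP_LOAD_CONST entry, and the helper functions are hypothetical, not taken from any real generator. The example body is the x86-64 encoding of movabs rax, imm64, whose 8-byte immediate is the hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hole kinds; a real table would mirror the object file's relocation types. */
typedef enum { HOLE_ABS64, HOLE_PC32 } HoleKind;
/* Hypothetical: which runtime value fills the hole. */
typedef enum { VAL_IMMEDIATE, VAL_CONTINUATION } HoleValue;

typedef struct {
    uint32_t  offset;   /* byte offset of the hole inside the stencil body */
    HoleKind  kind;
    HoleValue value;
    int64_t   addend;   /* relocation addend copied from the object file */
} Hole;

typedef struct {
    const uint8_t *bytes;
    size_t         size;
    const Hole    *holes;
    size_t         n_holes;
} Stencil;

/* x86-64: movabs rax, imm64 (48 B8 followed by 8 immediate bytes). */
static const uint8_t load_const_bytes[] = { 0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0 };
static const Hole    load_const_holes[] = { { 2, HOLE_ABS64, VAL_IMMEDIATE, 0 } };

enum { OP_LOAD_CONST, OP_COUNT };

static const Stencil stencils[OP_COUNT] = {
    [OP_LOAD_CONST] = { load_const_bytes, sizeof load_const_bytes,
                        load_const_holes, 1 },
};

/* Look up the stencil for an opcode. */
static const Stencil *stencil_for(int op) {
    assert(op >= 0 && op < OP_COUNT);
    return &stencils[op];
}

/* Sanity check a generator could run: every hole must lie inside the body. */
static int stencil_is_well_formed(const Stencil *s) {
    for (size_t i = 0; i < s->n_holes; i++) {
        size_t width = s->holes[i].kind == HOLE_ABS64 ? 8 : 4;
        if (s->holes[i].offset + width > s->size) return 0;
    }
    return 1;
}
```

Keeping the addend next to each hole lets the runtime patcher apply the standard relocation formulas without hard-coding instruction lengths.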

Runtime:

  1. Walk the input bytecode.
  2. For each op, look up its stencil and memcpy the stencil bytes into a writable code buffer that was allocated once up front (the “copy” step).
  3. For each hole in the stencil, write the runtime-known value at the hole’s byte offset (the “patch” step). Patches handle absolute addresses, PC-relative offsets, and immediates uniformly.
  4. Once everything is copied and patched, mark the buffer executable and jump to it.
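
The buffer handling in steps 2 and 4 could look roughly like the sketch below, which maps the region writable first and only flips it to executable after patching. It assumes a POSIX system with mmap and mprotect. CodeBuffer and the code_buffer_* helpers are hypothetical names, and platform-specific requirements (for example extra mapping flags on some hardened systems) are not handled.

```c
#define _DEFAULT_SOURCE           /* exposes MAP_ANONYMOUS under strict -std modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON    /* older BSD-style spelling */
#endif

typedef struct {
    uint8_t *base;
    size_t   size;
} CodeBuffer;

/* Reserve a page-aligned region that is writable but not yet executable.
 * Returns 0 on success, -1 on failure. */
static int code_buffer_open(CodeBuffer *cb, size_t size) {
    assert(cb != NULL && size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    cb->base = p;
    cb->size = size;
    return 0;
}

/* After all stencils are copied and patched: drop write, add execute. */
static int code_buffer_seal(CodeBuffer *cb) {
    return mprotect(cb->base, cb->size, PROT_READ | PROT_EXEC);
}

/* Release the region. */
static int code_buffer_close(CodeBuffer *cb) {
    return munmap(cb->base, cb->size);
}
```

A caller would open the buffer, run the copy and patch steps into cb.base, seal it, and then call into the region through a function pointer.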

Pseudo-code of the patcher:

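#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Op, Operands, Stencil, Hole, stencils[] and resolve() are declared elsewhere. */
void emit(Op op, Operands ops, uint8_t **pc_out) {
    const Stencil *s = stencils[op];
    uint8_t *dst = *pc_out;
    memcpy(dst, s->bytes, s->size);                      /* the "copy" step */
    for (const Hole *h = s->holes; h != NULL; h = h->next) {
        uint8_t *loc = dst + h->offset;                  /* P: the patched location */
        uint64_t value = resolve(h, ops, dst, pc_out);   /* assumed to return S + A, addend included */
        switch (h->kind) {
        case R_X86_64_64:                                /* S + A, 64-bit absolute */
            memcpy(loc, &value, sizeof value);           /* memcpy: the hole may be unaligned */
            break;
        case R_X86_64_PLT32:                             /* no PLT in JIT code: treat as PC32 */
        case R_X86_64_PC32: {                            /* S + A - P, 32-bit signed */
            int32_t rel = (int32_t)(value - (uint64_t)(uintptr_t)loc);
            memcpy(loc, &rel, sizeof rel);
            break;
        }
        default: abort();                                /* unknown relocation kind */
        }
    }
    *pc_out += s->size;
}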
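
To make the PC-relative case concrete, the sketch below applies the ELF formula S + A - P to one hole and rejects displacements that do not fit in 32 bits. The function name patch_pc32 and its parameters are hypothetical. The buffer's eventual load address is passed separately from the pointer being written, so the arithmetic can be checked without executable memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Patch a 32-bit PC-relative hole.
 *   buf      : writable copy of the stencil bytes
 *   buf_addr : address the bytes will run at
 *   offset   : byte offset of the hole within buf
 *   symbol   : S, the target address
 *   addend   : A, taken from the relocation entry (commonly -4 for call/jmp)
 * Returns 0 on success, -1 if the displacement does not fit in 32 bits. */
static int patch_pc32(uint8_t *buf, uint64_t buf_addr, uint32_t offset,
                      uint64_t symbol, int64_t addend) {
    assert(buf != NULL);
    uint64_t place = buf_addr + offset;                           /* P */
    int64_t rel = (int64_t)(symbol + (uint64_t)addend - place);   /* S + A - P */
    if (rel < INT32_MIN || rel > INT32_MAX) return -1;
    int32_t rel32 = (int32_t)rel;
    memcpy(buf + offset, &rel32, sizeof rel32);                   /* hole may be unaligned */
    return 0;
}
```

Worked through for a call at 0x1000 (opcode E8, hole at offset 1) targeting 0x2000 with addend -4: 0x2000 - 4 - 0x1001 = 0xFFB. The CPU adds that displacement to the address of the next instruction, 0x1005, and lands on 0x2000.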

What is conspicuously absent: no IR, no SSA, no register allocator, no instruction selector, no peephole pass. The LLVM optimizer ran at build time once, on each stencil, in isolation.

§3 Where it shines, where it fails

Shines:

  • Compilation is roughly 100x faster than with LLVM (the original copy-and-patch paper reports two orders of magnitude against LLVM -O0), because runtime work is just memcpy plus a handful of stores.
  • Generated code is roughly 2x slower than LLVM -O2, but ~5x faster than a switch-dispatch interpreter.
  • The patcher is tiny (~100 LOC C, ~1000 LOC Python for stencil extraction). All the complexity sits in LLVM, used once at build time.
  • Stencils can be regenerated by re-running Clang, so quality scales with whatever LLVM does. Free improvements over time.
  • Easy to cross-compile: just run Clang for the target triple.

Fails:

  • There is no register allocation across stencils. Each stencil uses, and clobbers, whatever fixed set of registers Clang assigned when it was compiled at build time.
  • Cannot constant-fold across opcodes.
  • Stencil binaries are platform-specific; you ship one set of stencils per (OS, ISA, ABI) tuple.
  • Bug class: ABI mismatches between Clang’s stencil output and the JIT runtime are notoriously hard to debug.

Compile-time profile: O(n) in the number of bytecodes, with very small constants. The CPython figures cover two different metrics: code generation ~10x faster than LLVM, and generated code ~2-9% faster than the tier-2 micro-op interpreter.

§4 Status (May 2026)

  • CPython 3.13 (Oct 2024) ships an opt-in copy-and-patch JIT, enabled at build time with the configure flag --enable-experimental-jit. Brandt Bucher led the port. Performance: 2-9% over the tier-2 interpreter as of 3.13, with improvements expected in 3.14 and 3.15.
  • Microsoft laid off most of the Faster CPython team in May 2025, but Bucher and several others continue the work.
  • Follow-up academic work includes Bansal et al., “Lightweight and Locality-Aware Composition of Black-Box Subroutines” (PLDI 2025, https://dl.acm.org/doi/10.1145/3729292), which generalizes the stencil idea to library composition.
  • Research uptake: the technique is taught in compiler courses and has spawned several thesis projects.
  • Size: the CPython JIT still consists of only ~1000 lines of fairly complex Python build-time tooling plus ~100 lines of C runtime, so the technique remains genuinely simple in production.

§5 Engineering cost for Mochi

A Mochi copy-and-patch implementation would consist of:

  1. Build-time stencil extractor (Python or Go script, ~500-1500 LOC):

    • One C function per vm3.Op, written by hand or generated from runtime/vm3/op.go.
    • Compile each with Clang at -O2 -fno-asynchronous-unwind-tables -fno-jump-tables.
    • Parse the resulting .o files using debug/elf or debug/macho from the Go stdlib, no Clang dependency at runtime.
    • Emit a generated Go file with var stencils = [...]Stencil{...}.
  2. Runtime patcher (Go, ~200 LOC):

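    • Allocate an mmap’d executable region (unix.Mmap and unix.Mprotect from golang.org/x/sys/unix).
    • Implement copy() and patch() for x86-64 first (ELF relocations R_X86_64_64, R_X86_64_PC32, R_X86_64_PLT32; Mach-O objects on macOS use their own relocation types, so the extractor must map them onto the same hole kinds).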
    • Add arm64 (R_AARCH64_ADR_PREL_PG_HI21, R_AARCH64_ADD_ABS_LO12_NC, R_AARCH64_CALL26) as a phase 2.
  3. Compiler3 driver (~300 LOC):

    • Walk compiler3/ir/, emit one stencil per IR op.
    • Resolve jump targets via a label-pass before patching.
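
The label pass in the driver can be sketched as two loops: the first assigns each op a byte offset from its stencil size, and the second copies bodies and patches jump holes now that every target offset is known. The three opcodes, the Insn struct, and the hand-written x86-64 bodies below are hypothetical stand-ins for extracted stencils.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OP_NOP, OP_JUMP, OP_RET, OP_COUNT };

typedef struct {
    int op;
    int target;   /* index of the destination op; used by OP_JUMP only */
} Insn;

/* Hand-written x86-64 bodies standing in for extracted stencils. */
static const uint8_t nop_body[]  = { 0x90 };              /* nop */
static const uint8_t jump_body[] = { 0xE9, 0, 0, 0, 0 };  /* jmp rel32, hole at offset 1 */
static const uint8_t ret_body[]  = { 0xC3 };              /* ret */

static const struct { const uint8_t *bytes; size_t size; } bodies[OP_COUNT] = {
    [OP_NOP]  = { nop_body,  sizeof nop_body },
    [OP_JUMP] = { jump_body, sizeof jump_body },
    [OP_RET]  = { ret_body,  sizeof ret_body },
};

/* Pass 1: labels[i] is the byte offset of op i; labels[n] is the total size.
 * The labels array must have room for n + 1 entries. */
static size_t assign_labels(const Insn *code, size_t n, size_t *labels) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        labels[i] = off;
        off += bodies[code[i].op].size;
    }
    labels[n] = off;
    return off;
}

/* Pass 2: copy each body, then patch jump displacements. A jmp rel32 is
 * relative to the end of the jump, which is labels[i + 1] here because the
 * jump stencil consists of exactly that one instruction. */
static void emit_all(const Insn *code, size_t n, const size_t *labels, uint8_t *out) {
    for (size_t i = 0; i < n; i++) {
        uint8_t *dst = out + labels[i];
        memcpy(dst, bodies[code[i].op].bytes, bodies[code[i].op].size);
        if (code[i].op == OP_JUMP) {
            assert(code[i].target >= 0 && (size_t)code[i].target <= n);
            int32_t rel = (int32_t)((int64_t)labels[code[i].target] -
                                    (int64_t)labels[i + 1]);
            memcpy(dst + 1, &rel, sizeof rel);
        }
    }
}
```

Because stencil sizes are fixed at build time, the first pass needs no fix-up iteration: offsets never change once assigned.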

Total estimated effort: 6-10 weeks for x86-64 on macOS and Linux, end to end. arm64 adds another 2-3 weeks.
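
Part of the extra arm64 effort comes from the relocation encodings: instead of overwriting whole bytes, the patcher must merge bit fields into 32-bit instruction words. The sketch below covers the three phase-2 relocations, following the field layouts in the AArch64 ELF ABI. The function names are hypothetical, and target is assumed to already include the addend (S + A).

```c
#include <assert.h>
#include <stdint.h>

/* R_AARCH64_CALL26: bits [25:0] of B/BL hold (S + A - P) >> 2. */
static uint32_t patch_call26(uint32_t insn, uint64_t target, uint64_t place) {
    int64_t delta = (int64_t)(target - place);
    assert((delta & 3) == 0);                                       /* word aligned */
    assert(delta >= -(INT64_C(1) << 27) && delta < (INT64_C(1) << 27));
    uint32_t imm26 = (uint32_t)((uint64_t)delta >> 2) & 0x03FFFFFFu;
    return (insn & ~0x03FFFFFFu) | imm26;
}

/* R_AARCH64_ADR_PREL_PG_HI21: ADRP takes Page(S + A) - Page(P) in 4 KiB
 * pages, split into immlo (bits [30:29]) and immhi (bits [23:5]). */
static uint32_t patch_adrp(uint32_t insn, uint64_t target, uint64_t place) {
    int64_t diff  = (int64_t)((target & ~UINT64_C(0xFFF)) - (place & ~UINT64_C(0xFFF)));
    int64_t pages = diff / 4096;                                    /* exact: diff is page aligned */
    assert(pages >= -(INT64_C(1) << 20) && pages < (INT64_C(1) << 20));
    uint32_t imm21 = (uint32_t)pages & 0x1FFFFFu;
    uint32_t immlo = imm21 & 3u;
    uint32_t immhi = imm21 >> 2;
    return (insn & 0x9F00001Fu) | (immlo << 29) | (immhi << 5);
}

/* R_AARCH64_ADD_ABS_LO12_NC: bits [21:10] of ADD (immediate) hold
 * (S + A) & 0xFFF, with no overflow check ("NC"). */
static uint32_t patch_add_lo12(uint32_t insn, uint64_t target) {
    uint32_t imm12 = (uint32_t)(target & 0xFFF);
    return (insn & ~(0xFFFu << 10)) | (imm12 << 10);
}
```

An ADRP plus ADD pair patched this way materializes a full address in two instructions, which is the usual way Clang references a symbol from arm64 code.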

Notably, this fits “naive but correct” perfectly: we get LLVM-quality opcode bodies for free, with a runtime that is essentially memcpy + store.

§6 Mochi adaptation note

  • runtime/vm3/op.go enumerates 100+ Mochi opcodes. Each gets a C stencil function.
  • runtime/vm3/cell.go defines the 8-byte handle Cell. The C stencils manipulate Cell-shaped uint64_t values.
  • runtime/vm3/arenas.go defines the 12 typed arenas. Stencil prologues load arena base pointers from fixed register slots (e.g., R13 = int-arena base, R14 = string-arena base).
  • compiler3/emit/ becomes the runtime patcher. The build-time stencil set lives in a new compiler3/stencils/ package.
  • The three-bank register file in runtime/vm3/frame.go maps to fixed callee-save registers in the stencil ABI. This is the load-bearing design decision.

The key Mochi simplification: we already have a typed bytecode. CPython’s stencils operate on untyped PyObject * values, so type checks and guards remain in the generated code. Mochi can have a stencil_add_int_int distinct from stencil_add_int_float, eliminating runtime type tests.
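
The difference can be sketched in C as follows. The Cell layout used here (a raw 8-byte payload holding either an int64 or an IEEE-754 double bit pattern), the wrapping overflow behaviour, and the Boxed/generic_add contrast case are all assumptions made for illustration. The real layout lives in runtime/vm3/cell.go.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Cell;   /* assumption: raw 8-byte payload, no tag bits */

static Cell    cell_from_int(int64_t v)  { Cell c; memcpy(&c, &v, sizeof c); return c; }
static int64_t cell_to_int(Cell c)       { int64_t v; memcpy(&v, &c, sizeof v); return v; }
static Cell    cell_from_float(double v) { Cell c; memcpy(&c, &v, sizeof c); return c; }
static double  cell_to_float(Cell c)     { double v; memcpy(&v, &c, sizeof v); return v; }

/* Typed stencils: operand types are fixed by the bytecode, so the body
 * contains no tag test. Integer addition wraps (an assumption here). */
static Cell stencil_add_int_int(Cell a, Cell b) {
    return a + b;        /* two's-complement add on the raw payloads */
}

static Cell stencil_add_int_float(Cell a, Cell b) {
    return cell_from_float((double)cell_to_int(a) + cell_to_float(b));
}

/* Untyped contrast case: the value carries a tag that must be inspected
 * on every execution before the right arithmetic can be chosen. */
typedef enum { TAG_INT, TAG_FLOAT } Tag;
typedef struct { Tag tag; Cell payload; } Boxed;

static Boxed generic_add(Boxed a, Boxed b) {
    assert(a.tag <= TAG_FLOAT && b.tag <= TAG_FLOAT);
    if (a.tag == TAG_INT && b.tag == TAG_INT)
        return (Boxed){ TAG_INT, stencil_add_int_int(a.payload, b.payload) };
    double x = a.tag == TAG_INT ? (double)cell_to_int(a.payload) : cell_to_float(a.payload);
    double y = b.tag == TAG_INT ? (double)cell_to_int(b.payload) : cell_to_float(b.payload);
    return (Boxed){ TAG_FLOAT, cell_from_float(x + y) };
}
```

With typed stencils the branch in generic_add disappears from the generated code, because the compiler picks the stencil variant when it emits the op.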

§7 Open questions for MEP-42

  • Do we hand-write the C stencil functions, or auto-generate them from a DSL?
  • How do we test that the build-time Clang and the runtime patcher agree on ABI? Differential testing against the vm3 interpreter is a must.
  • Stencil size budget: at what point does code-cache pressure dominate?
  • Multi-op fusion: can we ship pre-fused stencils for common pairs (e.g., load+add) the way Erlang BEAM ships super-instructions?
  • Cross-platform stencils: do we ship one set of stencils per (OS, ISA) tuple, or compile them lazily on first use of each platform?
  • Relocation kinds: the minimal set is 64, PC32, PLT32 for x86-64. Do we want to support the GOT relocations for shared-library calls from stencils?
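
For the multi-op fusion question, one possible shape is a pre-pass over the bytecode that rewrites adjacent pairs into a fused opcode, which then maps to its own pre-compiled stencil. The opcodes and the fuse_pairs helper below are hypothetical. The sketch also ignores jump targets: a real pass would have to avoid fusing across a label and would need to remap target indices afterwards.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_LOAD, OP_ADD, OP_STORE, OP_LOAD_ADD /* fused LOAD followed by ADD */ };

typedef struct {
    int op;
    int arg;
} Insn;

/* Rewrite code in place, replacing each LOAD immediately followed by ADD
 * with a single LOAD_ADD that keeps the LOAD's argument. Returns the new
 * instruction count. */
static size_t fuse_pairs(Insn *code, size_t n) {
    assert(code != NULL || n == 0);
    size_t w = 0;                         /* write cursor, never ahead of r */
    for (size_t r = 0; r < n; r++) {
        if (code[r].op == OP_LOAD && r + 1 < n && code[r + 1].op == OP_ADD) {
            Insn fused = { OP_LOAD_ADD, code[r].arg };
            code[w++] = fused;
            r++;                          /* skip the ADD that was absorbed */
        } else {
            code[w++] = code[r];
        }
    }
    return w;
}
```

Each fused opcode costs one more stencil in the table, so the choice of pairs ties directly into the stencil size budget question above.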

§8 References