Stencil-driven code generation: pre-compiled opcode snippets are harvested at build time and stitched into native code at runtime via relocation patches. No runtime IR, no register allocator, no instruction selector. This is the technique behind the opt-in experimental JIT that CPython has shipped since 3.13.
§1 Provenance
- Authors: Haoran Xu and Fredrik Kjolstad (Stanford University).
- Venue: OOPSLA 2021 (Proceedings of the ACM on Programming Languages, vol. 5), “Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode.”
- Canonical PDF: https://fredrikbk.com/publications/copy-and-patch.pdf.
- ACM DOI: 10.1145/3485513.
- Production deployment: CPython 3.13 JIT (Brandt Bucher, Sept-Oct 2024).
- PEP 744: https://peps.python.org/pep-0744/.
- Python 3.13 release notes: https://docs.python.org/3/whatsnew/3.13.html.
- LWN article “Adding a JIT compiler to CPython” (https://lwn.net/Articles/977855/).
- Author’s homepage and lab page: https://fredrikbk.com/, https://kjolstad.io/.
§2 Technique / contribution
The system is split between build time (when you compile the language runtime) and runtime (when the JIT runs).
Build time:
- For every opcode in your bytecode, write a C function that implements its semantics. Use placeholders (“holes”) for compile-time-known values like jump targets, immediates, and continuation pointers.
- Compile each such function with Clang/LLVM at -O2 or -O3, producing an ELF/Mach-O object file with relocation entries.
- Run a Python script that parses the object file, extracts the body of each function as raw bytes (“stencil”), and records the relocation entries as “hole” positions in those bytes.
- Emit a giant C array of stencils, one entry per (opcode, signature) pair.
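To make the build-time half concrete, here is a minimal sketch of what one stencil source function might look like. The names HOLE_IMM, HOLE_CONTINUE, last_sp, and stencil_push_const are hypothetical and are not taken from the paper or from CPython. In a real stencil build the hole symbols would only be declared, never defined, so that Clang emits relocations against them; they get stand-in definitions here only so the sketch links and runs on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hole symbols. In a real stencil build these would be
 * `extern` declarations with no definition, so every use becomes a
 * relocation entry that the extractor records as a hole. */
char HOLE_IMM;                 /* its address stands in for a constant */
int64_t *last_sp;              /* demo-only: records where control went */

/* Demo-only stand-in for "the next stencil". */
void HOLE_CONTINUE(int64_t *sp) { last_sp = sp; }

/* One opcode body: push a constant onto the operand stack, then
 * continue into the next stencil through the continuation hole. */
void stencil_push_const(int64_t *sp) {
    assert(sp != NULL);
    sp[0] = (int64_t)(uintptr_t)&HOLE_IMM;  /* patched to the real constant */
    HOLE_CONTINUE(sp + 1);                  /* patched to the next stencil */
}
```

In the paper's design the final call is compiled as a tail call, so control flows from one stencil into the next without returning to a dispatcher.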
Runtime:
- Walk the input bytecode.
- For each op, look up the stencil, allocate executable memory, memcpy the stencil bytes into the buffer (the “copy” step).
- For each hole in the stencil, write the runtime-known value into the byte offset (the “patch” step). Patches handle absolute addresses, PC-relative offsets, and immediates uniformly.
- Mark the buffer executable and jump.
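The runtime steps above can be sketched end to end for a single one-hole stencil. This is an illustration under stated assumptions, not the paper's or CPython's code: run_mov_imm is a hypothetical name, the stencil bytes are a hand-written x86-64 `mov rax, imm64; ret`, and the sketch only executes on x86-64 POSIX systems that expose MAP_ANONYMOUS (elsewhere it returns -1 without running anything).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Copies a one-hole stencil into fresh memory, patches `imm` into the
 * hole, flips the page from writable to executable, and calls it.
 * Returns 0 and stores the result in *out, or -1 if the platform is
 * unsupported or a system call fails. */
int run_mov_imm(uint64_t imm, uint64_t *out) {
    assert(out != NULL);
#if defined(__x86_64__) && defined(MAP_ANONYMOUS)
    /* x86-64 `mov rax, imm64; ret`. The 8 zero bytes at offset 2 are the hole. */
    static const uint8_t stencil[] = {0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, 0xC3};
    const size_t hole_offset = 2;
    const size_t len = 4096;

    uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return -1;

    memcpy(buf, stencil, sizeof stencil);            /* the "copy" step */
    memcpy(buf + hole_offset, &imm, sizeof imm);     /* the "patch" step */

    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {  /* mark executable */
        munmap(buf, len);
        return -1;
    }
    uint64_t (*fn)(void) = (uint64_t (*)(void))(uintptr_t)buf;
    *out = fn();                                     /* jump */
    munmap(buf, len);
    return 0;
#else
    (void)imm;
    return -1;  /* this sketch only encodes x86-64 */
#endif
}
```

Mapping the buffer writable first and executable second keeps the page from ever being writable and executable at the same time, which some platforms require.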
Pseudo-code of the patcher:
    /* Stencil, Hole, Op, Operands, stencils[] and resolve() are assumed
     * to be defined elsewhere; this shows only the copy and patch steps. */
    void emit(Op op, Operands ops, uint8_t **pc_out) {
        const Stencil *s = stencils[op];
        uint8_t *dst = *pc_out;
        memcpy(dst, s->bytes, s->size);              /* the "copy" step */
        for (const Hole *h = s->holes; h != NULL; h = h->next) {
            uint8_t *site = dst + h->offset;         /* address of the hole */
            uint64_t value = resolve(h, ops, dst, pc_out);
            switch (h->kind) {
            case R_X86_64_64:                        /* absolute 64-bit value */
                memcpy(site, &value, sizeof value);
                break;
            case R_X86_64_PLT32:                     /* same as PC32 for our purposes */
            case R_X86_64_PC32: {                    /* relative to the end of the 4-byte hole */
                int32_t rel = (int32_t)(value - (uint64_t)(site + 4));
                memcpy(site, &rel, sizeof rel);
                break;
            }
            }
        }
        *pc_out += s->size;
    }

What is conspicuously absent: no IR, no SSA, no register allocator, no instruction selector, no peephole pass. The LLVM optimizer ran at build time once, on each stencil, in isolation.
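One detail the pseudo-code glosses over is that a PC32 hole holds only a signed 32-bit displacement, so the target must lie within roughly ±2 GiB of the patch site. A minimal sketch of a range-checked PC32 patch follows; patch_pc32 is a hypothetical helper name, and the `+ 4` assumes the usual x86-64 case where the hole is the last four bytes of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a 32-bit PC-relative displacement at `site` so that an
 * instruction ending right after the 4-byte hole reaches `target`.
 * Returns 0 on success, -1 if the displacement does not fit in 32 bits. */
int patch_pc32(uint8_t *site, uint64_t target) {
    assert(site != NULL);
    uint64_t next = (uint64_t)(uintptr_t)site + 4;  /* address after the hole */
    int64_t delta = (int64_t)(target - next);
    if (delta < INT32_MIN || delta > INT32_MAX) return -1;
    int32_t rel = (int32_t)delta;
    memcpy(site, &rel, sizeof rel);                 /* unaligned-safe store */
    return 0;
}
```

A patcher that hits the -1 case needs a fallback, for example routing the call through an absolute 64-bit hole instead.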
§3 Where it shines, where it fails
Shines:
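- Compile time is ~100x faster than LLVM -O0 (the paper reports about two orders of magnitude, and more against higher optimization levels), because runtime work is just memcpy plus a handful of stores.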
- Generated code is roughly 2x slower than LLVM -O2, but ~5x faster than a switch-dispatch interpreter.
- The patcher is tiny (~100 LOC C, ~1000 LOC Python for stencil extraction). All the complexity sits in LLVM, used once at build time.
- Stencils can be regenerated by re-running Clang, so quality scales with whatever LLVM does. Free improvements over time.
- Easy to cross-compile: just run Clang for the target triple.
Fails:
- Register allocation across stencils is nil. Each stencil clobbers a fixed set of registers per its build-time compile.
- Cannot constant-fold across opcodes.
- Stencil binaries are platform-specific; you ship one set of stencils per (OS, ISA, ABI) tuple.
- Bug class: ABI mismatches between Clang’s stencil output and the JIT runtime are notoriously hard to debug.
Compile-time profile: O(n_bytecodes) with very small constants. CPython measured ~10x faster than LLVM; separately, the JIT's generated code runs ~2-9% faster than the tier-2 micro-op interpreter (an execution-speed figure, not a compile-time one).
§4 Status (May 2026)
- CPython 3.13 (Oct 2024) ships an opt-in copy-and-patch JIT (--enable-experimental-jit). Brandt Bucher led the port. Performance: 2-9% over the tier-2 interpreter as of 3.13, with improvements expected in 3.14 and 3.15.
- Microsoft laid off most of the Faster CPython team in May 2025, but Bucher and several others continue the work.
- A follow-up academic line includes Bansal et al., “Lightweight and Locality-Aware Composition of Black-Box Subroutines” (PLDI 2025, https://dl.acm.org/doi/10.1145/3729292) which generalizes the stencil idea to library composition.
- Research uptake: the technique is taught in compiler courses and has spawned several thesis projects.
- Note: the CPython JIT still consists of only ~1000 lines of complex Python build-time tooling plus ~100 lines of C runtime, so the technique remains genuinely simple in production.
§5 Engineering cost for Mochi
A Mochi copy-and-patch implementation would consist of:
Build-time stencil extractor (Python or Go script, ~500-1500 LOC):
- One C function per vm3.Op, written by hand or generated from runtime/vm3/op.go.
- Compile each with Clang at -O2 -fno-asynchronous-unwind-tables -fno-jump-tables.
- Parse the resulting .o files using debug/elf or debug/macho from the Go stdlib, no Clang dependency at runtime.
- Emit a generated Go file with var stencils = [...]Stencil{...}.
Runtime patcher (Go, ~200 LOC):
- Allocate an mmap’d executable region (use golang.org/x/sys/unix + syscall.Mprotect).
- Implement copy() and patch() for x86-64 first (relocations R_X86_64_64, R_X86_64_PC32, R_X86_64_PLT32).
- Add arm64 (R_AARCH64_ADR_PREL_PG_HI21, R_AARCH64_ADD_ABS_LO12_NC, R_AARCH64_CALL26) as a phase 2.
Compiler3 driver (~300 LOC):
- Walk compiler3/ir/, emit one stencil per IR op.
- Resolve jump targets via a label-pass before patching.
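The label pass mentioned above can be as small as a prefix sum over stencil sizes, followed by a displacement computation per jump hole. The sketch below is one possible shape, written in C for consistency with the patcher pseudo-code; layout_ops and jump_rel32 are hypothetical names, and the `+ 4` again assumes an x86-64 rel32 hole at the end of the jump instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pass 1: prefix-sum the stencil sizes so every op knows its byte
 * offset in the code buffer. `offsets` must have room for n + 1
 * entries; offsets[n] receives the total code size. */
void layout_ops(const size_t *sizes, size_t n, size_t *offsets) {
    assert(sizes != NULL && offsets != NULL);
    size_t at = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = at;
        at += sizes[i];
    }
    offsets[n] = at;
}

/* Pass 2 helper: rel32 displacement for a jump hole located
 * `hole_off` bytes into op `from`, targeting the start of op `to`. */
int32_t jump_rel32(const size_t *offsets, size_t from, size_t hole_off, size_t to) {
    int64_t next = (int64_t)(offsets[from] + hole_off + 4);  /* end of the hole */
    return (int32_t)((int64_t)offsets[to] - next);
}
```

Because every offset is known before any byte is patched, forward and backward jumps are handled the same way and no fix-up list is needed.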
Total estimated effort: 6-10 weeks for x86-64 on macOS and Linux, end to end. arm64 adds another 2-3 weeks.
Notably, this fits “naive but correct” perfectly: we get LLVM-quality opcode bodies for free, with a runtime that is essentially memcpy + store.
§6 Mochi adaptation note
- runtime/vm3/op.go enumerates 100+ Mochi opcodes. Each gets a C stencil function.
- runtime/vm3/cell.go defines the 8-byte handle Cell. The C stencils manipulate Cell-shaped uint64_t values.
- runtime/vm3/arenas.go defines the 12 typed arenas. Stencil prologues load arena base pointers from fixed register slots (e.g., R13 = int-arena base, R14 = string-arena base).
- compiler3/emit/ becomes the runtime patcher. The build-time stencil set lives in a new compiler3/stencils/ package.
- The three-bank register file in runtime/vm3/frame.go maps to fixed callee-save registers in the stencil ABI. This is the load-bearing design decision.
The key Mochi simplification: we already have a typed bytecode. CPython's stencils have to handle dynamically typed values. Mochi can have a stencil_add_int_int distinct from stencil_add_int_float, eliminating runtime type tests.
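A rough sketch of what two such type-specialized stencil bodies could look like is given below. The Cell encoding used here is an assumption made purely for illustration (an int Cell holds the raw bits of an int64_t, a float Cell holds the raw bits of a double); the real layout in runtime/vm3/cell.go may differ, and the cell_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Cell;  /* ASSUMPTION: stand-in for the 8-byte handle in cell.go */

static_assert(sizeof(double) == sizeof(Cell), "sketch assumes 8-byte doubles");

/* Hypothetical encodings for this sketch only. */
static inline Cell cell_from_int(int64_t v)  { Cell c; memcpy(&c, &v, sizeof c); return c; }
static inline Cell cell_from_float(double v) { Cell c; memcpy(&c, &v, sizeof c); return c; }
static inline int64_t cell_to_int(Cell c)    { int64_t v; memcpy(&v, &c, sizeof v); return v; }
static inline double cell_to_float(Cell c)   { double v; memcpy(&v, &c, sizeof v); return v; }

/* Both operands are statically known to be ints: no tag check, one add.
 * Adding the raw unsigned bits gives two's-complement wraparound; whether
 * Mochi wants wrapping or trapping on overflow is not assumed here. */
Cell stencil_add_int_int(Cell a, Cell b) {
    return a + b;
}

/* Left operand int, right operand float: the int-to-float conversion
 * is baked into the stencil at build time instead of tested at runtime. */
Cell stencil_add_int_float(Cell a, Cell b) {
    return cell_from_float((double)cell_to_int(a) + cell_to_float(b));
}
```

The cost of this approach is stencil count: each operand-type combination is a separate entry in the stencil table.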
§7 Open questions for MEP-42
- Do we hand-write the C stencil functions, or auto-generate them from a DSL?
- How do we test that the build-time Clang and the runtime patcher agree on ABI? Differential testing against the vm3 interpreter is a must.
- Stencil size budget: at what point does code-cache pressure dominate?
- Multi-op fusion: can we ship pre-fused stencils for common pairs (e.g., load+add) the way Erlang BEAM ships super-instructions?
- Cross-platform stencils: do we ship one set of stencils per (OS, ISA) tuple, or compile them lazily on first use of each platform?
- Relocation kinds: the minimal set is 64, PC32, PLT32 for x86-64. Do we want to support the GOT relocations for shared-library calls from stencils?
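On the differential-testing question, the core loop is small. The sketch below assumes both engines can be wrapped behind one function-pointer signature; RunFn, first_divergence, and the two demo engines are hypothetical and exist only to show the shape of the check, not how vm3 or the patcher are actually invoked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by both execution engines. */
typedef int64_t (*RunFn)(const uint8_t *code, size_t len, int64_t arg);

/* Runs every input through both engines and returns the index of the
 * first input on which they disagree, or n if they agree everywhere. */
size_t first_divergence(RunFn interp, RunFn jit,
                        const uint8_t *code, size_t len,
                        const int64_t *inputs, size_t n) {
    assert(interp != NULL && jit != NULL);
    for (size_t i = 0; i < n; i++) {
        if (interp(code, len, inputs[i]) != jit(code, len, inputs[i])) {
            return i;
        }
    }
    return n;
}

/* Demo-only stand-ins: a reference engine and one with a planted bug. */
int64_t demo_interp(const uint8_t *code, size_t len, int64_t arg) {
    (void)code; (void)len;
    return arg * 2;
}
int64_t demo_jit_buggy(const uint8_t *code, size_t len, int64_t arg) {
    (void)code; (void)len;
    return arg == 3 ? 7 : arg * 2;  /* wrong answer for one input */
}
```

Returning the first diverging index rather than a boolean makes it easy to shrink a failing program down to the single input, and from there to the single stencil, that exposes the mismatch.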
§8 References
- Haoran Xu, Fredrik Kjolstad, “Copy-and-Patch Compilation,” OOPSLA 2021. PDF: https://fredrikbk.com/publications/copy-and-patch.pdf.
- PEP 744 (CPython JIT): https://peps.python.org/pep-0744/.
- Python 3.13 release notes: https://docs.python.org/3/whatsnew/3.13.html.
- LWN, “Adding a JIT compiler to CPython”: https://lwn.net/Articles/977855/.
- LWN, “Following up on the Python JIT”: https://lwn.net/Articles/1029307/.
- Bansal et al., “Lightweight and Locality-Aware Composition of Black-Box Subroutines,” PLDI 2025: https://dl.acm.org/doi/10.1145/3729292.