# Copy-and-Patch Compilation

## §1 Provenance

- Authors: Haoran Xu and Fredrik Kjolstad (Stanford University).
- Venue: OOPSLA 2021 (Proc. ACM Program. Lang. 5, OOPSLA), "Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode."
- Canonical PDF: https://fredrikbk.com/publications/copy-and-patch.pdf.
- ACM DOI: 10.1145/3485513.
- Production deployment: CPython 3.13 JIT (Brandt Bucher, Sept-Oct 2024).
  - PEP 744: https://peps.python.org/pep-0744/.
  - Python 3.13 release notes: https://docs.python.org/3/whatsnew/3.13.html.
  - LWN article "Adding a JIT compiler to CPython" (https://lwn.net/Articles/977855/).
- Author's homepage and lab page: https://fredrikbk.com/, https://kjolstad.io/.

## §2 Technique / contribution

The system is split between **build time** (when you compile the language runtime) and **runtime** (when the JIT runs).

**Build time:**
1. For every opcode in your bytecode, write a C function that implements its semantics. Use placeholders ("holes") for compile-time-known values like jump targets, immediates, and continuation pointers.
2. Compile each such function with Clang/LLVM at -O2 or -O3, producing an ELF/Mach-O object file with relocation entries.
3. Run an extraction script (Python, in CPython's case) that parses the object file, extracts the body of each function as raw bytes (the "stencil"), and records the relocation entries as "hole" positions in those bytes.
4. Emit a giant C array of stencils, one entry per (opcode, signature) pair.
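
Step 1 above can be sketched as follows. One common way to express a hole in C is to reference an external symbol that is declared but never defined, so the compiler has to leave a relocation at each use. The names `HOLE_IMM`, `HOLE_NEXT`, and `stencil_push_const` are hypothetical, and the stand-in definitions at the bottom exist only so the sketch links as a normal program; a real stencil build would omit them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hole symbols. A stencil build only declares them, so every
 * use below becomes a relocation entry in the object file. */
extern char HOLE_IMM;                   /* its address stands for an immediate */
extern int64_t HOLE_NEXT(int64_t *sp);  /* continuation into the next stencil */

/* Stencil body for a "push constant" op: store the immediate, then continue. */
int64_t stencil_push_const(int64_t *sp) {
    *sp = (int64_t)(intptr_t)&HOLE_IMM;
    return HOLE_NEXT(sp + 1);
}

/* Stand-in definitions so the sketch links and runs as an ordinary program.
 * A real stencil build would omit them and let the patcher fill the holes. */
char HOLE_IMM;
int64_t HOLE_NEXT(int64_t *sp) {
    assert(sp != NULL);
    return sp[-1];  /* pretend the next stencil just returns the top of stack */
}
```

Because the continuation is called in tail position, an optimizing compiler can usually turn it into a plain jump, which is what lets consecutive stencils fall through into each other after patching.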

**Runtime:**
1. Walk the input bytecode.
2. For each op, look up the stencil, reserve space in a writable code buffer, and `memcpy` the stencil bytes into it (the "copy" step).
3. For each hole in the stencil, write the runtime-known value into the byte offset (the "patch" step). Patches handle absolute addresses, PC-relative offsets, and immediates uniformly.
4. Mark the buffer executable and jump.

**Pseudo-code of the patcher:**

```c
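#include <stdint.h>
#include <string.h>

/* Copy one stencil to *pc_out, patch its holes, and advance *pc_out.
 * h->addend is the relocation addend (typically -4 for a rel32 operand). */
void emit(Op op, Operands ops, uint8_t **pc_out) {
    Stencil *s = stencils[op];
    uint8_t *dst = *pc_out;
    memcpy(dst, s->bytes, s->size);                 /* the "copy" step */
    for (Hole *h = s->holes; h != NULL; h = h->next) {
        uint8_t *place = dst + h->offset;           /* P: address being patched */
        uint64_t value = resolve(h, ops, dst, pc_out) + (uint64_t)h->addend;  /* S + A */
        switch (h->kind) {
        case R_X86_64_64:                           /* absolute 64-bit: S + A */
            memcpy(place, &value, sizeof value);    /* memcpy avoids unaligned stores */
            break;
        case R_X86_64_PC32:                         /* PC-relative: S + A - P */
        case R_X86_64_PLT32: {                      /* no PLT in JIT code, so same as PC32 */
            int32_t rel = (int32_t)(value - (uint64_t)(uintptr_t)place);
            memcpy(place, &rel, sizeof rel);
            break;
        }
        }
    }
    *pc_out += s->size;
}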
```
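
To make the patch arithmetic concrete, here is a minimal self-contained sketch. It uses one hand-written stencil, the x86-64 encoding of `mov rax, imm64; ret`, with a single absolute 64-bit hole. The type and function names (`HoleKind`, `Hole`, `patch_hole`, `emit_load_const`) are hypothetical and only loosely mirror the pseudo-code above; the bytes are inspected rather than executed, so the sketch runs on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { HOLE_ABS64, HOLE_REL32 } HoleKind;

typedef struct {
    size_t   offset;  /* byte offset of the hole inside the stencil */
    HoleKind kind;
    int64_t  addend;  /* relocation addend, e.g. -4 for a call rel32 */
} Hole;

/* Write `value` into one hole of the stencil copy that starts at `dst`. */
static void patch_hole(uint8_t *dst, const Hole *h, uint64_t value) {
    uint8_t *place = dst + h->offset;
    uint64_t sa = value + (uint64_t)h->addend;            /* S + A */
    if (h->kind == HOLE_ABS64) {
        memcpy(place, &sa, sizeof sa);
    } else {
        int64_t d = (int64_t)(sa - (uint64_t)(uintptr_t)place);  /* S + A - P */
        assert(d >= INT32_MIN && d <= INT32_MAX);         /* must fit in rel32 */
        int32_t d32 = (int32_t)d;
        memcpy(place, &d32, sizeof d32);
    }
}

/* x86-64 encoding of `mov rax, imm64; ret` with the immediate zeroed. */
static const uint8_t STENCIL_LOAD_CONST[] = {
    0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, 0xC3
};
static const Hole STENCIL_LOAD_CONST_HOLE = { 2, HOLE_ABS64, 0 };

/* Copy the stencil to `dst`, patch in `imm`, and return the bytes emitted. */
static size_t emit_load_const(uint8_t *dst, uint64_t imm) {
    memcpy(dst, STENCIL_LOAD_CONST, sizeof STENCIL_LOAD_CONST);
    patch_hole(dst, &STENCIL_LOAD_CONST_HOLE, imm);
    return sizeof STENCIL_LOAD_CONST;
}
```

The range assertion on the 32-bit case matters in practice: a PC-relative hole can only reach targets within about 2 GiB of the code buffer, so where the buffer is mapped constrains which relocation kinds are usable.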

What is conspicuously **absent** at runtime: no IR, no SSA, no register allocator, no instruction selector, no peephole pass. The LLVM optimizer ran once, at build time, on each stencil in isolation.

## §3 Where it shines, where it fails

**Shines:**
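- Compile time is about 100x faster than LLVM -O0 (and about 1000x faster than higher optimization levels) on the paper's TPC-H query benchmarks, because runtime work is just `memcpy` plus a handful of stores.
- Generated code quality is respectable rather than optimal: on the same benchmarks the paper reports code about 14% faster than LLVM -O0 and an order of magnitude faster than interpretation. This note's working estimate is that it trails LLVM -O2 by roughly 2x.
- The patcher is tiny. In CPython it is roughly 100 LOC of C, plus about 1000 LOC of Python for stencil extraction. All the complexity sits in LLVM, which is used once at build time.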
- Stencils can be regenerated by re-running Clang, so quality scales with whatever LLVM does. Free improvements over time.
- Easy to cross-compile: just run Clang for the target triple.

**Fails:**
- Register allocation across stencils is nil. Each stencil clobbers a fixed set of registers per its build-time compile.
- Cannot constant-fold across opcodes.
- Stencil binaries are platform-specific; you ship one set of stencils per (OS, ISA, ABI) tuple.
- Bug class: ABI mismatches between Clang's stencil output and the JIT runtime are notoriously hard to debug.

Compile-time profile: O(n) in the number of bytecodes, with very small constants. For CPython, the figures quoted here are code generation roughly 10x faster than LLVM, and generated code roughly 2-9% faster than the tier-2 micro-op interpreter.

## §4 Status (May 2026)

- **CPython 3.13 (Oct 2024)** ships an opt-in copy-and-patch JIT (`--enable-experimental-jit`). Brandt Bucher led the port. Performance: 2-9% over the tier-2 interpreter as of 3.13, with improvements expected in 3.14 and 3.15.
- Microsoft laid off most of the Faster CPython team in May 2025, but Bucher and several others continue the work.
- A follow-up academic line includes Bansal et al., "Lightweight and Locality-Aware Composition of Black-Box Subroutines" (PLDI 2025, https://dl.acm.org/doi/10.1145/3729292), which generalizes the stencil idea to library composition.
- Research uptake: the technique is taught in compiler courses and has spawned several thesis projects.
- Footprint: the CPython JIT still consists of only ~1000 lines of complex Python build-time tooling plus ~100 lines of C runtime, so the technique remains genuinely simple in production.

## §5 Engineering cost for Mochi

A Mochi copy-and-patch implementation would consist of:

1. **Build-time stencil extractor** (Python or Go script, ~500-1500 LOC):
   - One C function per `vm3.Op`, written by hand or generated from `runtime/vm3/op.go`.
   - Compile each with Clang at `-O2 -fno-asynchronous-unwind-tables -fno-jump-tables`.
   - Parse the resulting `.o` files using `debug/elf` or `debug/macho` from the Go stdlib, so there is no Clang dependency at runtime.
   - Emit a generated Go file with `var stencils = [...]Stencil{...}`.

2. **Runtime patcher** (Go, ~200 LOC):
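   - Allocate an mmap'd code region with `golang.org/x/sys/unix` (`unix.Mmap`, then `unix.Mprotect` to flip it from writable to executable).
   - Implement `copy()` and `patch()` for x86-64 first. On Linux (ELF) the relocations are R_X86_64_64, R_X86_64_PC32, and R_X86_64_PLT32. macOS objects are Mach-O and use their own relocation kinds (e.g., X86_64_RELOC_UNSIGNED, X86_64_RELOC_BRANCH), so the extractor has to normalize both into one hole representation.
   - Add arm64 (R_AARCH64_ADR_PREL_PG_HI21, R_AARCH64_ADD_ABS_LO12_NC, R_AARCH64_CALL26 on ELF) as a phase 2.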

3. **Compiler3 driver** (~300 LOC):
   - Walk `compiler3/ir/`, emit one stencil per IR op.
   - Resolve jump targets via a label-pass before patching.
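
The label pass in the driver can be sketched as a two-pass layout: first assign every op its byte offset from the stencil sizes, then look up the offset of each jump's target so the patcher has a concrete value for the hole. This is a minimal sketch in C to match the stencil code; `IrOp`, `layout_ops`, and `jump_target_offset` are hypothetical names, not part of `compiler3`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t stencil_size;  /* bytes this op's stencil occupies */
    int    jump_target;   /* index of the target op, or -1 for no jump */
} IrOp;

/* Pass 1: offsets[i] is where op i starts; offsets[n] is the total size.
 * The caller must provide room for n + 1 entries. */
static size_t layout_ops(const IrOp *ops, size_t n, size_t *offsets) {
    size_t at = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = at;
        at += ops[i].stencil_size;
    }
    offsets[n] = at;
    return at;
}

/* Pass 2 helper: buffer-relative offset that op i's jump should land on.
 * A target index of n means "just past the last op". */
static size_t jump_target_offset(const IrOp *ops, size_t n,
                                 const size_t *offsets, size_t i) {
    assert(i < n);
    assert(ops[i].jump_target >= 0 && (size_t)ops[i].jump_target <= n);
    return offsets[ops[i].jump_target];
}
```

Running the layout before any bytes are copied means forward jumps need no back-patching: every target offset is already known when the jump's stencil is emitted.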

Total estimated effort: 6-10 weeks for x86-64 on macOS and Linux, end to end. arm64 adds another 2-3 weeks.

Notably, this fits "naive but correct" perfectly: we get LLVM-quality opcode bodies for free, with a runtime that is essentially `memcpy + store`.

## §6 Mochi adaptation note

- `runtime/vm3/op.go` enumerates 100+ Mochi opcodes. Each gets a C stencil function.
- `runtime/vm3/cell.go` defines the 8-byte handle Cell. The C stencils manipulate Cell-shaped `uint64_t` values.
- `runtime/vm3/arenas.go` defines the 12 typed arenas. Stencil prologues load arena base pointers from fixed register slots (e.g., R13 = int-arena base, R14 = string-arena base).
- `compiler3/emit/` becomes the runtime patcher. The build-time stencil set lives in a new `compiler3/stencils/` package.
- The three-bank register file in `runtime/vm3/frame.go` maps to fixed callee-save registers in the stencil ABI. This is the load-bearing design decision.

The key Mochi simplification: we already have a typed bytecode. CPython's stencils operate on dynamically typed objects, so its specialized micro-ops still need runtime type guards. Mochi can have a `stencil_add_int_int` distinct from `stencil_add_int_float`, eliminating runtime type tests.
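
One way to picture the typed selection is a table indexed by opcode and static operand types, consulted at compile time so that no type test reaches the generated code. The sketch below is illustrative only: `StencilRef` is a stand-in for the real bytes-plus-holes record, and every name other than `stencil_add_int_int` and `stencil_add_int_float` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TY_INT, TY_FLOAT, TY_COUNT } OperandType;
typedef enum { OP_ADD, OP_MUL, OP_COUNT } Opcode;

/* Stand-in for a real stencil record (bytes, size, holes). */
typedef struct { const char *name; } StencilRef;

/* Indexed as [opcode][left type][right type]; missing entries stay zeroed. */
static const StencilRef TYPED_STENCILS[OP_COUNT][TY_COUNT][TY_COUNT] = {
    [OP_ADD] = {
        [TY_INT]   = { [TY_INT]   = {"stencil_add_int_int"},
                       [TY_FLOAT] = {"stencil_add_int_float"} },
        [TY_FLOAT] = { [TY_INT]   = {"stencil_add_float_int"},
                       [TY_FLOAT] = {"stencil_add_float_float"} },
    },
};

/* Pick the stencil for a statically typed op, or NULL if none was built. */
static const StencilRef *select_stencil(Opcode op, OperandType a, OperandType b) {
    assert(op < OP_COUNT && a < TY_COUNT && b < TY_COUNT);
    const StencilRef *s = &TYPED_STENCILS[op][a][b];
    return s->name != NULL ? s : NULL;
}
```

A NULL result is a natural place to fall back to the vm3 interpreter for ops that do not have a stencil yet.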

## §7 Open questions for MEP-42

- Do we hand-write the C stencil functions, or auto-generate them from a DSL?
- How do we test that the build-time Clang and the runtime patcher agree on ABI? Differential testing against the vm3 interpreter is a must.
- Stencil size budget: at what point does code-cache pressure dominate?
- Multi-op fusion: can we ship pre-fused stencils for common pairs (e.g., load+add) the way Erlang BEAM ships super-instructions?
- Cross-platform stencils: do we ship one set of stencils per (OS, ISA) tuple, or compile them lazily on first use of each platform?
- Relocation kinds: the minimal set is `64`, `PC32`, `PLT32` for x86-64. Do we want to support the GOT relocations for shared-library calls from stencils?
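
The multi-op fusion question can be made concrete with a small pre-pass that rewrites fusable adjacent pairs into a single fused opcode before stencils are selected. This is a sketch under simplifying assumptions: the opcode names are hypothetical, only one pair (load followed by add) is fused, and it ignores the rule that a pair must not be fused when the second op is a jump target.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { M_LOAD, M_ADD, M_STORE, M_LOAD_ADD } MiniOp;

/* Rewrite `ops` in place, replacing each LOAD that is immediately followed
 * by ADD with the fused LOAD_ADD. Returns the new number of ops. */
static size_t fuse_pairs(MiniOp *ops, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == M_LOAD && i + 1 < n && ops[i + 1] == M_ADD) {
            ops[out++] = M_LOAD_ADD;
            i++;  /* skip the ADD that was folded into the fused op */
        } else {
            ops[out++] = ops[i];
        }
    }
    assert(out <= n);
    return out;
}
```

Each fused opcode costs one more stencil per platform, so the set of pairs worth shipping ties directly into the stencil size budget question.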

## §8 References

- Haoran Xu, Fredrik Kjolstad, "Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode," OOPSLA 2021. DOI: 10.1145/3485513. PDF: https://fredrikbk.com/publications/copy-and-patch.pdf.
- PEP 744 (CPython JIT): https://peps.python.org/pep-0744/.
- Python 3.13 release notes: https://docs.python.org/3/whatsnew/3.13.html.
- LWN, "Adding a JIT compiler to CPython": https://lwn.net/Articles/977855/.
- LWN, "Following up on the Python JIT": https://lwn.net/Articles/1029307/.
- Bansal et al., "Lightweight and Locality-Aware Composition of Black-Box Subroutines," PLDI 2025: https://dl.acm.org/doi/10.1145/3729292.
