The Per-Opcode Template JIT Pattern

§1 Provenance

HotSpot Template Interpreter (Sun Microsystems, ~2002+): the urtext of this pattern. See "The Java HotSpot Performance Engine Architecture" (https://www.oracle.com/java/technologies/whitepaper.html).
Erlang BEAM (Ericsson, 1998+): the JIT in OTP 24+ (BeamAsm, https://www.erlang.org/blog/a-first-look-at-the-jit/) is template-style, written in C++ over asmjit.
LuaJIT 2.x interpreter (Mike Pall): the interpreter is hand-written in assembly (via DynASM), with one template per op. The JIT is trace-based, but the interpreter design is template-style.
Sparkplug (V8, 2021), Liftoff (V8, 2018), JSC Baseline (Apple, ~2008): see neighbor docs 01, 02, 05.
General writeup: M. Anton Ertl and David Gregg, "The Structure and Performance of Efficient Interpreters", JILP 5 (2003), https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.pdf.

§2 Technique / contribution

The pattern has these load-bearing elements:

Fixed register convention. Choose ~3-6 callee-save registers as "VM registers": typically PC, frame-pointer, accumulator, scratch1, scratch2. Caller-save registers are free for templates to clobber within a single op.
One emit function per opcode. Each function takes the current EmitContext and the decoded operands, and emits a short native sequence (typically 5-30 bytes per op).
Slow-path stubs. When an op needs heavyweight semantics (allocation, type miss, GC barrier), the template emits a single call slow_path_stub to a pre-compiled function. The stub follows the same VM-register convention: it preserves the VM registers and is free to clobber only the caller-save scratch registers.
Inline cache slots (optional). A patch site is a few NOPs that get rewritten on first execution with a fast-path check + jump. JSC and V8 have historically used this heavily; copy-and-patch and Liftoff do not.
Per-arch backend. The emit functions are ISA-specific. Code-quality work is per-arch; the rest of the framework is shared.
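
The "one emit function per opcode" element can be sketched as a table of templates plus a single pass over the bytecode. This is a minimal illustration, not Mochi's design: the three opcodes, the Instr and EmitContext types, and the compile function are all hypothetical names, and the only encodings used are nop (0x90), ret (0xC3), and mov eax, imm32 (0xB8 followed by four little-endian bytes).

```go
package main

import "fmt"

// Op is a hypothetical bytecode opcode; a real set would come from the VM.
type Op byte

const (
	OpNop Op = iota
	OpLoadImm
	OpRet
	numOps
)

// Instr is a decoded bytecode instruction (assumed shape).
type Instr struct {
	Op      Op
	Operand int
}

// EmitContext is an assumed minimal context: just the output buffer.
type EmitContext struct {
	code []byte
}

func (c *EmitContext) emitByte(b byte) { c.code = append(c.code, b) }

// emitUint32 appends a little-endian 32-bit immediate.
func (c *EmitContext) emitUint32(v uint32) {
	c.code = append(c.code, byte(v), byte(v>>8), byte(v>>16), byte(v>>24))
}

// emitFn is the per-opcode template signature.
type emitFn func(c *EmitContext, operand int)

// templates holds exactly one emit function per opcode.
var templates = [numOps]emitFn{
	OpNop: func(c *EmitContext, _ int) { c.emitByte(0x90) }, // nop
	OpLoadImm: func(c *EmitContext, imm int) { // mov eax, imm32
		c.emitByte(0xB8)
		c.emitUint32(uint32(imm))
	},
	OpRet: func(c *EmitContext, _ int) { c.emitByte(0xC3) }, // ret
}

// compile walks the bytecode once and runs one template per instruction.
// No state is carried between ops, which is why cross-op optimization is absent.
func compile(prog []Instr) ([]byte, error) {
	c := &EmitContext{}
	for _, in := range prog {
		if in.Op >= numOps || templates[in.Op] == nil {
			return nil, fmt.Errorf("no template for opcode %d", in.Op)
		}
		templates[in.Op](c, in.Operand)
	}
	return c.code, nil
}
```

Calling compile on the sequence OpLoadImm 42, OpNop, OpRet yields the bytes B8 2A 00 00 00 90 C3. A per-arch backend would swap the templates table while keeping compile unchanged.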

Pure-Go implementation outline (no cgo):

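// Label identifies a jump target (assumed here to be an integer ID).
type Label int

type Emitter struct {
    buf    []byte // mmap'd RWX region
    pos    int
    labels map[Label]int
}

// emitLoad emits a 64-bit load: mov dstReg, [arenaReg + slot*8].
// The emitREX, emitByte, emitModRM, and emitInt32 helpers are assumed to
// append to buf at pos; emitREX takes the W bit, then the reg and r/m registers.
func (e *Emitter) emitLoad(arenaReg, dstReg, slot int) {
    e.emitREX(1, dstReg, arenaReg)          // W=1: 64-bit operand (a Cell is a uint64)
    e.emitByte(0x8B)                        // MOV r64, r/m64
    e.emitModRM(0x80, dstReg&7, arenaReg&7) // mod=10: [base + disp32]
    if arenaReg&7 == 4 {
        e.emitByte(0x24) // RSP or R12 as base requires a SIB byte (no index)
    }
    e.emitInt32(int32(slot * 8))
}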

The full set of x86-64 instruction encodings is ~3,000 LOC of pure Go. ARM64 is similar: instructions are a fixed 32 bits, which simplifies encoding, but immediates are constrained (bitmask-encoded logical immediates, 12-bit arithmetic immediates, MOVZ/MOVK sequences for wide constants).
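
To make the byte-level encoding concrete, here is a self-contained sketch of the load template with the helper encoders filled in. The helper names and signatures are assumptions chosen to mirror the outline above, and the Emitter appends to a plain slice rather than a mapped region. The sketch sets REX.W for a 64-bit load and emits a SIB byte when the base register is RSP or R12, because r/m=100 in ModRM selects SIB addressing.

```go
package main

// Hardware register numbers used by the x86-64 encoding (RAX=0 ... R15=15).
const (
	RAX = 0
	RBX = 3
	R12 = 12
	R13 = 13
)

// Emitter appends to a plain slice here; a real one would write into the
// mapped code region instead.
type Emitter struct {
	buf []byte
}

func (e *Emitter) emitByte(b byte) { e.buf = append(e.buf, b) }

// emitREX emits the prefix 0100WRXB. R and B carry bit 3 of the reg and
// r/m register numbers; X (index extension) is unused in this sketch.
func (e *Emitter) emitREX(w, reg, rm int) {
	e.emitByte(byte(0x40 | w<<3 | (reg>>3)<<2 | rm>>3))
}

// emitModRM packs the mod field (already shifted, e.g. 0x80), reg, and rm.
func (e *Emitter) emitModRM(mod, reg, rm int) {
	e.emitByte(byte(mod | reg<<3 | rm))
}

// emitInt32 appends a little-endian 32-bit displacement.
func (e *Emitter) emitInt32(v int32) {
	u := uint32(v)
	e.buf = append(e.buf, byte(u), byte(u>>8), byte(u>>16), byte(u>>24))
}

// emitLoad emits: mov dst, [base + slot*8], always using the disp32 form.
func (e *Emitter) emitLoad(base, dst, slot int) {
	e.emitREX(1, dst, base) // W=1 selects the 64-bit operand size
	e.emitByte(0x8B)        // MOV r64, r/m64
	e.emitModRM(0x80, dst&7, base&7)
	if base&7 == 4 {
		e.emitByte(0x24) // SIB: scale=1, no index, base taken from rm
	}
	e.emitInt32(int32(slot * 8))
}
```

With these helpers, emitLoad(R12, RBX, 2) produces 49 8B 9C 24 10 00 00 00, which is mov rbx, [r12+0x10]. A tuned encoder would pick the shorter disp8 form for small slots; the fixed disp32 form keeps every template the same length, which is convenient if sites are patched later.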

For mmap and mprotect:

import "golang.org/x/sys/unix"

func allocExec(size int) ([]byte, error) {
    return unix.Mmap(-1, 0, size,
        unix.PROT_READ|unix.PROT_WRITE|unix.PROT_EXEC,
        unix.MAP_PRIVATE|unix.MAP_ANON)
}

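On Apple Silicon (arm64 macOS) we must map with MAP_JIT and call pthread_jit_write_protect_np() to flip the calling thread between writable and executable access. On Linux a single RWX mapping as above works, but any sane W^X discipline (map RW, emit, then mprotect to RX) is preferable.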
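
A W^X flow on Linux might look like the sketch below. It is an illustration under assumptions: it uses the standard library's syscall package instead of golang.org/x/sys/unix so that it has no dependencies (unix.Mmap and unix.Mprotect take the same arguments), the function names are hypothetical, and it does not cover MAP_JIT on macOS or the instruction-cache flush that arm64 needs before executing fresh code.

```go
package main

import (
	"errors"
	"syscall"
)

// pageRound rounds n up to a whole number of pages.
func pageRound(n int) int {
	p := syscall.Getpagesize()
	return (n + p - 1) / p * p
}

// allocWritable maps an anonymous region that is readable and writable
// but not executable.
func allocWritable(size int) ([]byte, error) {
	return syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
}

// seal drops write permission and adds execute permission.
func seal(region []byte) error {
	return syscall.Mprotect(region, syscall.PROT_READ|syscall.PROT_EXEC)
}

// unseal makes the region writable again (for patching) and non-executable.
func unseal(region []byte) error {
	return syscall.Mprotect(region, syscall.PROT_READ|syscall.PROT_WRITE)
}

// release unmaps the region.
func release(region []byte) error { return syscall.Munmap(region) }

// install copies finished machine code into a fresh mapping and seals it.
// At no point is the mapping writable and executable at the same time.
func install(machineCode []byte) ([]byte, error) {
	if len(machineCode) == 0 {
		return nil, errors.New("empty code")
	}
	region, err := allocWritable(pageRound(len(machineCode)))
	if err != nil {
		return nil, err
	}
	copy(region, machineCode)
	if err := seal(region); err != nil {
		release(region)
		return nil, err
	}
	return region, nil
}
```

The sealed region stays readable, so it can still be inspected or disassembled; writing to it faults until unseal is called. Any later patching (for example inline caches) has to bracket its writes with unseal and seal.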

§3 Where it shines, where it fails

Shines:

Tiny runtime footprint: emitter + handler set fits in ~10K LOC for one ISA.
Compile speed: ~10-50 MB/s of machine code.
Each op template can be tuned by hand for a hot path.
Pure Go implementation needs no toolchain at runtime.
Predictable: no LLVM black-box performance cliffs.

Fails:

Cross-op optimization is zero (per-op only).
Hand-written templates rot when the ISA grows new addressing modes or instructions.
IC management is genuinely hard to get right (atomic patches, instruction cache flush, concurrent execution).
Generated code is 2-5x slower than an optimizing backend.

§4 Status (May 2026)

BEAM's BeamAsm is a recent production deployment built on asmjit: x86-64 support shipped in OTP 24 (2021) and arm64 support followed in OTP 25 (2022). It is enabled by default in Erlang/Elixir releases on supported platforms.
Sparkplug, JSC Baseline, Liftoff, SpiderMonkey's Baseline compiler (sm-base), and Wasmtime's Winch are all production template JITs.
Pure-Go template JITs are rarer. Notable: github.com/twitchyliquid64/golang-asm (a fork of the Go toolchain's internal assembler, cmd/internal/obj) and github.com/modern-go/gls (which provides goroutine-local storage rather than code generation). Neither is a full Mochi-ready toolkit.
Titzer, "Whose Baseline Compiler Is It Anyway?" (CGO 2024), is the current state-of-the-art analytical comparison.

§5 Engineering cost for Mochi

A pure-Go, no-cgo template JIT for Mochi:

2 weeks: pick or fork a Go x86-64 assembler library. The Go toolchain's cmd/internal/obj/x86 is BSD-licensed, so licensing is not the obstacle, but as an internal package it cannot be imported from outside the Go tree; we likely need a from-scratch encoder or a fork such as golang-asm.
3 weeks: per-op template emit functions for the ~100 Mochi ops, x86-64.
1 week: executable-memory plumbing: mmap/mprotect on Linux and macOS, VirtualAlloc/VirtualProtect on Windows.
1 week: macOS arm64 JIT-write-protect ergonomics.
2 weeks: slow-path stub library (reuse vm3 op handlers via Go function pointers).
2 weeks: smoke tests against compiler3/corpus/.

Total: ~11 weeks for an x86-64 template JIT. arm64 adds ~6 weeks (encoder + per-op).

Inline caches add another 4-6 weeks if we want them. For MEP-42 phase 1, skip ICs.

§6 Mochi adaptation note

runtime/vm3/op.go: each Op needs an emit function.
runtime/vm3/cell.go: Cell is a uint64 that the templates load and store.
runtime/vm3/frame.go: the three-bank register file dictates which physical registers we reserve. Suggested mapping (x86-64 System V):
- R12 = int arena base
- R13 = float arena base
- R14 = pointer arena base
- R15 = frame pointer
- RBX = current Cell accumulator
runtime/vm3/arenas.go: the function prologue loads the arena bases into these registers.
compiler3/emit/ is a natural home for the emit functions; we add a sibling compiler3/jit/ for the runtime.
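
Because all five suggested registers are callee-saved under System V, JIT'd functions have to save them on entry and restore them on exit. The sketch below shows only that save/restore part, with hypothetical function names; the arena base loads themselves are omitted because their source (for example a context pointer argument) is not specified here. The encodings are push r64 (0x50+reg) and pop r64 (0x58+reg), each with a 0x41 REX.B prefix for R8-R15.

```go
package main

// Hardware register numbers for the reserved VM registers.
const (
	RBX = 3
	R12 = 12
	R13 = 13
	R14 = 14
	R15 = 15
)

// reserved lists the VM registers in the order the prologue saves them.
var reserved = []int{RBX, R12, R13, R14, R15}

// emitPush appends "push reg"; registers R8-R15 need the REX.B prefix.
func emitPush(buf []byte, reg int) []byte {
	if reg >= 8 {
		buf = append(buf, 0x41)
	}
	return append(buf, byte(0x50+reg&7))
}

// emitPop appends "pop reg"; registers R8-R15 need the REX.B prefix.
func emitPop(buf []byte, reg int) []byte {
	if reg >= 8 {
		buf = append(buf, 0x41)
	}
	return append(buf, byte(0x58+reg&7))
}

// emitPrologue saves every reserved register.
func emitPrologue(buf []byte) []byte {
	for _, r := range reserved {
		buf = emitPush(buf, r)
	}
	return buf
}

// emitEpilogue restores the reserved registers in reverse order and returns.
func emitEpilogue(buf []byte) []byte {
	for i := len(reserved) - 1; i >= 0; i-- {
		buf = emitPop(buf, reserved[i])
	}
	return append(buf, 0xC3) // ret
}
```

Five 8-byte pushes on top of the return address leave RSP 16-byte aligned, so slow-path calls made from the body do not need an extra alignment adjustment as long as the templates keep the stack balanced.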

Pure-Go-no-cgo is a major Mochi constraint. It means we cannot rely on LLVM at runtime. Template JIT is the natural fit; copy-and-patch needs Clang at build time, which is acceptable.

§7 Open questions for MEP-42

Do we fork golang-asm or write from scratch?
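Slow-path stubs as Go function calls or as pre-compiled native stubs? JIT'd code cannot call Go functions directly: Go's internal register ABI is unstable and reserves registers (on amd64, R14 holds the current goroutine, which the trampoline must restore given the R14 mapping suggested in §6), so every call needs an asm trampoline.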
Goroutine-safe codegen: the Mochi runtime is goroutine-heavy. Code cache writes must be protected.
macOS arm64 JIT entitlement: under the hardened runtime, MAP_JIT requires the com.apple.security.cs.allow-jit entitlement in the signed binary's entitlement plist. Do we ship pre-signed binaries or document a workaround?
Code cache size limit: how do we cap memory growth?
Tier-up trigger: function call count threshold?
Per-op vs super-op templates: pre-fuse common pairs like load+add+store?
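
Three of these questions (goroutine-safe codegen, the code cache cap, and the tier-up trigger) can be explored with very little code. The sketch below is one possible shape, not a proposal: tierUpThreshold, FuncState, and CodeCache are hypothetical names, the threshold value is arbitrary, and the cache stores plain byte slices rather than executable mappings.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// tierUpThreshold is an arbitrary illustrative call count.
const tierUpThreshold = 1000

// FuncState carries the per-function hotness counter.
type FuncState struct {
	calls uint32
}

// RecordCall bumps the counter and reports true exactly once, on the call
// that reaches the threshold, so only one goroutine triggers compilation.
func (f *FuncState) RecordCall() bool {
	return atomic.AddUint32(&f.calls, 1) == tierUpThreshold
}

// ErrCacheFull is returned when an install would exceed the byte limit.
var ErrCacheFull = errors.New("code cache full")

// CodeCache is a mutex-protected store with a hard byte cap.
type CodeCache struct {
	mu    sync.Mutex
	limit int
	used  int
	code  map[string][]byte
}

func NewCodeCache(limit int) *CodeCache {
	return &CodeCache{limit: limit, code: make(map[string][]byte)}
}

// Install stores a private copy of code under name. Re-installing an
// existing name is a no-op, so racing compilers are harmless.
func (c *CodeCache) Install(name string, code []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.code[name]; ok {
		return nil
	}
	if c.used+len(code) > c.limit {
		return ErrCacheFull
	}
	c.code[name] = append([]byte(nil), code...)
	c.used += len(code)
	return nil
}

// Lookup returns the installed code for name, if any.
func (c *CodeCache) Lookup(name string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	b, ok := c.code[name]
	return b, ok
}

// Used reports the number of bytes currently installed.
func (c *CodeCache) Used() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.used
}
```

When Install returns ErrCacheFull the function simply keeps running in the interpreter, which caps memory growth without needing eviction. A production design would likely add eviction or per-function size accounting, and would take the lock only around the final install rather than around compilation.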

§8 References

HotSpot Template Interpreter: https://www.oracle.com/java/technologies/whitepaper.html.
BeamAsm (Erlang JIT): https://www.erlang.org/blog/a-first-look-at-the-jit/.
LuaJIT 2.x source: https://github.com/LuaJIT/LuaJIT.
M. Anton Ertl and David Gregg, "The Structure and Performance of Efficient Interpreters", JILP 5 (2003): https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.pdf.
Apple, "Porting Just-In-Time Compilers to Apple Silicon": https://developer.apple.com/documentation/apple-silicon/porting-just-in-time-compilers-to-apple-silicon.