VM
The VM is the eval loop. It is a single function,
_PyEval_EvalFrameDefault, that pulls one code unit from the
current frame's bytecode, dispatches on the opcode, executes the
case, and advances. The function is the longest in CPython and
also the one most aggressively shaped for the C compiler: the
dispatch uses computed gotos where the compiler allows; the
per-opcode bodies are generated from a DSL that lets a single source
of truth drive the Tier-1 eval loop, the Tier-2 uop interpreter,
the specializer's metadata, and the JIT.
Where the code lives
| File | Role |
|---|---|
| Python/ceval.c | The eval loop. _PyEval_EvalFrameDefault, helper functions, the eval breaker. |
| Python/ceval_macros.h | DISPATCH, NEXTOPARG, TARGET, PREDICT. The dispatch core. |
| Python/bytecodes.c | The DSL source. One C-flavoured definition per opcode and per micro-op. |
| Python/generated_cases.c.h (generated) | The Tier-1 case bodies emitted by Tools/cases_generator/tier1_generator.py. Included by ceval.c. |
| Python/executor_cases.c.h (generated) | The Tier-2 uop case bodies. Included by optimizer.c. |
| Python/opcode_targets.h (generated) | The opcode-to-label table for computed-goto dispatch. |
| Include/internal/pycore_opcode_metadata.h (generated) | Per-opcode metadata: cache size, family, stack effect, flags. |
| Tools/cases_generator/ | The DSL generator. Python scripts that produce the .c.h files. |
The eval loop
/* Python/ceval.c:1145 _PyEval_EvalFrameDefault */
PyObject *
_PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame,
                         int throwflag);
The signature names the three inputs: the current thread state (holds the GIL, the eval breaker, the recursion limit), the frame to execute (holds the bytecode, the value stack, the locals), and a flag indicating whether to enter at the top or rethrow a pending exception (used to resume a generator that was thrown into).
The body is a single loop:
/* Python/ceval.c (sketch) */
    DISPATCH_GOTO();
TARGET(LOAD_FAST):
    /* body */
    DISPATCH();
TARGET(LOAD_CONST):
    /* body */
    DISPATCH();
/* ... 250 more cases ... */
TARGET(NAME) is a label that the dispatch macro jumps to.
DISPATCH() advances next_instr past the cache slots for the
current opcode, reads the next code unit, and jumps to the
matching TARGET.
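The fetch, execute, skip-the-cache cycle can be illustrated with a toy loop. This is a minimal sketch, not CPython's code: the opcodes, the toy_cache_entries table, and the integer value stack are all invented for illustration, and a plain switch stands in for the TARGET/DISPATCH macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes; these are not CPython's opcode numbers. */
enum { TOY_PUSH = 0, TOY_ADD = 1, TOY_HALT = 2 };

/* Number of cache slots that follow each opcode (toy values). */
static const uint8_t toy_cache_entries[] = { 0, 1, 0 };

typedef struct { uint8_t code; uint8_t arg; } toy_unit;

/* Run a toy program and return the value on top of the stack at HALT. */
static int toy_run(const toy_unit *code)
{
    int stack[16];
    int *sp = stack;
    const toy_unit *next_instr = code;

    for (;;) {
        /* Fetch one code unit and step past it. */
        uint8_t opcode = next_instr->code;
        uint8_t oparg = next_instr->arg;
        next_instr++;

        switch (opcode) {
        case TOY_PUSH:
            assert(sp < stack + 16);
            *sp++ = oparg;
            break;
        case TOY_ADD:
            assert(sp - stack >= 2);
            sp[-2] = sp[-2] + sp[-1];
            sp--;
            break;
        case TOY_HALT:
            assert(sp > stack);
            return sp[-1];
        default:
            assert(0 && "unknown opcode");
            return -1;
        }
        /* Skip the cache slots that belong to the opcode just executed,
           so they are never decoded as instructions. */
        next_instr += toy_cache_entries[opcode];
    }
}
```

In this sketch the slot after TOY_ADD is never interpreted as an instruction, whatever bytes it holds, because the loop steps over it before the next fetch.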
Dispatch
Dispatch is the hot path of the entire interpreter. CPython supports three implementations:
- Computed gotos. A GCC extension that allows goto *ptr where ptr is a label address. The dispatcher computes opcode_targets[opcode] and jumps. One indirect branch per instruction, predicted well by modern CPUs because each opcode's return-from-dispatch is a separate branch with its own history.
- Tail-calling threaded dispatch. Each TARGET is a separate function annotated [[clang::musttail]]; dispatch becomes a forced tail call into the next function. Available on clang and recent GCC. Lets each opcode have its own function and its own branch-history slot, with no per-call overhead.
- Switch fallback. A plain switch (opcode). The slowest option; used on compilers that support neither computed gotos nor musttail.
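The compile-time choice between computed gotos and a switch can be sketched with a pair of macros. The following is an illustrative toy under stated assumptions: the three opcodes, the USE_COMPUTED_GOTOS switch, and the targets table are invented here, and CPython's real macros carry considerably more state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes for this sketch. */
enum { OP_INC = 0, OP_DOUBLE = 1, OP_HALT = 2 };

#if defined(__GNUC__)   /* GCC and clang both accept labels as values */
#  define USE_COMPUTED_GOTOS 1
#else
#  define USE_COMPUTED_GOTOS 0
#endif

#if USE_COMPUTED_GOTOS
#  define TARGET(op) TARGET_##op
#  define DISPATCH() \
       do { assert(*pc <= OP_HALT); goto *targets[*pc++]; } while (0)
#else
#  define TARGET(op) case op
#  define DISPATCH() continue
#endif

/* Apply a toy program to an accumulator and return the result. */
static int dispatch_demo(const uint8_t *pc, int acc)
{
#if USE_COMPUTED_GOTOS
    /* Label addresses indexed by opcode, in the spirit of opcode_targets. */
    static void *const targets[] = {
        &&TARGET_OP_INC, &&TARGET_OP_DOUBLE, &&TARGET_OP_HALT
    };
    DISPATCH();
#else
    for (;;) {
        switch (*pc++) {
        default:
            return -1;  /* unknown opcode */
#endif
        TARGET(OP_INC):
            acc += 1;
            DISPATCH();
        TARGET(OP_DOUBLE):
            acc *= 2;
            DISPATCH();
        TARGET(OP_HALT):
            return acc;
#if !USE_COMPUTED_GOTOS
        }
    }
#endif
}
```

The case bodies are written once; only the expansion of TARGET and DISPATCH changes. With computed gotos each body ends in its own indirect jump, whereas the switch variant funnels every instruction through one shared branch at the top of the loop.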
The selection is made at compile time. The fast paths are defined in
Python/ceval_macros.h:
/* Python/ceval_macros.h:91 */
#define Py_MUSTTAIL [[clang::musttail]]
/* Python/ceval_macros.h:118 */
#define DISPATCH_GOTO() \
    goto *opcode_targets[opcode];
/* Python/ceval_macros.h:164 */
#define NEXTOPARG() \
    do { \
        _Py_CODEUNIT word = {.cache = FT_ATOMIC_LOAD_UINT16_RELAXED(*(uint16_t*)next_instr)}; \
        opcode = word.op.code; \
        oparg = word.op.arg; \
    } while (0)
Each code unit is 16 bits packed as 8-bit opcode plus 8-bit oparg.
FT_ATOMIC_LOAD_UINT16_RELAXED is a relaxed atomic load; on the
non-free-threaded build it compiles to a plain load.
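The packing can be sketched with a small union. This is a stand-alone illustration: demo_codeunit, demo_pack, and demo_nextoparg are hypothetical names modelled on the word.cache / word.op.code / word.op.arg accesses in NEXTOPARG above, and the load here is a plain one rather than the relaxed atomic load.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit unit, viewable as a raw word or as an opcode/oparg pair. */
typedef union {
    uint16_t cache;
    struct {
        uint8_t code;
        uint8_t arg;
    } op;
} demo_codeunit;

/* Build a code unit from its two byte-sized fields. */
static demo_codeunit demo_pack(uint8_t opcode, uint8_t oparg)
{
    demo_codeunit u;
    u.op.code = opcode;
    u.op.arg = oparg;
    return u;
}

/* Load the whole 16-bit word once, then split it into opcode and oparg. */
static void demo_nextoparg(const uint16_t *next_instr, int *opcode, int *oparg)
{
    assert(next_instr != 0);
    /* Plain load; a free-threaded build would use a relaxed atomic here. */
    demo_codeunit word = { .cache = *next_instr };
    *opcode = word.op.code;
    *oparg = word.op.arg;
}
```

Reading the word in a single load means the opcode and oparg always come from the same snapshot of the unit, which matters once another thread may rewrite the instruction in place.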
The bytecodes.c DSL
Python/bytecodes.c is a pseudo-C file. The Python toolchain
parses it; the C compiler never sees it. The DSL describes each
opcode's body, its inputs and outputs on the value stack, the
inline cache layout, and the family relationships used by the
specializer.
A simple instruction:
inst(LOAD_FAST, (-- value)) {
    value = GETLOCAL(oparg);
    Py_INCREF(value);
}
inst(NAME, (inputs -- outputs)) declares the stack effect: the
items before -- are popped, the items after are pushed. The body
runs with the inputs already bound to C variables and the outputs
expected to be assigned before the case ends. The generator
synthesises the pops and pushes from the signature.
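To make the stack-effect signature concrete, here is a hand-written approximation of the kind of code such a signature implies. It is only a sketch: mini_frame and both functions are hypothetical, plain ints stand in for object pointers (so there is no reference counting), and the real generated code looks different.

```c
#include <assert.h>

/* Hypothetical miniature frame: a few locals and a small value stack. */
typedef struct {
    int locals[4];
    int stack[8];
    int *stack_pointer;
} mini_frame;

static void mini_frame_init(mini_frame *f)
{
    for (int i = 0; i < 4; i++) {
        f->locals[i] = 0;
    }
    f->stack_pointer = f->stack;
}

/* Signature (-- value): nothing popped, body runs, one value pushed. */
static void mini_load_fast(mini_frame *f, int oparg)
{
    int value;
    assert(oparg >= 0 && oparg < 4);
    assert(f->stack_pointer < f->stack + 8);
    value = f->locals[oparg];        /* the body written in the DSL */
    *f->stack_pointer++ = value;     /* push derived from "-- value" */
}

/* Signature (left, right -- res): two inputs bound, body runs, one output. */
static void mini_binary_add(mini_frame *f)
{
    assert(f->stack_pointer - f->stack >= 2);
    int right = f->stack_pointer[-1];  /* inputs bound before the body */
    int left = f->stack_pointer[-2];
    int res = left + right;            /* the body */
    f->stack_pointer -= 1;             /* net effect: two popped, one pushed */
    f->stack_pointer[-1] = res;
}
```

The body in each function is the single line marked as such; everything around it is bookkeeping that follows mechanically from the signature, which is why it can be generated.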
A specialised instruction with a cache slot:
inst(LOAD_GLOBAL_MODULE, (unused/1, unused/1, version/1, index/1 -- res, null if (oparg & 1))) {
    PyDictObject *dict = (PyDictObject *)GLOBALS();
    DEOPT_IF(dict->ma_keys->dk_version != version, LOAD_GLOBAL);
    /* ... */
}
The version/1 and index/1 declare cache slots of one code unit
each, named for use in the body. DEOPT_IF(cond, op) is the
escape hatch that falls back to the unspecialised opcode when the
cache no longer matches.
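The guard-then-fallback pattern behind DEOPT_IF can be sketched in isolation. Everything below is hypothetical (guard_table, guard_cache, and the two lookup functions); it models only the idea that a cached version is compared against the live one and the generic path runs when they differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table with a version that changes when its layout changes. */
typedef struct {
    uint32_t version;
    int values[4];
} guard_table;

/* What the specialised site remembers from the last time it looked. */
typedef struct {
    uint32_t version;   /* table version seen at specialisation time */
    uint16_t index;     /* slot recorded at that time */
} guard_cache;

static int guard_generic_calls = 0;

/* Generic path: stands in for the unspecialised opcode. */
static int guard_lookup_generic(const guard_table *t, int key)
{
    guard_generic_calls++;
    return t->values[key & 3];
}

/* Specialised path: check the guard, then use the cached index. */
static int guard_lookup_fast(const guard_table *t, const guard_cache *c, int key)
{
    if (t->version != c->version) {
        /* Cache is stale: fall back, as DEOPT_IF does. */
        return guard_lookup_generic(t, key);
    }
    assert(c->index < 4);
    return t->values[c->index];
}
```

The fast path costs one comparison and one indexed load as long as the guard holds; correctness never depends on the cache, because a mismatch always routes through the generic lookup.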
A family declaration ties a generic opcode to its specialisations:
family(LOAD_GLOBAL, INLINE_CACHE_ENTRIES_LOAD_GLOBAL) = {
    LOAD_GLOBAL_MODULE,
    LOAD_GLOBAL_BUILTIN,
};
The generator emits the family table the specializer reads, the metadata header the assembler reads (to know how much cache to reserve), and the case bodies the eval loop runs.
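As a rough picture of what a family table makes possible, the sketch below stores one family and answers two questions: which generic opcode a name belongs to, and how many cache entries to reserve. The fam_entry layout and function names are invented for illustration and do not reflect the generated header; the cache size of 4 is taken from the four /1 slots in the LOAD_GLOBAL_MODULE signature shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one family declaration. */
typedef struct {
    const char *generic;        /* the unspecialised opcode */
    int cache_entries;          /* cache slots shared by the whole family */
    const char *members[4];     /* specialisations, NULL-terminated */
} fam_entry;

static const fam_entry fam_table[] = {
    { "LOAD_GLOBAL", 4, { "LOAD_GLOBAL_MODULE", "LOAD_GLOBAL_BUILTIN", NULL } },
};

/* Find the family that a generic or specialised opcode name belongs to. */
static const fam_entry *fam_of(const char *opname)
{
    assert(opname != NULL);
    for (size_t i = 0; i < sizeof(fam_table) / sizeof(fam_table[0]); i++) {
        const fam_entry *fam = &fam_table[i];
        if (strcmp(fam->generic, opname) == 0) {
            return fam;
        }
        for (size_t j = 0; fam->members[j] != NULL; j++) {
            if (strcmp(fam->members[j], opname) == 0) {
                return fam;
            }
        }
    }
    return NULL;   /* not part of any family */
}

/* Cache slots to reserve after the opcode; 0 if it is not specialisable. */
static int fam_cache_entries(const char *opname)
{
    const fam_entry *fam = fam_of(opname);
    return fam ? fam->cache_entries : 0;
}
```

Every member of a family answers with the same cache size, which is what lets the specializer swap one member for another in place without moving any following instruction.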
A super-instruction fuses two opcodes:
super(LOAD_FAST_LOAD_FAST) = LOAD_FAST + LOAD_FAST;
The optimiser pass in Python/flowgraph.c rewrites a LOAD_FAST a; LOAD_FAST b pair into the fused super-instruction. The generator
emits the fused case body as the concatenation of the two
component bodies.
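A fused body can be pictured as two component bodies run back to back under one dispatch. The sketch below is hypothetical throughout (fuse_frame and both functions); in particular it assumes, for illustration only, that the fused instruction carries its two local indices in the high and low four bits of a single oparg.

```c
#include <assert.h>

/* Hypothetical miniature frame with int locals and an int value stack. */
typedef struct {
    int locals[16];
    int stack[8];
    int *sp;
} fuse_frame;

/* Single LOAD_FAST: push one local. */
static void fuse_load_fast(fuse_frame *f, int oparg)
{
    assert(oparg >= 0 && oparg < 16);
    *f->sp++ = f->locals[oparg];
}

/* Fused form: the two component bodies concatenated, one dispatch. */
static void fuse_load_fast_load_fast(fuse_frame *f, int oparg)
{
    assert(oparg >= 0 && oparg <= 255);
    int oparg1 = oparg >> 4;    /* assumption: first index in the high nibble */
    int oparg2 = oparg & 15;    /* assumption: second index in the low nibble */
    *f->sp++ = f->locals[oparg1];   /* first component body */
    *f->sp++ = f->locals[oparg2];   /* second component body */
}
```

Both forms leave the stack in the same state; the fused one simply pays for one fetch and one dispatch instead of two.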
The cases generator
Tools/cases_generator/ is a small compiler that reads
bytecodes.c and produces multiple outputs. The pipeline:
- parsing.py tokenises the DSL.
- analyzer.py builds the graph of instructions, families, macros, and super-instructions; computes per-instruction metadata (stack effect, error effect, cache size).
- tier1_generator.py emits generated_cases.c.h for the Tier-1 eval loop in ceval.c.
- tier2_generator.py emits executor_cases.c.h for the Tier-2 uop interpreter in optimizer.c.
- opcode_metadata_generator.py emits the metadata header.
- jit_generator.py emits the JIT template tables.
The generator is what makes the DSL practical. Without it, every edit to an opcode would need to be made in four places (Tier 1, Tier 2, metadata, JIT) and kept in sync by convention. With it, one edit propagates.
Inline caches
Specialisable opcodes reserve cache slots immediately after the
opcode in co_code. The eval loop skips them at dispatch
(next_instr += INLINE_CACHE_ENTRIES_*) and reads them
explicitly in the body. The cache layout is described in
Include/internal/pycore_code.h:
/* Include/internal/pycore_code.h _PyAttrCache */
typedef struct {
    uint16_t counter;     /* backoff counter for specialisation */
    uint16_t version[2];  /* 32-bit type version split into two u16 */
    uint16_t index;       /* descriptor index or dict offset */
} _PyAttrCache;
The first slot of every specialisable instruction is the backoff counter, which controls when the specializer next looks at this site. See specializer.
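Reading a cache laid out this way can be sketched as follows. The struct mirrors the fields shown above under a different name; demo_read_u32, demo_write_u32, and demo_next_instr are hypothetical helpers for this illustration, with memcpy used to move the 32-bit value in and out of its two 16-bit slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as the cache struct shown above. */
typedef struct {
    uint16_t counter;
    uint16_t version[2];
    uint16_t index;
} demo_attr_cache;

/* Cache size in code units, and the slot where the version starts. */
#define DEMO_CACHE_ENTRIES (sizeof(demo_attr_cache) / sizeof(uint16_t))
#define DEMO_VERSION_SLOT  (offsetof(demo_attr_cache, version) / sizeof(uint16_t))

/* Store a 32-bit value across two consecutive 16-bit slots. */
static void demo_write_u32(uint16_t *p, uint32_t val)
{
    assert(p != NULL);
    memcpy(p, &val, sizeof(val));
}

/* Load a 32-bit value back from two consecutive 16-bit slots. */
static uint32_t demo_read_u32(const uint16_t *p)
{
    uint32_t val;
    assert(p != NULL);
    memcpy(&val, p, sizeof(val));
    return val;
}

/* The next instruction starts after the opcode unit and its cache. */
static uint16_t *demo_next_instr(uint16_t *instr)
{
    return instr + 1 + DEMO_CACHE_ENTRIES;
}
```

Using memcpy sidesteps any alignment concern: the slots are only guaranteed 16-bit alignment, so the 32-bit version cannot safely be read through a uint32_t pointer.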
The eval breaker
The eval loop checks a per-thread bitfield, the eval breaker, on every backward branch and every function entry:
/* Include/internal/pycore_ceval.h */
#define _PY_GIL_DROP_REQUEST_BIT 0
#define _PY_SIGNALS_PENDING_BIT 1
#define _PY_CALLS_TO_DO_BIT 2
#define _PY_ASYNC_EXCEPTION_BIT 3
#define _PY_GC_SCHEDULED_BIT 4
#define _PY_EVAL_PLEASE_STOP_BIT 5
Bits are set asynchronously (by signal handlers, by threads
waiting for the GIL, by pending-call scheduling). The eval loop polls
tstate->eval_breaker; if non-zero it calls
_Py_HandlePending, which drains the bits in order. See
gil for the bits the GIL uses and how it cooperates with
this machinery.
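The set-then-poll protocol can be sketched with C11 atomics. This is a simplified model, not the interpreter's code: brk_thread, brk_request, and brk_poll are hypothetical, only three bits are modelled, and "handling" a bit here just records its number.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical subset of the breaker bits, lowest bit handled first. */
enum {
    BRK_GIL_DROP_REQUEST_BIT = 0,
    BRK_SIGNALS_PENDING_BIT = 1,
    BRK_CALLS_TO_DO_BIT = 2,
    BRK_NBITS = 3
};

typedef struct {
    atomic_uintptr_t eval_breaker;
} brk_thread;

/* Requester side: set one bit; safe to call from another thread. */
static void brk_request(brk_thread *ts, int bit)
{
    assert(bit >= 0 && bit < BRK_NBITS);
    atomic_fetch_or(&ts->eval_breaker, (uintptr_t)1 << bit);
}

/* Eval-loop side: cheap zero test, then drain set bits in ascending order.
   Writes the handled bit numbers to `handled` and returns how many. */
static int brk_poll(brk_thread *ts, int *handled)
{
    int n = 0;
    if (atomic_load(&ts->eval_breaker) == 0) {
        return 0;                       /* common case: nothing pending */
    }
    for (int bit = 0; bit < BRK_NBITS; bit++) {
        uintptr_t mask = (uintptr_t)1 << bit;
        /* Clear the bit atomically and learn whether it had been set. */
        uintptr_t old = atomic_fetch_and(&ts->eval_breaker, ~mask);
        if (old & mask) {
            handled[n++] = bit;
        }
    }
    return n;
}
```

The design point is that the common case costs a single load and compare; all the slower work sits behind that test and only runs when some bit is set.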
Frame transitions
Function calls, returns, and generator suspends do not start a
separate C-level invocation of the eval loop; they are flow
transitions inside it. CALL dispatches
into a helper that pushes a new _PyInterpreterFrame and either
re-enters the eval loop or hands off to a C function. RETURN_VALUE
pops the frame, pushes the result on the caller's stack, and
continues with the caller's next_instr. The fact that frame
transitions stay in the loop avoids the cost of recursing the C
stack on every Python call. See frame.
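A toy loop shows how calls and returns can stay inside one C function. The sketch is illustrative only: f_frame, the four opcodes, and the fixed frame pool are hypothetical, and each "function" is just an array of code units selected by index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes for this sketch. */
enum { F_PUSH, F_ADD, F_CALL, F_RETURN };

typedef struct { uint8_t code; uint8_t arg; } f_unit;

typedef struct f_frame {
    const f_unit *next_instr;   /* where this frame resumes */
    int stack[8];
    int *sp;
    struct f_frame *previous;   /* caller's frame, NULL for the entry frame */
} f_frame;

#define F_MAX_FRAMES 8

/* Run functions[entry]; calls switch frames without recursing in C. */
static int f_run(const f_unit *const *functions, int entry)
{
    f_frame pool[F_MAX_FRAMES];
    int depth = 0;
    f_frame *frame = &pool[depth++];
    frame->next_instr = functions[entry];
    frame->sp = frame->stack;
    frame->previous = NULL;

    for (;;) {
        f_unit u = *frame->next_instr++;
        switch (u.code) {
        case F_PUSH:
            *frame->sp++ = u.arg;
            break;
        case F_ADD:
            frame->sp[-2] += frame->sp[-1];
            frame->sp--;
            break;
        case F_CALL: {
            /* Push a new frame and keep looping in the same C function. */
            assert(depth < F_MAX_FRAMES);
            f_frame *callee = &pool[depth++];
            callee->next_instr = functions[u.arg];
            callee->sp = callee->stack;
            callee->previous = frame;
            frame = callee;
            break;
        }
        case F_RETURN: {
            /* Pop the frame and push its result on the caller's stack. */
            int result = frame->sp[-1];
            f_frame *caller = frame->previous;
            depth--;
            if (caller == NULL) {
                return result;          /* entry frame finished */
            }
            *caller->sp++ = result;
            frame = caller;             /* resume at caller's next_instr */
            break;
        }
        default:
            assert(0 && "unknown opcode");
            return -1;
        }
    }
}
```

Call depth in this model is bounded by the frame pool, not by the C stack: however deep the toy calls nest, f_run itself is entered once.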
The Tier-2 path
When a backward branch hits its threshold, the eval loop hands off to the Tier-2 optimiser to project a trace and (optionally) JIT-compile it. The handoff:
/* Python/ceval.c (sketch) */
TARGET(JUMP_BACKWARD):
    if (--counter == 0) {
        executor = _PyOptimizer_Optimize(frame, next_instr, ...);
        if (executor) {
            ENTER_EXECUTOR(executor);
        }
    }
    DISPATCH();
ENTER_EXECUTOR either jumps into JIT-compiled machine code or
runs the uop trace through executor_cases.c.h. See
optimizer.
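The warm-up behaviour can be modelled with a countdown at the branch site. This is a deliberately small sketch with hypothetical names (hot_site, hot_try_optimize, hot_jump_backward): the stand-in optimiser always succeeds, and "running in Tier 2" is reduced to incrementing a counter.

```c
#include <assert.h>

/* Hypothetical per-branch state for this sketch. */
typedef struct {
    int counter;        /* warm-up countdown stored at the branch */
    int has_executor;   /* nonzero once a trace has been produced */
    int tier1_iters;    /* trips taken through the ordinary loop */
    int tier2_iters;    /* trips taken through the executor */
} hot_site;

/* Stand-in for the optimiser; in this toy it always produces an executor. */
static int hot_try_optimize(hot_site *s)
{
    s->has_executor = 1;
    return 1;
}

/* One trip through the backward branch. */
static void hot_jump_backward(hot_site *s)
{
    if (s->has_executor) {
        s->tier2_iters++;           /* later trips go straight to the executor */
        return;
    }
    assert(s->counter > 0);
    s->tier1_iters++;
    if (--s->counter == 0) {
        hot_try_optimize(s);        /* threshold reached: hand off */
    }
}
```

With a threshold of 3 and ten trips round the loop, the first three run in Tier 1 and the remaining seven in the executor, which is the shape of the trade-off: optimisation cost is only paid for loops that have shown they are hot.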
CPython 3.14 changes
- Tail-calling dispatch. The [[clang::musttail]] path is promoted to a first-class option in 3.14, with per-target branch prediction noticeably better than computed gotos on modern CPUs.
- Tier-2 default enablement. Tier 2 is no longer behind a build flag; it is on by default in 3.14, with the JIT (PEP 744) remaining opt-in via --enable-experimental-jit. Py_TIER2 build. A small build matrix; the interpreter picks the right code at compile time.
- Free-threaded interpreter (PEP 703). A separate build (./configure --disable-gil) replaces several macros and adds per-thread bytecode copies. The eval loop's structure is unchanged but the dispatch macros expand differently.
PEP touchpoints
- PEP 626. The eval loop reads the location table; the traceback machinery uses it.
- PEP 659. The specializer rewrites opcodes in place; the loop dispatches the specialised variants.
- PEP 669. Instrumentation rewrites individual opcodes to their INSTRUMENTED_* siblings.
- PEP 703. Free-threaded build uses atomic loads in NEXTOPARG and per-thread bytecode copies.
- PEP 744. Tier-2 entry from JUMP_BACKWARD and the JIT.
Reference
- Python/ceval.c, Python/ceval_macros.h, Python/bytecodes.c, Tools/cases_generator/.
- Include/internal/pycore_opcode_metadata.h (generated).
- PEP 659. Specializing adaptive interpreter.
- PEP 669. Low impact monitoring.
- PEP 703. Free threading.
- PEP 744. JIT compilation.
- Shannon, Mark. Faster CPython design notes, github.com/markshannon/faster-cpython.