Specializer
The specializer is the Tier-1 optimiser. It watches generic
opcodes such as LOAD_ATTR, BINARY_OP, CALL, and LOAD_GLOBAL
and, when one of them is executed often enough at a stable type,
rewrites it in place to a specialised variant that skips the
generic dispatch. If the assumption later breaks, the specialised
opcode deopts back to the generic form. The mechanism is
described by PEP 659.
Where the code lives
| File | Role |
|---|---|
| Python/specialize.c | The specialisation logic, one function per family. |
| Include/internal/pycore_code.h | Cache structs (_PyAttrCache, _PyLoadGlobalCache, ...). |
| Include/internal/pycore_opcode_metadata.h (generated) | INLINE_CACHE_ENTRIES_* per opcode. |
| Python/bytecodes.c | DSL: family declarations, DEOPT_IF, cache reads. |
The specializer entry point is one function per family:
/* Python/specialize.c _Py_Specialize_LoadAttr */
void _Py_Specialize_LoadAttr(_PyStackRef owner, _Py_CODEUNIT *instr,
PyObject *name);
The eval loop calls it from the generic opcode when the backoff counter hits zero. The function inspects the runtime types and either rewrites the opcode plus its inline cache or marks the site as unspecialisable (and arranges to look at it less often).
The cache
Every specialisable opcode reserves a fixed number of cache slots
right after itself in co_code. The count is per family and is
emitted into the metadata header by the cases generator:
/* Include/internal/pycore_opcode_metadata.h (generated) */
#define INLINE_CACHE_ENTRIES_LOAD_ATTR 9
#define INLINE_CACHE_ENTRIES_BINARY_OP 5
#define INLINE_CACHE_ENTRIES_CALL 3
/* ... */
The eval loop's DISPATCH() macro skips past the cache after each
instruction. The opcode body reads cache slots explicitly. The
first slot is always the backoff counter:
/* Include/internal/pycore_code.h _PyAttrCache */
typedef struct {
uint16_t counter; /* backoff */
uint16_t version[2]; /* 32-bit type version, split */
uint16_t index; /* descriptor index or dict offset */
} _PyAttrCache;
version[2] is a 32-bit type version tag split across two code
units; reading it reassembles the 32-bit value. The version is
incremented on every type modification (__class__ assignment,
__set_name__, monkey-patch); a cached entry compares its
captured version against the current one to decide whether the
specialisation still applies.
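The split version and the cache skip can be sketched in a few lines of C. This is a minimal illustration, not the CPython source: the struct mirrors the _PyAttrCache layout shown above, while the names attr_cache, write_u32, read_u32, cache_still_valid, and next_instr_index are chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as the _PyAttrCache struct shown above. */
typedef struct {
    uint16_t counter;      /* backoff */
    uint16_t version[2];   /* 32-bit type version, split */
    uint16_t index;        /* descriptor index or dict offset */
} attr_cache;

/* Store a 32-bit value across two adjacent 16-bit cache slots. */
static void write_u32(uint16_t *slots, uint32_t value)
{
    memcpy(slots, &value, sizeof(value));
}

/* Reassemble the 32-bit value from the same two slots. */
static uint32_t read_u32(const uint16_t *slots)
{
    uint32_t value;
    memcpy(&value, slots, sizeof(value));
    return value;
}

/* The cached entry applies only while the captured version is current. */
static int cache_still_valid(const attr_cache *cache, uint32_t current_version)
{
    return read_u32(cache->version) == current_version;
}

/* Index of the next instruction: one unit for the opcode plus its cache. */
static size_t next_instr_index(size_t instr_index, size_t cache_entries)
{
    return instr_index + 1 + cache_entries;
}
```

Using memcpy for both directions keeps the round trip independent of byte order and avoids an unaligned 32-bit access. With the LOAD_ATTR cache size of 9 quoted above, an instruction at index 0 is followed by the next instruction at index 10.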
The backoff counter
Counter values are interpreted with two bits of state in the low bits and the rest as a saturating count. The state machine:
- Counter hits zero: call the specialiser.
- Specialiser succeeds: rewrite the opcode; reset counter for cache invalidation tracking.
- Specialiser fails: bump the backoff power; reset the counter to 2^power - 1 so we try again later, but rarer.
- After several failures the site is marked LOAD_ATTR_ADAPTIVE with maximum backoff; it effectively stops trying.
The exponential backoff is what makes specialisation cheap in aggregate: sites that will never specialise stop being checked, while hot sites converge quickly.
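The state machine above can be modelled as a small packed counter. The sketch below is illustrative only: the field widths (BACKOFF_BITS, MAX_BACKOFF) and every function name are assumptions made for the example and may not match the real layout.

```c
#include <stdint.h>

/* Assumed field widths, for illustration only. */
#define BACKOFF_BITS 4u
#define MAX_BACKOFF 12u

typedef struct {
    uint16_t bits;   /* countdown value in the high bits, power in the low bits */
} backoff_counter;

static backoff_counter make_counter(uint16_t value, uint16_t power)
{
    backoff_counter c;
    c.bits = (uint16_t)(((uint32_t)value << BACKOFF_BITS) | power);
    return c;
}

static uint16_t counter_value(backoff_counter c)
{
    return (uint16_t)(c.bits >> BACKOFF_BITS);
}

static uint16_t counter_power(backoff_counter c)
{
    return (uint16_t)(c.bits & ((1u << BACKOFF_BITS) - 1u));
}

/* Countdown reached zero: time to call the specializer. */
static int counter_triggers(backoff_counter c)
{
    return counter_value(c) == 0;
}

/* One execution of the generic opcode: count down by one. */
static backoff_counter counter_advance(backoff_counter c)
{
    uint16_t value = counter_value(c);
    if (value == 0) {
        return c;
    }
    return make_counter((uint16_t)(value - 1), counter_power(c));
}

/* Specialization failed: bump the power and wait 2^power - 1 executions. */
static backoff_counter counter_restart(backoff_counter c)
{
    uint16_t power = counter_power(c);
    if (power < MAX_BACKOFF) {
        power++;
    }
    return make_counter((uint16_t)((1u << power) - 1u), power);
}
```

Each failed attempt roughly doubles the wait before the next one, so in this model a site that never specialises is examined only a logarithmic number of times relative to how often it runs, until the power saturates.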
The rewrite
When _Py_Specialize_LoadAttr sees a stable type, it:
- Picks a specialised opcode (LOAD_ATTR_INSTANCE_VALUE, LOAD_ATTR_SLOT, LOAD_ATTR_METHOD, ...).
- Fills the cache: the type version, the descriptor offset or dict offset, anything else the specialised body needs.
- Replaces the opcode byte in co_code with the specialised opcode.
The write is a plain store; in the free-threaded build the eval
loop reads the opcode through FT_ATOMIC_LOAD_UINT16_RELAXED,
which matches a relaxed atomic store. The per-thread bytecode
copy (PEP 703) avoids contention between threads writing
different specialisations for the same site.
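As a rough sketch of the fill-then-replace order, the toy function below specialises one site in a flat array of 16-bit units. The opcode numbers, the code_unit union, and specialize_load_attr_slot are hypothetical stand-ins for this example, and the code ignores the free-threaded atomics entirely.

```c
#include <stdint.h>

/* Hypothetical opcode numbers for the example. */
enum { OP_LOAD_ATTR = 1, OP_LOAD_ATTR_SLOT = 2 };

/* One 16-bit unit: either an opcode/oparg pair or a raw cache slot. */
typedef union {
    uint16_t cache;
    struct {
        uint8_t code;
        uint8_t arg;
    } op;
} code_unit;

/* Specialise the instruction at instr[0]; its cache starts at instr[1]. */
static void specialize_load_attr_slot(code_unit *instr,
                                      uint32_t type_version,
                                      uint16_t index)
{
    code_unit *cache = instr + 1;   /* cache[0] is the backoff counter */

    /* Fill the cache first: split version, then the slot index. */
    cache[1].cache = (uint16_t)(type_version & 0xFFFFu);
    cache[2].cache = (uint16_t)(type_version >> 16);
    cache[3].cache = index;

    /* Replace the opcode byte last; the oparg is left untouched. */
    instr->op.code = OP_LOAD_ATTR_SLOT;
}
```

Writing the opcode byte after the cache means that anything dispatching on the specialised opcode finds the cache already populated.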
The body and the deopt
Each specialised opcode body checks the assumptions cheaply, then takes the fast path:
inst(LOAD_ATTR_INSTANCE_VALUE, (unused/1, type_version/2, index/1,
unused/5, owner -- attr, null if (oparg & 1))) {
PyTypeObject *tp = Py_TYPE(owner);
DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
PyDictValues *values = _PyObject_InlineValues(owner);
DEOPT_IF(!values->valid, LOAD_ATTR);
attr = values->values[index];
DEOPT_IF(attr == NULL, LOAD_ATTR);
Py_INCREF(attr);
}
DEOPT_IF(cond, opname) is the deopt macro. If the condition
fires, the eval loop:
- Restores the opcode byte to the generic family member.
- Re-dispatches to the now-generic opcode, which executes the slow path and updates the cache for next time.
Deopt is cheap because the cache slots are still there with useful data; the next specialisation attempt has warm context.
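The guard-then-fallback shape described above can be modelled with a toy attribute load. This is a sketch of the behaviour as the text describes it, not the expansion of DEOPT_IF; toy_type, toy_object, toy_site, and execute_load_attr are all invented for the example.

```c
#include <stdint.h>

/* Hypothetical opcode numbers for the example. */
enum { OP_LOAD_ATTR = 1, OP_LOAD_ATTR_SLOT = 2 };

typedef struct {
    uint32_t version_tag;   /* changes when the type is modified */
    uint16_t attr_slot;     /* where this type keeps the attribute */
} toy_type;

typedef struct {
    const toy_type *type;
    int slots[4];
} toy_object;

typedef struct {
    uint8_t opcode;           /* generic or specialised */
    uint32_t cached_version;  /* version captured at specialisation time */
    uint16_t index;           /* slot index captured at specialisation time */
    unsigned deopts;          /* how often the guard failed */
} toy_site;

static int execute_load_attr(toy_site *site, const toy_object *obj)
{
    if (site->opcode == OP_LOAD_ATTR_SLOT) {
        if (obj->type->version_tag == site->cached_version) {
            return obj->slots[site->index];   /* fast path */
        }
        /* Guard failed: restore the generic opcode and fall through. */
        site->opcode = OP_LOAD_ATTR;
        site->deopts++;
    }
    /* Generic path: consult the type for the attribute's location. */
    return obj->slots[obj->type->attr_slot];
}
```

The cached version and index are left in place after the guard fails, which mirrors the point that the cache slots survive a deopt.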
Families
The opcodes touched by the specializer in 3.14:
| Family | Variants (selected) |
|---|---|
| LOAD_ATTR | INSTANCE_VALUE, SLOT, MODULE, METHOD, CLASS, ... |
| STORE_ATTR | INSTANCE_VALUE, SLOT, WITH_HINT |
| LOAD_GLOBAL | MODULE, BUILTIN |
| BINARY_OP | ADD_INT, ADD_FLOAT, ADD_UNICODE, SUBTRACT_INT, ... |
| BINARY_SUBSCR | LIST_INT, TUPLE_INT, DICT, GETITEM |
| STORE_SUBSCR | LIST_INT, DICT |
| CALL | PY_EXACT_ARGS, BOUND_METHOD, BUILTIN_O, LIST_APPEND, ... |
| COMPARE_OP | INT, FLOAT, STR |
| FOR_ITER | LIST, TUPLE, RANGE, GEN |
| SEND | GEN |
| UNPACK_SEQUENCE | TWO_TUPLE, LIST, TUPLE |
| TO_BOOL | BOOL, INT, LIST, NONE, STR, ALWAYS_TRUE |
The list grows over releases. The authoritative source is
Python/bytecodes.c plus the family declarations there.
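One way to picture a family is as a lookup table from every member to its generic head. The sketch below covers two of the families from the table; the opcode numbering and the names family_head, is_specialized, and same_family are assumptions for illustration, not the generated metadata.

```c
#include <stdint.h>

/* Hypothetical numbering covering two families from the table above. */
enum {
    OP_LOAD_ATTR,
    OP_LOAD_ATTR_INSTANCE_VALUE,
    OP_LOAD_ATTR_SLOT,
    OP_LOAD_GLOBAL,
    OP_LOAD_GLOBAL_MODULE,
    OP_LOAD_GLOBAL_BUILTIN,
    OP_COUNT
};

/* Map each opcode to the generic head of its family. */
static const uint8_t family_head[OP_COUNT] = {
    [OP_LOAD_ATTR]                = OP_LOAD_ATTR,
    [OP_LOAD_ATTR_INSTANCE_VALUE] = OP_LOAD_ATTR,
    [OP_LOAD_ATTR_SLOT]           = OP_LOAD_ATTR,
    [OP_LOAD_GLOBAL]              = OP_LOAD_GLOBAL,
    [OP_LOAD_GLOBAL_MODULE]       = OP_LOAD_GLOBAL,
    [OP_LOAD_GLOBAL_BUILTIN]      = OP_LOAD_GLOBAL,
};

/* A specialised opcode is any member that is not its own family head. */
static int is_specialized(uint8_t opcode)
{
    return family_head[opcode] != opcode;
}

/* Two opcodes can replace each other only within one family. */
static int same_family(uint8_t a, uint8_t b)
{
    return family_head[a] == family_head[b];
}
```

A table like this gives a deopt a constant-time answer to "which generic opcode do I fall back to?" and lets debug checks confirm that a rewrite never crosses families.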
Specialisation statistics
CPython can be built with --enable-pystats to gather
specialisation hit/miss/deopt rates per opcode. The collected data
is what drives further additions to the specializer; the in-tree
Tools/scripts/summarize_stats.py renders a digest.
Free-threaded considerations
The specializer's atomicity contract in the free-threaded build:
- Reads of the opcode and the cache use FT_ATOMIC_LOAD_*_RELAXED.
- Writes are made through the per-thread bytecode copy, so two threads do not contend on the same memory.
- Type-version invalidation uses the type's atomic version tag; a bumped version makes every cached entry stale on next read.
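The first two points can be illustrated with standard C11 atomics. This is a sketch under stated assumptions: it uses <stdatomic.h> directly rather than the FT_ATOMIC_* wrappers, and thread_bytecode, init_copy, load_unit, and store_unit are hypothetical names for the example.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define N_UNITS 4

/* One thread's private copy of a (tiny) bytecode array. */
typedef struct {
    _Atomic uint16_t units[N_UNITS];
} thread_bytecode;

/* Populate a per-thread copy from the shared, unspecialised bytecode. */
static void init_copy(thread_bytecode *copy, const uint16_t *master)
{
    for (size_t i = 0; i < N_UNITS; i++) {
        atomic_init(&copy->units[i], master[i]);
    }
}

/* Relaxed read of an opcode or cache unit. */
static uint16_t load_unit(thread_bytecode *copy, size_t i)
{
    return atomic_load_explicit(&copy->units[i], memory_order_relaxed);
}

/* Relaxed write, as a specialising thread would do on its own copy. */
static void store_unit(thread_bytecode *copy, size_t i, uint16_t unit)
{
    atomic_store_explicit(&copy->units[i], unit, memory_order_relaxed);
}
```

Because each thread specialises only its own copy, a rewrite made by one thread never changes what another thread dispatches on; the relaxed operations guarantee only that each individual 16-bit read or write is not torn.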
CPython 3.14 changes
- More specialised forms. 3.14 added several new specialised variants, notably extending CALL and BINARY_OP.
- Reduced cache size. Some families had their cache size trimmed; the net effect of more variants with smaller caches is modestly higher specialisation coverage with similar code-size cost.
- Tier-2 cooperation. Tier-2 trace projection reads the specialised opcodes (it cannot project through a generic one). See optimizer.
PEP touchpoints
- PEP 659. Specializing adaptive interpreter (the design document).
- PEP 703. Free threading; per-thread bytecode and atomic reads.
Reference
- Python/specialize.c, Python/bytecodes.c, Include/internal/pycore_code.h, Include/internal/pycore_opcode_metadata.h.
- PEP 659. Specializing adaptive interpreter.
- Shannon, Mark. Faster CPython design notes.