Specializer

The specializer is the Tier-1 optimiser. It watches generic opcodes such as LOAD_ATTR, BINARY_OP, CALL, and LOAD_GLOBAL and, when one of them is executed often enough at a stable type, rewrites it in place to a specialised variant that skips the generic dispatch. If the assumption later breaks, the specialised opcode deopts back to the generic form. The mechanism is described by PEP 659.

Where the code lives

File                                                       Role
Python/specialize.c                                        The specialisation logic, one function per family.
Include/internal/pycore_code.h                             Cache structs (_PyAttrCache, _PyLoadGlobalCache, ...).
Include/internal/pycore_opcode_metadata.h (generated)      INLINE_CACHE_ENTRIES_* per opcode.
Python/bytecodes.c                                         DSL: family declarations, DEOPT_IF, cache reads.

The specializer exposes one entry point per family. For LOAD_ATTR:

/* Python/specialize.c _Py_Specialize_LoadAttr */
void _Py_Specialize_LoadAttr(_PyStackRef owner, _Py_CODEUNIT *instr,
                             PyObject *name);

The eval loop calls it from the generic opcode when the backoff counter hits zero. The function inspects the runtime types and either rewrites the opcode plus its inline cache or marks the site as unspecialisable (and arranges to look at it less often).

The cache

Every specialisable opcode reserves a fixed number of 16-bit cache slots immediately after itself in the code object's bytecode array (co_code_adaptive in the C struct). The count is per family and is emitted into the metadata header by the cases generator:

/* Include/internal/pycore_opcode_metadata.h (generated) */
#define INLINE_CACHE_ENTRIES_LOAD_ATTR 9
#define INLINE_CACHE_ENTRIES_BINARY_OP 5
#define INLINE_CACHE_ENTRIES_CALL 3
/* ... */
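
As a rough illustration of how these per-family counts shape the instruction stream, the sketch below walks a toy bytecode array and steps over each opcode's cache entries. Everything in it is hypothetical: the toy_ names, the opcode numbers, and the "opcode in the low byte" layout are assumptions made for the example, not CPython definitions. Only the cache counts 9 and 3 are borrowed from the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcode numbers (hypothetical, not CPython's real values). */
enum { TOY_NOP = 0, TOY_LOAD_ATTR = 1, TOY_CALL = 2, TOY_NUM_OPS = 3 };

/* Number of 16-bit cache entries that follow each toy opcode. */
static const uint8_t toy_caches[TOY_NUM_OPS] = {
    [TOY_NOP] = 0,
    [TOY_LOAD_ATTR] = 9,
    [TOY_CALL] = 3,
};

/* One 16-bit code unit; this sketch keeps the opcode in the low byte. */
typedef uint16_t toy_codeunit;

static uint8_t toy_opcode(toy_codeunit unit)
{
    return (uint8_t)(unit & 0xFF);
}

/* Count the instructions in code[0..len), stepping over inline caches
 * so that cache contents are never interpreted as opcodes. */
static size_t toy_count_instructions(const toy_codeunit *code, size_t len)
{
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        uint8_t op = toy_opcode(code[i]);
        assert(op < TOY_NUM_OPS);
        i += 1 + (size_t)toy_caches[op];  /* opcode unit plus its cache */
        count++;
    }
    return count;
}
```

With a LOAD_ATTR, a CALL, and a NOP laid out back to back, the array is 10 + 4 + 1 = 15 code units long but holds only three instructions; arbitrary data in the cache slots does not affect the walk.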

Each instruction advances next_instr past its own opcode and cache entries before dispatching to the next one, so cache slots are never executed as instructions. The opcode body reads cache slots explicitly. The first slot is always the backoff counter:

/* Include/internal/pycore_code.h _PyAttrCache */
typedef struct {
    uint16_t counter;    /* backoff */
    uint16_t version[2]; /* 32-bit type version, split */
    uint16_t index;      /* descriptor index or dict offset */
} _PyAttrCache;

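version[2] holds a 32-bit type version tag split across two 16-bit code units; the opcode body reassembles the 32-bit value when it reads the cache. A type's version tag is invalidated whenever the type is modified (for example, when a class attribute is assigned or deleted, or when __bases__ changes), and a fresh tag is assigned the next time one is needed. A cached entry compares its captured version against the owner type's current tag to decide whether the specialisation still applies. An object whose __class__ has been reassigned fails the same check, because its type, and therefore the tag, is now different.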
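
The split storage can be sketched in a few lines. This is a minimal illustration, assuming a struct shaped like the one above; the toy_ names are hypothetical helpers written for this example and are not CPython's own accessors. memcpy is used so the 32-bit value can be read from 16-bit slots without alignment or aliasing problems.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the cache layout shown above. */
typedef struct {
    uint16_t counter;
    uint16_t version[2];  /* 32-bit type version, split */
    uint16_t index;
} toy_attr_cache;

/* Store a 32-bit value across two adjacent 16-bit slots. */
static void toy_write_u32(uint16_t *slots, uint32_t value)
{
    assert(slots != NULL);
    memcpy(slots, &value, sizeof(value));
}

/* Reassemble the 32-bit value from the two slots. */
static uint32_t toy_read_u32(const uint16_t *slots)
{
    uint32_t value;
    memcpy(&value, slots, sizeof(value));
    return value;
}

/* Guard used by a specialised body: nonzero when the cached version
 * still equals the type's current tag. */
static int toy_version_matches(const toy_attr_cache *cache,
                               uint32_t current_tag)
{
    return toy_read_u32(cache->version) == current_tag;
}
```

Writing a tag and reading it back returns the same 32-bit value, and the neighbouring counter and index slots are left untouched.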

The backoff counter

The counter is a 16-bit value that packs two fields: a countdown value and a small backoff exponent (the exact bit widths are defined in Include/internal/pycore_backoff.h). The state machine:

  • Counter hits zero: call the specializer.
  • Specializer succeeds: rewrite the opcode and reset the counter to a cooldown value, which governs how many misses the site tolerates before it is examined again.
  • Specializer fails: bump the backoff exponent and reset the countdown to 2^exponent - 1, so the site is retried later but less often.
  • After repeated failures the exponent saturates at its maximum; the site stays on the generic opcode and is re-examined only rarely.

The exponential backoff is what makes specialisation cheap in aggregate: sites that will never specialise stop being checked, while hot sites converge quickly.
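
To make the cost argument concrete, here is a small simulation of the failure branch of the state machine above. It is a sketch under stated assumptions: the toy_ names, the separate value and backoff fields (rather than one packed 16-bit word), the saturation limit TOY_MAX_BACKOFF, and the choice to start with a counter that triggers immediately are all hypothetical and chosen for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed saturation limit for this sketch only. */
#define TOY_MAX_BACKOFF 6

typedef struct {
    uint16_t value;    /* countdown until the next specialisation attempt */
    uint8_t  backoff;  /* exponent applied when the counter is restarted */
} toy_counter;

static int toy_counter_triggers(toy_counter c)
{
    return c.value == 0;
}

static toy_counter toy_counter_advance(toy_counter c)
{
    assert(c.value > 0);
    c.value--;
    return c;
}

/* Restart after a failed attempt: raise the exponent (saturating)
 * and set the countdown to 2^backoff - 1. */
static toy_counter toy_counter_restart(toy_counter c)
{
    if (c.backoff < TOY_MAX_BACKOFF) {
        c.backoff++;
    }
    c.value = (uint16_t)((1u << c.backoff) - 1);
    return c;
}

/* How many executions of a never-specialisable site pass before the
 * specializer has been tried (and has failed) `failures` times. */
static unsigned toy_executions_for_failures(unsigned failures)
{
    toy_counter c = { 0, 0 };
    unsigned executions = 0;
    unsigned failed = 0;
    while (failed < failures) {
        executions++;
        if (toy_counter_triggers(c)) {
            failed++;
            c = toy_counter_restart(c);
        } else {
            c = toy_counter_advance(c);
        }
    }
    return executions;
}
```

Under these assumptions the first four failed attempts happen on executions 1, 3, 7, and 15: the gap between attempts doubles each time until the exponent saturates, which is why a site that never specialises costs very little over a long run.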

The rewrite

When _Py_Specialize_LoadAttr sees a stable type, it:

  1. Picks a specialised opcode (LOAD_ATTR_INSTANCE_VALUE, LOAD_ATTR_SLOT, LOAD_ATTR_METHOD_WITH_VALUES, ...).
  2. Fills the cache: the type version, the descriptor offset or dict offset, anything else the specialised body needs.
  3. Replaces the opcode byte in co_code with the specialised opcode.

In the default build the write is a plain store. In the free-threaded build it is a relaxed atomic store, which pairs with the eval loop's read of the code unit through FT_ATOMIC_LOAD_UINT16_RELAXED. The per-thread bytecode copy (PEP 703) avoids contention between threads writing different specialisations for the same site.
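
The three rewrite steps can be sketched with C11 atomics. This is an illustration only: the toy_site struct, the opcode numbers, and the helper names are hypothetical, the cache fields are flattened into ordinary members, and writing the cache before the opcode is a choice made for this sketch. It runs single-threaded and does not model the per-thread bytecode copies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy opcode numbers (hypothetical). */
enum { TOY_LOAD_ATTR = 1, TOY_LOAD_ATTR_INSTANCE_VALUE = 2 };

/* A toy instruction site: an opcode byte plus flattened cache fields. */
typedef struct {
    _Atomic uint8_t opcode;
    uint32_t type_version;
    uint16_t index;
} toy_site;

static void toy_site_init(toy_site *site)
{
    atomic_init(&site->opcode, TOY_LOAD_ATTR);
    site->type_version = 0;
    site->index = 0;
}

/* Steps 2 and 3 from the list above: fill the cache, then publish
 * the specialised opcode with a relaxed atomic store. */
static void toy_specialize(toy_site *site, uint32_t type_version,
                           uint16_t index)
{
    assert(type_version != 0);  /* this sketch treats 0 as "no version" */
    site->type_version = type_version;
    site->index = index;
    atomic_store_explicit(&site->opcode, TOY_LOAD_ATTR_INSTANCE_VALUE,
                          memory_order_relaxed);
}

/* What the eval loop would do when fetching the opcode: a relaxed load. */
static uint8_t toy_fetch_opcode(toy_site *site)
{
    return atomic_load_explicit(&site->opcode, memory_order_relaxed);
}
```

Relaxed ordering only guarantees that the opcode byte itself is read and written without tearing; it says nothing about when another thread sees the cache fields, which is one reason a per-thread copy of the bytecode simplifies the picture.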

The body and the deopt

Each specialised opcode body checks the assumptions cheaply, then takes the fast path:

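/* Cache operands: 1 + 2 + 1 + 5 = 9 entries, matching
   INLINE_CACHE_ENTRIES_LOAD_ATTR. */
inst(LOAD_ATTR_INSTANCE_VALUE, (unused/1, type_version/2, index/1,
                                unused/5, owner -- attr, null if (oparg & 1))) {
    PyTypeObject *tp = Py_TYPE(owner);
    /* Guard: the owner's type is the one seen at specialisation time. */
    DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
    PyDictValues *values = _PyObject_InlineValues(owner);
    /* Guard: the inline values are still valid. */
    DEOPT_IF(!values->valid, LOAD_ATTR);
    attr = values->values[index];
    /* Guard: the attribute is currently set. */
    DEOPT_IF(attr == NULL, LOAD_ATTR);
    /* Fast path: return a new reference to the attribute. */
    Py_INCREF(attr);
}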

DEOPT_IF(cond, opname) is the deopt macro. If the condition fires, the eval loop:

  1. Jumps to the body of the generic family member (opname) without rewriting the opcode byte.
  2. Executes the slow path for this one execution and advances the site's counter; when the counter reaches zero the specializer runs again and either re-specialises the site or restores the generic opcode.

Deopt is cheap because a miss rewrites nothing: the instruction and its cache slots stay in place, and a site that misses only occasionally keeps its fast path.
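
The guard-then-fallback shape of a specialised body can be shown with a toy attribute load. This sketch models only the version check and the fall-through to a slower lookup; it does not model what happens to the opcode byte or the counter. All names (toy_type, toy_object, toy_cache, toy_load_attr) are hypothetical, and the "generic path" is a linear search by name standing in for the real lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NATTRS 2

typedef struct {
    uint32_t version_tag;
    const char *names[TOY_NATTRS];  /* attribute name per value slot */
} toy_type;

typedef struct {
    const toy_type *type;
    int values[TOY_NATTRS];
} toy_object;

typedef struct {
    uint32_t type_version;  /* tag captured at specialisation time */
    uint16_t index;         /* value slot captured at specialisation time */
    unsigned hits;
    unsigned misses;
} toy_cache;

/* Stand-in for the generic path: search by name. Returns 1 if found. */
static int toy_generic_load(const toy_object *obj, const char *name, int *out)
{
    for (size_t i = 0; i < TOY_NATTRS; i++) {
        if (strcmp(obj->type->names[i], name) == 0) {
            *out = obj->values[i];
            return 1;
        }
    }
    return 0;
}

/* Specialised load: check the guard, take the fast path on a match,
 * otherwise fall back to the generic lookup. Returns 1 on success. */
static int toy_load_attr(const toy_object *obj, const char *name,
                         toy_cache *cache, int *out)
{
    if (obj->type->version_tag == cache->type_version) {
        assert(cache->index < TOY_NATTRS);
        cache->hits++;
        *out = obj->values[cache->index];  /* no name lookup needed */
        return 1;
    }
    cache->misses++;
    return toy_generic_load(obj, name, out);
}
```

A cache specialised for one type answers loads on that type from the stored index, while an object of a different type with a different attribute layout fails the guard and still gets the right value through the slow path.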

Families

The opcodes touched by the specializer in 3.14:

Family             Variants (selected)
LOAD_ATTR          INSTANCE_VALUE, SLOT, MODULE, METHOD, CLASS, ...
STORE_ATTR         INSTANCE_VALUE, SLOT, WITH_HINT
LOAD_GLOBAL        MODULE, BUILTIN
BINARY_OP          ADD_INT, ADD_FLOAT, ADD_UNICODE, SUBTRACT_INT, ...
BINARY_SUBSCR      LIST_INT, TUPLE_INT, DICT, GETITEM
STORE_SUBSCR       LIST_INT, DICT
CALL               PY_EXACT_ARGS, BOUND_METHOD, BUILTIN_O, LIST_APPEND, ...
COMPARE_OP         INT, FLOAT, STR
FOR_ITER           LIST, TUPLE, RANGE, GEN
SEND               GEN
UNPACK_SEQUENCE    TWO_TUPLE, LIST, TUPLE
TO_BOOL            BOOL, INT, LIST, NONE, STR, ALWAYS_TRUE

The list grows over releases. The authoritative source is the set of family declarations in Python/bytecodes.c.

Specialisation statistics

CPython can be built with --enable-pystats to gather specialisation hit/miss/deopt rates per opcode. The collected data is what drives further additions to the specializer; the in-tree Tools/scripts/summarize_stats.py renders a digest.

Free-threaded considerations

The specializer's atomicity contract in the free-threaded build:

  • Reads of the opcode and the cache use FT_ATOMIC_LOAD_*_RELAXED.
  • Writes are made through the per-thread bytecode copy, so two threads do not contend on the same memory.
  • Type-version invalidation uses the type's atomic version tag; a bumped version makes every cached entry stale on next read.

CPython 3.14 changes

  • More specialised forms. 3.14 added several new specialised variants, notably extending CALL and BINARY_OP.
  • Reduced cache size. Some families had their cache size trimmed; the net effect of more variants with smaller caches is modestly higher specialisation coverage with similar code-size cost.
  • Tier-2 cooperation. Tier-2 trace projection reads the specialised opcodes (it cannot project through a generic one). See optimizer.

PEP touchpoints

  • PEP 659. Specializing adaptive interpreter (the design document).
  • PEP 703. Free threading; per-thread bytecode and atomic reads.

Reference

  • Python/specialize.c, Python/bytecodes.c, Include/internal/pycore_code.h, Include/internal/pycore_opcode_metadata.h.
  • PEP 659. Specializing adaptive interpreter.
  • Shannon, Mark. Faster CPython design notes.