Pipeline
The compile pipeline turns a .py source file into a
PyCodeObject and hands the result to the eval loop. The pipeline
is five stages: tokenize, parse, build a symbol table, generate
intermediate code, and assemble. Each stage has its own page in
this group; this page is the high-level map a reader uses to find
the right one.
The interpreter never sees the source text directly. The eval loop
walks co_code (a bytes object holding 16-bit code units) and
consults co_consts, co_names, co_varnames, the exception
table, and the location table. Everything in the pipeline exists
to fill those fields.
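The fields named above can be inspected from Python on any function's code object. The following is a minimal sketch; the function `add` is a hypothetical example, and `co_exceptiontable` is read with `getattr` on the assumption that older interpreters may not expose it.

```python
def add(a, b):
    total = a + b
    print(total)
    return total

code = add.__code__

# co_code is a bytes object; each code unit is two bytes wide.
raw = code.co_code
unit_count = len(raw) // 2

# Name and constant tables that the bytecode indexes into.
summary = {
    "co_consts": code.co_consts,
    "co_names": code.co_names,        # global and attribute names
    "co_varnames": code.co_varnames,  # arguments, then other locals
    # Not every interpreter version exposes the exception table.
    "co_exceptiontable": getattr(code, "co_exceptiontable", b""),
}

for field, value in summary.items():
    print(f"{field:18} {value!r}")
print(f"{'code units':18} {unit_count}")
```

Nothing in these fields refers back to the source text; only the location table ties an instruction to a line.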
Where the code lives
| File | Role | Entry points |
|---|---|---|
| Parser/lexer/lexer.c, lexer/state.h | Tokenizer: byte stream to Token stream. | _PyTokenizer_Get, tok_get, tok_get_normal_mode, tok_get_fstring_mode |
| Parser/parser.c (generated) | PEG parser: tokens to mod_ty AST. | _PyParser_ASTFromString, _PyParser_ASTFromFile |
| Python/Python-ast.c (generated) | AST node constructors. Generated from Parser/Python.asdl. | _PyAST_Module, _PyAST_FunctionDef, ... |
| Python/ast.c | AST validation. Rejects shapes the grammar allows but the language doesn't. | _PyAST_Validate |
| Python/ast_preprocess.c | Constant folding and PEP 649 annotation rewrites on the AST. | _PyAST_Preprocess |
| Python/symtable.c | Two-pass symbol table: collect, then analyse. | _PySymtable_Build, symtable_analyze |
| Python/compile.c | The compiler driver. Owns the per-scope unit and the const cache. | _PyAST_Compile, compile_mod, compile_unit |
| Python/codegen.c | AST-to-pseudo-instruction walk. Macros for ADDOP*. | _PyCodegen_Module, _PyCodegen_Expression, compiler_visit_stmt |
| Python/instruction_sequence.c | The growable array of pseudo-instructions per scope. | _PyInstructionSequence_Addop, _PyInstructionSequence_UseLabel |
| Python/flowgraph.c | Control-flow graph passes: jump threading, dead-block removal, stack-depth analysis. | _PyCfg_FromInstructionSequence, _PyCfg_OptimizeCodeUnit |
| Python/assemble.c | Linearise the CFG, emit co_code, the location table, the exception table. | _PyAssemble_MakeCodeObject |
The pipeline is driven from a single function:
```c
/* Python/compile.c:1478 _PyAST_Compile */
PyCodeObject *
_PyAST_Compile(mod_ty mod, PyObject *filename, PyCompilerFlags *flags,
               int optimize, PyArena *arena);
```
Three public entry points converge on this function:
- Py_CompileStringFlags (Python/pythonrun.c:1719) for the C API;
- PyRun_FileExFlags (Python/pythonrun.c:1306) for python file.py;
- builtin_compile (Python/bltinmodule.c:771) for compile() at the Python level.
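The `optimize` argument in the signature above has a visible counterpart in the `optimize` keyword of the built-in `compile()`. The sketch below assumes that keyword is passed through to the C-level parameter; the source string and the helper `run` are hypothetical.

```python
SOURCE = 'assert False, "stripped when optimizing"\nresult = "ran"\n'

def run(optimize):
    """Compile SOURCE at the given optimization level and execute it."""
    code = compile(SOURCE, "<example>", "exec", optimize=optimize)
    namespace = {}
    try:
        exec(code, namespace)
    except AssertionError:
        return "assertion raised"
    return namespace["result"]

# Level 0 keeps assert statements; level 1 drops them at compile time.
unoptimized = run(0)
optimized = run(1)
print(unoptimized, "/", optimized)
```

The assert is removed while the code object is being built, not skipped at run time, which is why the same source yields two different code objects.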
What each stage produces
```
source bytes
    |
    v   Parser/lexer/lexer.c       _PyTokenizer_Get
Token *tokens
    |
    v   Parser/parser.c            _PyParser_ASTFromString
mod_ty ast
    |
    v   Python/ast.c               _PyAST_Validate
    v   Python/ast_preprocess.c    _PyAST_Preprocess
mod_ty (validated, folded)
    |
    v   Python/symtable.c          _PySymtable_Build
struct symtable *st
    |
    v   Python/compile.c           compile_unit
    v   Python/codegen.c           _PyCodegen_Module
instr_sequence per scope
    |
    v   Python/flowgraph.c         _PyCfg_FromInstructionSequence
    v                              _PyCfg_OptimizeCodeUnit
cfg_builder (optimised)
    |
    v   Python/assemble.c          _PyAssemble_MakeCodeObject
PyCodeObject *
```
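The same stages can be approximated from Python with standard-library modules. This is a rough sketch, not the C call path: `tokenize` and `symtable` are Python-level views of the corresponding stages, and the source string is a hypothetical example.

```python
import ast
import io
import symtable
import tokenize

SOURCE = "def square(n):\n    return n * n\n"

# Stage 1: source text to tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(SOURCE).readline))

# Stage 2: tokens to an AST (ast.parse runs the tokenizer itself).
tree = ast.parse(SOURCE, filename="<example>")

# Stage 3: the symbol table, one entry per scope.
table = symtable.symtable(SOURCE, "<example>", "exec")
function_scope = table.get_children()[0]

# Stages 4 and 5: code generation and assembly, starting from the AST.
code = compile(tree, "<example>", "exec")

namespace = {}
exec(code, namespace)
print(tokens[0].string, type(tree).__name__, function_scope.get_name())
print(namespace["square"](7))
```

Passing the AST rather than the source to `compile()` shows that the later stages need nothing from the earlier ones except the tree.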
Each stage owns its data structure. The tokenizer owns
struct tok_state; the parser owns struct Parser; the AST is a
forest of mod_ty, stmt_ty, expr_ty nodes allocated in the
shared PyArena; the symbol table owns struct symtable and a
tree of PySTEntryObject; the compiler owns a compiler struct and a
stack of struct compiler_unit. AST nodes, and the objects
registered with the arena on their behalf, live in the arena and
are freed when compilation returns.
The arena
Python/pyarena.c provides PyArena, a bump allocator used for
the AST and for the objects the parser attaches to it, such as
identifier strings and constants. The arena is one of the
load-bearing shortcuts in CPython: it lets the parser and compiler
skip per-node refcount discipline and instead free everything in a
single sweep at the end. The symbol table and the instruction
sequences manage their own memory outside the arena. The arena is
passed through every entry point that needs to build AST nodes.
```c
/* Include/internal/pycore_pyarena.h */
PyArena *_PyArena_New(void);
void _PyArena_Free(PyArena *arena);
void *_PyArena_Malloc(PyArena *arena, size_t size);
int _PyArena_AddPyObject(PyArena *arena, PyObject *obj);
```
_PyArena_AddPyObject registers a PyObject * whose lifetime
should match the arena; the arena holds a reference and drops it
on free. This is how strings interned during parsing stay alive
through compilation.
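The ownership rule for registered objects can be modelled in a few lines. This is a toy model only: `ToyArena` and `Node` are hypothetical names, and the class imitates just the reference-holding behaviour described above, not the bump allocation done in C.

```python
import gc
import weakref

class ToyArena:
    """Holds one reference per registered object until free() is called."""

    def __init__(self):
        self._objects = []

    def add_object(self, obj):
        # Mirrors the idea of _PyArena_AddPyObject: the arena keeps obj alive.
        self._objects.append(obj)

    def free(self):
        # Drop every held reference in a single sweep.
        self._objects.clear()

class Node:
    pass

arena = ToyArena()
node = Node()
probe = weakref.ref(node)  # observes the object without keeping it alive

arena.add_object(node)
del node                   # the caller's reference is gone
gc.collect()
alive_while_arena_lives = probe() is not None

arena.free()
gc.collect()
alive_after_free = probe() is not None

print(alive_while_arena_lives, alive_after_free)
```

The caller never has to release the object individually; its lifetime ends with the arena's.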
The compiler unit
Each lexical scope (module, function, class, lambda, comprehension,
PEP 695 type-parameter scope) compiles into a separate
PyCodeObject. The compiler maintains a stack of compiler_unit
structs to track the nested scopes:
```c
/* Python/compile.c:55 compiler_unit */
struct compiler_unit {
    PySTEntryObject *u_ste;
    int u_scope_type;                        /* COMPILE_SCOPE_MODULE, ... */
    instr_sequence *u_instr_sequence;
    _PyCompile_CodeUnitMetadata u_metadata;  /* name, consts, vars */
    PyObject *u_deferred_annotations;        /* PEP 649 */
    int u_nfblocks;
};
```
When the compiler enters a nested scope (a function body, say) it
pushes a fresh compiler_unit, walks the body, assembles the
inner PyCodeObject, then pops back and emits a LOAD_CONST of
the inner code object in the outer unit's instruction sequence.
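The nesting is visible in the finished code objects: each inner code object sits in the `co_consts` of the scope that encloses it. The sketch below walks that tree; the source string and the helper `walk` are hypothetical, and the sample avoids comprehensions because newer interpreters may inline them.

```python
import dis
import types

SOURCE = """\
def outer(x):
    def inner(y):
        return x + y
    return inner

class K:
    f = lambda self: 1
"""

def walk(code, depth=0):
    """Yield (depth, name) for a code object and every code object nested in it."""
    yield depth, code.co_name
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk(const, depth + 1)

module_code = compile(SOURCE, "<example>", "exec")
scopes = list(walk(module_code))

# True if the module loads at least one inner code object as a constant.
loads_inner_code = any(
    ins.opname == "LOAD_CONST" and isinstance(ins.argval, types.CodeType)
    for ins in dis.get_instructions(module_code)
)

for depth, name in scopes:
    print("  " * depth + name)
```

Each level of indentation in the output corresponds to one push on the compiler's unit stack.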
Pseudo-ops, real ops, the CFG
The codegen stage emits pseudo-instructions into the unit's
instr_sequence. Pseudo-ops are real opcodes plus synthetic ones
that the assembler resolves later: jump labels (not yet bound to
offsets), inserted EXTENDED_ARG placeholders, and PEP 626
location anchors. The flowgraph stage breaks the pseudo-op
sequence into basic blocks and runs the optimisation passes. The
assembler walks the optimised CFG, resolves labels to offsets,
encodes the location and exception tables, and produces the
bytecode bytes that go into co_code.
The pseudo-op encoding is the same as the real one
(_PyInstruction with an opcode and oparg), so a single
representation walks every pass.
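The effect of label resolution can be checked on the finished bytecode: every jump refers to a concrete offset inside `co_code`. The following sketch uses the `dis` module; the function `count_odd` is a hypothetical example, and the exact opcodes it compiles to vary between interpreter versions.

```python
import dis

def count_odd(n):
    total = 0
    for i in range(n):
        if i % 2:
            total += 1
    return total

code = count_odd.__code__

# Opcodes that carry a jump target, relative or absolute.
jump_opcodes = set(dis.hasjrel) | set(dis.hasjabs)

# dis resolves each jump argument to the byte offset of its target.
jumps = [
    (ins.opname, ins.offset, ins.argval)
    for ins in dis.get_instructions(code)
    if ins.opcode in jump_opcodes
]

# Every target is an aligned offset inside the bytecode.
targets_valid = all(
    isinstance(target, int)
    and 0 <= target < len(code.co_code)
    and target % 2 == 0
    for _name, _offset, target in jumps
)

for name, offset, target in jumps:
    print(f"{offset:4} {name:24} -> {target}")
print("stack size:", code.co_stacksize)
```

`co_stacksize` is the result of the stack-depth analysis mentioned in the file table; it is computed once at compile time so the eval loop can size the frame up front.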
PEP touchpoints
- PEP 617. A PEG parser replaced the LL(1) parser in 3.9. The parser lives in Parser/parser.c, generated from Parser/Python.gram by Tools/peg_generator/. See parser.
- PEP 626. Every reachable instruction carries a source location. The location table is emitted in Python/assemble.c. See compile.
- PEP 657. Location entries carry start and end columns in addition to line numbers, so tracebacks can point at a sub-expression. The exception table, which maps bytecode ranges to handlers with stack depth and the lasti flag, is a separate structure encoded in Python/assemble.c assemble_exception_table. See compile.
- PEP 695. Generic type parameters introduce TypeParametersBlock symbol-table scopes. See symtable.
- PEP 649. Annotations are not evaluated until __annotate__() is called. The compiler buffers annotation expressions in u_deferred_annotations. See compile.
Reading order
Read parser for tokenizing and parsing, ast for validation and
preprocessing, symtable for the symbol table, and compile for
everything after that (the codegen, the CFG passes, and the
assembler all live together in modern CPython and are easier to
learn as one unit). The output, a PyCodeObject, is the input to
the vm.
Reference
- Parser/lexer/lexer.c, Parser/parser.c, Python/compile.c, Python/codegen.c, Python/symtable.c, Python/flowgraph.c, Python/assemble.c.
- PEP 617. New PEG parser for CPython.
- PEP 626. Precise line numbers for debugging and other tools.
- PEP 657. Including fine-grained error locations in tracebacks.
- PEP 695. Type parameter syntax.
- PEP 649. Deferred evaluation of annotations using descriptors.