AST
The AST is the bridge between the parser and the compiler. The
parser builds it; the validator rejects shapes the grammar accepts
but the language does not; the preprocess pass folds constants and
rewrites annotations; the compiler walks it to emit code. The
shape of every node is described once, in
Parser/Python.asdl, and a build-time tool generates the C
declarations, constructors, and node visitors.
Where the code lives
| File | Role | Entry points |
|---|---|---|
Parser/Python.asdl | The AST grammar (Abstract Syntax Definition Language). | (input) |
Parser/asdl_c.py | The ASDL-to-C generator. Run by make regen-ast. | make regen-ast |
Include/internal/pycore_ast.h | Generated: typedefs and constructor declarations. | mod_ty, stmt_ty, expr_ty |
Python/Python-ast.c | Generated: constructor bodies, repr, copy-into-Python helpers. | _PyAST_Module, _PyAST_FunctionDef, ... |
Python/ast.c | AST validation: rejects malformed shapes. | _PyAST_Validate |
Python/ast_preprocess.c | Constant folding and PEP 649 annotation rewrites. | _PyAST_Preprocess |
Python/pyarena.c | The arena allocator that owns AST nodes. | _PyArena_New, _PyArena_Malloc |
The ASDL grammar
ASDL is a small DSL for describing recursive sum-of-product types.
The Python AST is described in roughly 200 lines of Python.asdl
that look like this:
module Python {
mod = Module(stmt* body, type_ignore* type_ignores)
| Interactive(stmt* body)
| Expression(expr body)
| FunctionType(expr* argtypes, expr returns)
stmt = FunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list,
expr? returns, string? type_comment,
type_param* type_params)
| ClassDef(identifier name, expr* bases, keyword* keywords,
stmt* body, expr* decorator_list, type_param* type_params)
| Return(expr? value)
| ...
attributes (int lineno, int col_offset,
int? end_lineno, int? end_col_offset)
}
The asdl declarations name the constructors (each |
alternative) and their fields. The * suffix is "zero or more"
(generated as a asdl_*_seq), ? is "zero or one" (generated as
a pointer that may be NULL). The attributes clause adds fields
to every constructor of the sum.
The generated types
Parser/asdl_c.py emits a discriminated union for each ASDL sum:
/* Include/internal/pycore_ast.h (generated): stmt_ty */
struct _stmt {
enum _stmt_kind kind;
union {
struct {
identifier name;
arguments_ty args;
asdl_stmt_seq *body;
asdl_expr_seq *decorator_list;
expr_ty returns;
string type_comment;
asdl_type_param_seq *type_params;
} FunctionDef;
struct { /* ClassDef fields */ } ClassDef;
/* ... */
} v;
int lineno;
int col_offset;
int end_lineno;
int end_col_offset;
};
typedef struct _stmt *stmt_ty;
Every constructor has a matching factory:
/* Python/Python-ast.c (generated): _PyAST_FunctionDef */
stmt_ty _PyAST_FunctionDef(identifier name, arguments_ty args,
asdl_stmt_seq *body, asdl_expr_seq *decorator_list,
expr_ty returns, string type_comment,
asdl_type_param_seq *type_params,
int lineno, int col_offset,
int end_lineno, int end_col_offset,
PyArena *arena);
The factory allocates a stmt_ty from the arena, sets kind to
FunctionDef_kind, fills in v.FunctionDef, and copies the four
location fields. The factory is what the parser's grammar actions
call.
Source locations
Every node carries a location quadruple: the start line and
column, plus the end line and column (PEP 626). The end location
is what traceback uses for PEP 657 fine-grained error squiggles.
Locations are stored as plain int fields; the parser fills them
from the corresponding Token ranges, and the compiler threads
them through to the location table on the final code object.
The column fields are byte offsets, not character offsets. This matters for tooling that consumes them: a column of 4 in a line that begins with a 2-byte UTF-8 character is two characters in, not four.
The arena
AST nodes never live in the refcounted heap. They are allocated
from a PyArena, a bump-allocator that lives for the duration of
compilation:
/* Python/pyarena.c */
PyArena *_PyArena_New(void);
void _PyArena_Free(PyArena *arena);
void *_PyArena_Malloc(PyArena *arena, size_t size);
int _PyArena_AddPyObject(PyArena *arena, PyObject *obj);
_PyArena_AddPyObject is the bridge for the few PyObject *
values that live inside AST nodes (identifiers, strings, byte
literals). The arena owns a reference; on free, the reference is
released. The arena thus subsumes both raw memory and the
refcounts of any objects it pins.
The arena is the load-bearing reason CPython's parser can ignore
refcount discipline: the alternative would be Py_INCREF /
Py_DECREF for every node, every list, every literal, with the
matching cleanup on every error path.
AST validation
The parser produces ASTs the grammar accepts. The validator rejects ASTs that the language does not.
/* Python/ast.c _PyAST_Validate */
int _PyAST_Validate(mod_ty mod);
Examples of validator-only checks:
awaitoutsideasync def(when the function is not flaggedasync).- Starred expressions in invalid positions (the grammar allows
*xin many contexts that the language forbids). yieldin the top-level expression of a module.- Augmented assignment with a tuple, list, or starred target.
- Duplicate keyword arguments in a call.
Validation is a single pass; on failure it raises SyntaxError
with the offending node's location.
The preprocess pass
The validated AST passes through one more stage before codegen:
/* Python/ast_preprocess.c:971 _PyAST_Preprocess */
int _PyAST_Preprocess(mod_ty mod, PyArena *arena,
PyObject *filename, int optimize, int features);
_PyAST_Preprocess runs three rewrites:
- Constant folding.
1 + 2becomesConstant(3). The fold runs on every binary operator, unary operator, subscript, and conditional expression where every operand is aConstant. - Deferred annotation rewriting. With
from __future__ import annotationsor PEP 649 active, annotations are wrapped or stringified rather than evaluated at function-definition time. - PEP 765 control-flow warnings. A
returnorbreakin afinallyblock triggers aSyntaxWarningat compile time.
The preprocess pass rewrites the AST in place; nodes can be replaced because the arena owns all of them.
Walking the AST
The compiler walks the AST top-down once per scope. The walk uses
visitor macros defined in Python/codegen.c:
#define VISIT(C, TYPE, V) \
do { if (compiler_visit_ ## TYPE((C), (V)) == ERROR) return ERROR; } while (0)
Generic visitors (compiler_visit_stmt, compiler_visit_expr)
dispatch on kind and call the per-constructor visitor
(compiler_function, compiler_call, ...). The visitor for each
constructor reads the fields of v.<Constructor> and emits the
corresponding pseudo-instructions. See compile for the
emit side.
ASDL sequences
asdl_*_seq is a small flat array type generated for each ASDL
sum that is mentioned with * in the grammar:
typedef struct {
Py_ssize_t size;
void *elements[1];
} asdl_seq;
typedef struct {
Py_ssize_t size;
stmt_ty typed_elements[1];
} asdl_stmt_seq;
The macros asdl_seq_LEN, asdl_seq_GET, asdl_seq_SET are the
read/write API. Sequences are arena-allocated like nodes.
CPython 3.14 additions
TemplateStrandInterpolationnodes. Added for PEP 750 t-strings. The shape mirrorsJoinedStr/FormattedValuebut carries the template machinery information through to codegen.type_paramnodes for PEP 695.TypeVar,TypeVarTuple, andParamSpecconstructors live in thetype_paramsum and appear onFunctionDef,ClassDef, andTypeAlias.TypeAliasstatement. The newtype X = Ysyntax produces aTypeAlias(name, type_params, value)node.- PEP 649 deferred annotations. The preprocess pass now wraps
annotations into
__annotate__functions instead of inlining them; the AST shape is unchanged but the emitted code is different.
PEP touchpoints
- PEP 484, 526, 591, 695. Annotations and type parameters.
- PEP 617. The PEG parser produces the AST shape consumed by the rest of the pipeline.
- PEP 626. Source locations are stored on every node.
- PEP 649. Deferred annotation evaluation.
- PEP 657. End line and end column locations.
- PEP 750. Template strings.
Reference
Parser/Python.asdl,Parser/asdl_c.py,Include/internal/pycore_ast.h,Python/Python-ast.c,Python/ast.c,Python/ast_preprocess.c,Python/pyarena.c.- Wang, D. C. et al. The Zephyr Abstract Syntax Description Language. (The ASDL CPython uses descends from Zephyr ASDL.)
- PEP 339. Design of the CPython compiler.