Skip to main content

Parser

Parsing turns source bytes into a mod_ty AST. The work happens in two passes that are wired together but conceptually distinct: the tokenizer reads bytes and emits a Token stream; the parser consumes the token stream and builds the AST. Both stages live under Parser/. The parser is generated from a PEG grammar by a Python tool that runs at build time.

Where the code lives

FileRoleEntry points
Parser/lexer/lexer.cThe tokenizer state machine. Normal mode and f-string mode._PyTokenizer_Get, tok_get, tok_get_normal_mode, tok_get_fstring_mode
Parser/lexer/state.htok_state, tokenizer_mode. Buffer pointers, indent stack, paren stack, f-string stack.struct tok_state
Parser/lexer/buffer.cBuffer management: dynamic resize, CRLF normalisation, line tracking._PyLexer_tok_underflow_string, tok_underflow_file
Parser/pegen.cPEG runtime: memoisation, error recovery, AST node helpers._PyPegen_parse, _PyPegen_run_parser, _PyPegen_insert_memo
Parser/parser.c (generated)The grammar as a recursive-descent function per rule.file_rule, expression_rule, ...
Parser/Python.gramThe PEG grammar source.(input to Tools/peg_generator)
Tools/peg_generator/pegen/c_generator.pyThe grammar-to-C generator.make regen-pegen
Parser/Python.asdlThe AST definition.(input to Parser/asdl_c.py)

The public entry points are:

/* Include/internal/pycore_parser.h */
mod_ty _PyParser_ASTFromString(const char *str, PyObject *filename,
int mode, PyCompilerFlags *flags,
PyArena *arena);
mod_ty _PyParser_ASTFromFile(FILE *fp, PyObject *filename, const char *enc,
int mode, ..., PyArena *arena);

Both call _PyPegen_run_parser which owns the tokenizer and the Parser struct for the duration of the parse.

The tokenizer

The tokenizer is a state machine keyed on a tok_state struct. The struct holds a sliding window into the source and a couple of explicit stacks:

/* Parser/lexer/state.h:74 tok_state */
struct tok_state {
char *buf; /* malloc'd input buffer */
char *cur; /* current read position */
char *inp; /* end of valid data in buf */
int indent; /* current indent level (in spaces) */
int indstack[MAXINDENT]; /* indent stack (max depth 100) */
int level; /* () [] {} nesting depth */
char parenstack[MAXLEVEL]; /* paren kind stack (max 200) */
tokenizer_mode tok_mode_stack[MAXFSTRINGLEVEL]; /* f-string stack */
int tok_mode_stack_index;
int lineno;
int col_offset;
int starting_col_offset;
int filename_lineno;
PyObject *filename;
/* ... */
};

buf holds a chunk of the source. In file mode the buffer is refilled as the tokenizer advances; in string mode the whole source lives in buf once. CRLF is normalised to LF in the buffer, so the tokenizer itself sees only LF.

The dispatcher tok_get (Parser/lexer/lexer.c:1616) checks the top of the mode stack and routes to either tok_get_normal_mode or tok_get_fstring_mode. The normal-mode tokenizer handles indentation, operators, keywords, numbers, ordinary strings, and the NEWLINE / INDENT / DEDENT synthetic tokens. The f-string-mode tokenizer is a separate state machine that tracks the f-string's quote style and the nesting of expression substitutions.

Indentation

The indent stack tracks the column offsets of currently-open blocks. When a logical line begins (after a NEWLINE outside any brackets), the tokenizer measures the leading whitespace and compares against the stack:

  • Greater than the top: push the new column, emit INDENT.
  • Equal to the top: emit no indentation token.
  • Less than the top: pop entries from the stack until the measurement matches one of them; emit one DEDENT per pop. If no match is found, raise IndentationError.

The bracketing depth level suppresses the indentation machinery: inside (, [, or {, line breaks emit NL (non-logical newline) rather than NEWLINE, and indentation is not measured.

F-strings

F-strings are tokenized through a separate mode because the body of an f-string interleaves literal chunks, replacement fields, and optional format specifications, each of which has its own tokenization rules:

/* Parser/lexer/state.h:48 tokenizer_mode */
typedef struct {
int kind; /* TOK_REGULAR_MODE / TOK_FSTRING_MODE */
int curly_bracket_depth; /* { } depth inside the f-string */
int curly_bracket_expr_start_depth; /* depth when expression started */
Py_UCS4 quote; /* ' " */
int quote_size; /* 1 or 3 */
int string_kind; /* FSTRING or TSTRING (3.14 t-strings) */
const char *start;
const char *multi_line_start;
int start_lineno;
int last_expr_size;
char *last_expr_buffer;
int f_string_debug; /* {expr=} forms */
} tokenizer_mode;

When tok_get_fstring_mode encounters { (not {{), it pushes a fresh regular-mode entry onto the stack and tokenizes the expression normally. The matching } pops back to the f-string mode. Format specifications introduce another nested mode. CPython 3.14 also adds t-strings (template strings); the string_kind field distinguishes them from f-strings.

The PEG parser

CPython has used a PEG parser since 3.9 (PEP 617). The grammar is written in a custom EBNF-like notation in Parser/Python.gram and compiled to C by Tools/peg_generator/pegen/c_generator.py. The generator emits one C function per rule into Parser/parser.c.

The Parser struct

/* Parser/pegen.h:54 Parser */
typedef struct {
struct tok_state *tok;
Token **tokens; /* lookahead buffer */
int mark; /* current token index */
int fill, size; /* token buffer bookkeeping */
PyArena *arena;
int start_rule; /* file_rule, eval_rule, fstring_rule, ... */
Memo *memo; /* per-rule memo */
int error_indicator;
int flags;
int feature_version;
int level; /* recursion depth */
location last_stmt_location;
} Parser;

The token buffer grows as the tokenizer is asked for more lookahead. mark is the cursor; mark rewinds on backtrack. The parser never re-tokenizes; once a token is in the buffer it stays.

Rule functions

Each rule generates a function shaped like this:

/* Parser/parser.c (generated): expression_rule */
static expr_ty expression_rule(Parser *p) {
int _start_mark = p->mark;
expr_ty _res = NULL;
if (_PyPegen_is_memoized(p, expression_rule, &_res)) return _res;
/* alternative 1: try; on success, memoise and return */
/* alternative 2: rewind p->mark; try */
/* ... */
_PyPegen_insert_memo(p, _start_mark, expression_rule, _res);
return _res;
}

Memoisation makes the PEG parser linear in input size. Without it the same prefix could be re-tried at exponential cost when an alternative fails. The memo is keyed on (rule, mark); the value is the result of the parse plus the resulting mark.

Left recursion

The grammar uses left-recursive rules for left-associative binary operators. The generator handles them by seeding the memo with a failure, parsing as much as it can, then iteratively extending the result until the parse no longer grows. The algorithm is described in Left Recursion in Parsing Expression Grammars (Medeiros, Mascarenhas, Ierusalimschy, 2012); Tools/peg_generator/pegen/grammar_visualizer.py is a useful companion when reading the grammar.

Error recovery

PEG parsers do not produce great error messages by default; a failed parse usually backtracks all the way to the start. CPython mitigates this with invalid alternatives in the grammar that match the most common malformed forms and raise targeted errors:

invalid_expression:
| !(NAME STRING) a=disjunction expression_without_invalid {
RAISE_SYNTAX_ERROR_KNOWN_LOCATION(a, ...) }

The convention is that invalid rules run only on a failed primary parse, so the success path never pays the cost of invalid-rule matching.

AST construction

Rule actions construct AST nodes via the generated constructors in Python/Python-ast.c. The constructors are thin wrappers around _PyArena_Malloc plus assignment of the per-node fields, including the source location quadruple (lineno, col_offset, end_lineno, end_col_offset). The arena lifetime owns every node; nothing is refcounted.

The mod_ty returned by the top-level rule is the root of the tree. The four entry shapes are Module (top-level script), Interactive (one REPL statement), Expression (eval-mode), and FunctionType (PEP 484 function-type comments).

Build-time generation

Parser/parser.c is generated. Edit Parser/Python.gram and run make regen-pegen to regenerate. The generator lives under Tools/peg_generator/; it is invoked during the CPython build and ships in source releases (the generated .c files ship checked-in to avoid a circular bootstrap).

Include/internal/pycore_ast.h and Python/Python-ast.c are generated from Parser/Python.asdl by Parser/asdl_c.py. The ASDL file is the authoritative description of the AST shape; the generator emits the mod_ty / stmt_ty / expr_ty unions, the constructor helpers, and the visitor scaffolding the compiler uses. See ast for the AST shape itself.

CPython 3.14 changes

  • PEP 750. Template strings (t"..."). A new tokenizer_mode.string_kind value (TSTRING) distinguishes them from f-strings. AST nodes Interpolation and TemplateStr hold the resulting structure.
  • Error message refinements. Parser/lexer/lexer.c and the invalid_* grammar rules have grown several targeted messages in 3.14: invalid f-string conversions, missing : in inline statements, and a clearer message for parenthesised context managers.

PEP touchpoints

  • PEP 617. New PEG parser for CPython.
  • PEP 526. Variable annotations: parser produces AnnAssign.
  • PEP 695. Type parameter syntax: parser produces type_param nodes.
  • PEP 750. Template strings.

Reference

  • Parser/lexer/lexer.c, Parser/lexer/state.h, Parser/pegen.c, Parser/pegen.h, Parser/parser.c (generated), Parser/Python.gram.
  • Tools/peg_generator/.
  • PEP 617. New PEG parser for CPython.
  • PEP 750. Template strings.
  • Medeiros, Mascarenhas, Ierusalimschy. Left Recursion in Parsing Expression Grammars. 2012.