Build a Programming Language Lexer
A lexer is the first stage of many programming language tools.
Build a Programming Language Lexer
A lexer is the first stage of many programming language tools.
Its job is simple:
input text
↓
sequence of tokens
For example, this source code:
let x = 42 + y;
becomes:
KeywordLet
Identifier("x")
Equal
Number("42")
Plus
Identifier("y")
Semicolon
A lexer does not understand meaning yet. It only splits text into pieces.
Parsers, compilers, interpreters, formatters, syntax highlighters, and linters usually begin with lexing.
The Goal
We will build a lexer for a tiny language.
It will support:
identifiers
numbers
keywords
operators
punctuation
strings
whitespace skipping
comments
Input:
let answer = 42;
print(answer);
Output:
KeywordLet
Identifier(answer)
Equal
Number(42)
Semicolon
Identifier(print)
LeftParen
Identifier(answer)
RightParen
Semicolon
EOF
What Is a Token
A token has:
kind
text
position
Example:
Identifier("answer")
The token kind is:
Identifier
The text is:
answer
The lexer keeps the original text slice because later stages may need it.
Token Types
Start with an enum:
const TokenKind = enum {
eof,
identifier,
number,
string,
keyword_let,
keyword_if,
keyword_else,
keyword_return,
plus,
minus,
star,
slash,
equal,
left_paren,
right_paren,
left_brace,
right_brace,
comma,
semicolon,
};
This defines every token the lexer can produce.
Token Struct
Now define the token itself:
const Token = struct {
kind: TokenKind,
text: []const u8,
line: usize,
column: usize,
};
The lexer stores line and column numbers for error reporting.
Example:
unexpected token at line 3 column 14
Without positions, compiler errors become hard to understand.
The Lexer State
A lexer walks through text one character at a time.
const Lexer = struct {
input: []const u8,
index: usize,
line: usize,
column: usize,
};
The lexer tracks:
current byte index
current line
current column
Initialize the Lexer
Add:
fn init(input: []const u8) Lexer {
return .{
.input = input,
.index = 0,
.line = 1,
.column = 1,
};
}
We start at line 1, column 1.
Peek and Advance
The lexer needs helper functions.
fn peek(self: *Lexer) ?u8 {
if (self.index >= self.input.len) {
return null;
}
return self.input[self.index];
}
peek looks at the current character without moving.
Now add advance:
fn advance(self: *Lexer) ?u8 {
const ch = self.peek() orelse return null;
self.index += 1;
if (ch == '\n') {
self.line += 1;
self.column = 1;
} else {
self.column += 1;
}
return ch;
}
This moves the lexer forward.
Notice:
if (ch == '\n')
Newlines update both line and column counters.
Skip Whitespace
Programming languages usually ignore spaces and tabs between tokens.
fn skipWhitespace(self: *Lexer) void {
while (self.peek()) |ch| {
switch (ch) {
' ', '\t', '\r', '\n' => _ = self.advance(),
else => return,
}
}
}
This consumes whitespace until a non-whitespace character appears.
Create Tokens
Add a helper:
fn makeToken(
self: *Lexer,
kind: TokenKind,
start: usize,
end: usize,
line: usize,
column: usize,
) Token {
_ = self;
return Token{
.kind = kind,
.text = self.input[start..end],
.line = line,
.column = column,
};
}
A token references a slice of the original input.
The lexer does not allocate new strings.
Lexing Identifiers
Identifiers look like:
x
answer
my_variable
Keywords also start as identifiers.
Add helpers:
fn isIdentifierStart(ch: u8) bool {
return std.ascii.isAlphabetic(ch) or ch == '_';
}
fn isIdentifierContinue(ch: u8) bool {
return std.ascii.isAlphanumeric(ch) or ch == '_';
}
Now add identifier lexing:
fn lexIdentifier(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (!isIdentifierContinue(ch)) {
break;
}
_ = self.advance();
}
const text = self.input[start..self.index];
const kind = keywordKind(text) orelse .identifier;
return Token{
.kind = kind,
.text = text,
.line = line,
.column = column,
};
}
Recognizing Keywords
Keywords are reserved words:
let
if
else
return
Add:
fn keywordKind(text: []const u8) ?TokenKind {
if (std.mem.eql(u8, text, "let")) return .keyword_let;
if (std.mem.eql(u8, text, "if")) return .keyword_if;
if (std.mem.eql(u8, text, "else")) return .keyword_else;
if (std.mem.eql(u8, text, "return")) return .keyword_return;
return null;
}
If the text matches a keyword, the token becomes a keyword token.
Otherwise it stays an identifier.
Lexing Numbers
Numbers are sequences of digits.
fn lexNumber(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
while (self.peek()) |ch| {
if (!std.ascii.isDigit(ch)) {
break;
}
_ = self.advance();
}
return Token{
.kind = .number,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
This accepts:
0
123
9999
This first lexer only supports integers.
Lexing Strings
Strings begin and end with quotes.
fn lexString(self: *Lexer) !Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (ch == '"') {
_ = self.advance();
return Token{
.kind = .string,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
if (ch == '\n') {
return error.UnterminatedString;
}
_ = self.advance();
}
return error.UnterminatedString;
}
This lexer keeps the quotes inside the token text:
"hello"
Later stages can remove them if needed.
Single Character Tokens
Operators and punctuation are simpler.
'+' -> plus
'-' -> minus
'(' -> left_paren
';' -> semicolon
We can lex them directly.
Lexing Comments
Add support for line comments:
// this is a comment
Add:
fn skipComment(self: *Lexer) void {
while (self.peek()) |ch| {
if (ch == '\n') {
return;
}
_ = self.advance();
}
}
The Main Lex Function
Now combine everything.
fn nextToken(self: *Lexer) !Token {
while (true) {
self.skipWhitespace();
const start = self.index;
const line = self.line;
const column = self.column;
const ch = self.peek() orelse {
return Token{
.kind = .eof,
.text = "",
.line = line,
.column = column,
};
};
if (isIdentifierStart(ch)) {
return self.lexIdentifier();
}
if (std.ascii.isDigit(ch)) {
return self.lexNumber();
}
switch (ch) {
'"' => return try self.lexString(),
'/brain/' => {
_ = self.advance();
if (self.peek() == '/brain/') {
_ = self.advance();
self.skipComment();
continue;
}
return Token{
.kind = .slash,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'+' => {
_ = self.advance();
return Token{
.kind = .plus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'-' => {
_ = self.advance();
return Token{
.kind = .minus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'*' => {
_ = self.advance();
return Token{
.kind = .star,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'=' => {
_ = self.advance();
return Token{
.kind = .equal,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'(' => {
_ = self.advance();
return Token{
.kind = .left_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
')' => {
_ = self.advance();
return Token{
.kind = .right_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'{' => {
_ = self.advance();
return Token{
.kind = .left_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'}' => {
_ = self.advance();
return Token{
.kind = .right_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
',' => {
_ = self.advance();
return Token{
.kind = .comma,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
';' => {
_ = self.advance();
return Token{
.kind = .semicolon,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
else => return error.InvalidCharacter,
}
}
}
Complete Program
Put this in src/main.zig:
const std = @import("std");
const TokenKind = enum {
eof,
identifier,
number,
string,
keyword_let,
keyword_if,
keyword_else,
keyword_return,
plus,
minus,
star,
slash,
equal,
left_paren,
right_paren,
left_brace,
right_brace,
comma,
semicolon,
};
const Token = struct {
kind: TokenKind,
text: []const u8,
line: usize,
column: usize,
};
const Lexer = struct {
input: []const u8,
index: usize,
line: usize,
column: usize,
fn init(input: []const u8) Lexer {
return .{
.input = input,
.index = 0,
.line = 1,
.column = 1,
};
}
fn peek(self: *Lexer) ?u8 {
if (self.index >= self.input.len) {
return null;
}
return self.input[self.index];
}
fn advance(self: *Lexer) ?u8 {
const ch = self.peek() orelse return null;
self.index += 1;
if (ch == '\n') {
self.line += 1;
self.column = 1;
} else {
self.column += 1;
}
return ch;
}
fn skipWhitespace(self: *Lexer) void {
while (self.peek()) |ch| {
switch (ch) {
' ', '\t', '\r', '\n' => _ = self.advance(),
else => return,
}
}
}
fn skipComment(self: *Lexer) void {
while (self.peek()) |ch| {
if (ch == '\n') {
return;
}
_ = self.advance();
}
}
fn lexIdentifier(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (!isIdentifierContinue(ch)) {
break;
}
_ = self.advance();
}
const text = self.input[start..self.index];
return Token{
.kind = keywordKind(text) orelse .identifier,
.text = text,
.line = line,
.column = column,
};
}
fn lexNumber(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
while (self.peek()) |ch| {
if (!std.ascii.isDigit(ch)) {
break;
}
_ = self.advance();
}
return Token{
.kind = .number,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
fn lexString(self: *Lexer) !Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (ch == '"') {
_ = self.advance();
return Token{
.kind = .string,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
if (ch == '\n') {
return error.UnterminatedString;
}
_ = self.advance();
}
return error.UnterminatedString;
}
fn nextToken(self: *Lexer) !Token {
while (true) {
self.skipWhitespace();
const start = self.index;
const line = self.line;
const column = self.column;
const ch = self.peek() orelse {
return Token{
.kind = .eof,
.text = "",
.line = line,
.column = column,
};
};
if (isIdentifierStart(ch)) {
return self.lexIdentifier();
}
if (std.ascii.isDigit(ch)) {
return self.lexNumber();
}
switch (ch) {
'"' => return try self.lexString(),
'/brain/' => {
_ = self.advance();
if (self.peek() == '/brain/') {
_ = self.advance();
self.skipComment();
continue;
}
return Token{
.kind = .slash,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'+' => {
_ = self.advance();
return Token{
.kind = .plus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'-' => {
_ = self.advance();
return Token{
.kind = .minus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'*' => {
_ = self.advance();
return Token{
.kind = .star,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'=' => {
_ = self.advance();
return Token{
.kind = .equal,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'(' => {
_ = self.advance();
return Token{
.kind = .left_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
')' => {
_ = self.advance();
return Token{
.kind = .right_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'{' => {
_ = self.advance();
return Token{
.kind = .left_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'}' => {
_ = self.advance();
return Token{
.kind = .right_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
',' => {
_ = self.advance();
return Token{
.kind = .comma,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
';' => {
_ = self.advance();
return Token{
.kind = .semicolon,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
else => return error.InvalidCharacter,
}
}
}
};
fn isIdentifierStart(ch: u8) bool {
return std.ascii.isAlphabetic(ch) or ch == '_';
}
fn isIdentifierContinue(ch: u8) bool {
return std.ascii.isAlphanumeric(ch) or ch == '_';
}
fn keywordKind(text: []const u8) ?TokenKind {
if (std.mem.eql(u8, text, "let")) return .keyword_let;
if (std.mem.eql(u8, text, "if")) return .keyword_if;
if (std.mem.eql(u8, text, "else")) return .keyword_else;
if (std.mem.eql(u8, text, "return")) return .keyword_return;
return null;
}
pub fn main() !void {
const source =
\\let answer = 42;
\\print(answer);
\\
\\// this is a comment
\\if answer {
\\ return "done";
\\}
;
var lexer = Lexer.init(source);
while (true) {
const token = try lexer.nextToken();
std.debug.print(
"{s:<20} text={s:<12} line={d} column={d}\n",
.{
@tagName(token.kind),
token.text,
token.line,
token.column,
},
);
if (token.kind == .eof) {
break;
}
}
}
Run:
zig build run
Example output:
keyword_let text=let line=1 column=1
identifier text=answer line=1 column=5
equal text== line=1 column=12
number text=42 line=1 column=14
semicolon text=; line=1 column=16
identifier text=print line=2 column=1
left_paren text=( line=2 column=6
identifier text=answer line=2 column=7
right_paren text=) line=2 column=13
semicolon text=; line=2 column=14
Why Lexers Usually Use Slices
Notice that tokens store:
text: []const u8
The lexer does not allocate new strings for every token.
Instead, each token references the original source text.
That is efficient because lexers create many tokens.
A parser for a large file might produce hundreds of thousands of tokens.
Avoiding allocations matters.
Why Comments Are Removed Here
The lexer skips comments:
self.skipComment();
continue;
So comments never become tokens.
That is common in compilers because comments usually do not affect execution.
Some tools preserve comments:
formatters
documentation generators
IDEs
Those tools often emit comment tokens too.
Why Line and Column Tracking Matters
Without positions, compiler errors become frustrating.
This is bad:
syntax error
This is much better:
syntax error at line 12 column 7
Position tracking begins in the lexer because the lexer sees every character.
What You Learned
You built a lexer for a small programming language.
You scanned identifiers, keywords, numbers, strings, operators, and punctuation.
You skipped whitespace and comments.
You tracked line and column positions.
You returned tokens as slices into the original source.
This is the first stage of many language tools. Parsers, compilers, interpreters, linters, and syntax highlighters usually start with lexing.