Build a Programming Language Lexer

A lexer is the first stage of many programming language tools.

Its job is simple:

input text
    ↓
sequence of tokens

For example, this source code:

let x = 42 + y;

becomes:

KeywordLet
Identifier("x")
Equal
Number("42")
Plus
Identifier("y")
Semicolon

A lexer does not understand meaning yet. It only splits text into pieces.

Parsers, compilers, interpreters, formatters, syntax highlighters, and linters usually begin with lexing.

The Goal

We will build a lexer for a tiny language.

It will support:

identifiers
numbers
keywords
operators
punctuation
strings
whitespace skipping
comments

Input:

let answer = 42;
print(answer);

Output:

KeywordLet
Identifier(answer)
Equal
Number(42)
Semicolon
Identifier(print)
LeftParen
Identifier(answer)
RightParen
Semicolon
EOF

What Is a Token

A token has:

kind
text
position

Example:

Identifier("answer")

The token kind is:

Identifier

The text is:

answer

The lexer keeps the original text slice because later stages may need it.

Token Types

Start with an enum:

const TokenKind = enum {
    eof,

    identifier,
    number,
    string,

    keyword_let,
    keyword_if,
    keyword_else,
    keyword_return,

    plus,
    minus,
    star,
    slash,
    equal,

    left_paren,
    right_paren,
    left_brace,
    right_brace,

    comma,
    semicolon,
};

This defines every token the lexer can produce.

Token Struct

Now define the token itself:

const Token = struct {
    kind: TokenKind,
    text: []const u8,
    line: usize,
    column: usize,
};

The lexer stores line and column numbers for error reporting.

Example:

unexpected token at line 3 column 14

Without positions, compiler errors become hard to understand.

The Lexer State

A lexer walks through text one character at a time.

const Lexer = struct {
    input: []const u8,
    index: usize,
    line: usize,
    column: usize,
};

The lexer tracks:

current byte index
current line
current column

Initialize the Lexer

Add:

fn init(input: []const u8) Lexer {
    return .{
        .input = input,
        .index = 0,
        .line = 1,
        .column = 1,
    };
}

We start at line 1, column 1.

Peek and Advance

The lexer needs helper functions.

fn peek(self: *Lexer) ?u8 {
    if (self.index >= self.input.len) {
        return null;
    }

    return self.input[self.index];
}

peek looks at the current character without moving.

Now add advance:

fn advance(self: *Lexer) ?u8 {
    const ch = self.peek() orelse return null;

    self.index += 1;

    if (ch == '\n') {
        self.line += 1;
        self.column = 1;
    } else {
        self.column += 1;
    }

    return ch;
}

This moves the lexer forward.

Notice:

if (ch == '\n')

Newlines update both line and column counters.

Skip Whitespace

Programming languages usually ignore spaces and tabs between tokens.

fn skipWhitespace(self: *Lexer) void {
    while (self.peek()) |ch| {
        switch (ch) {
            ' ', '\t', '\r', '\n' => _ = self.advance(),
            else => return,
        }
    }
}

This consumes whitespace until a non-whitespace character appears.

Create Tokens

Add a helper:

fn makeToken(
    self: *Lexer,
    kind: TokenKind,
    start: usize,
    end: usize,
    line: usize,
    column: usize,
) Token {
    _ = self;

    return Token{
        .kind = kind,
        .text = self.input[start..end],
        .line = line,
        .column = column,
    };
}

A token references a slice of the original input.

The lexer does not allocate new strings.

Lexing Identifiers

Identifiers look like:

x
answer
my_variable

Keywords also start as identifiers.

Add helpers:

fn isIdentifierStart(ch: u8) bool {
    return std.ascii.isAlphabetic(ch) or ch == '_';
}

fn isIdentifierContinue(ch: u8) bool {
    return std.ascii.isAlphanumeric(ch) or ch == '_';
}

Now add identifier lexing:

fn lexIdentifier(self: *Lexer) Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;

    _ = self.advance();

    while (self.peek()) |ch| {
        if (!isIdentifierContinue(ch)) {
            break;
        }

        _ = self.advance();
    }

    const text = self.input[start..self.index];

    const kind = keywordKind(text) orelse .identifier;

    return Token{
        .kind = kind,
        .text = text,
        .line = line,
        .column = column,
    };
}

Recognizing Keywords

Keywords are reserved words:

let
if
else
return

Add:

fn keywordKind(text: []const u8) ?TokenKind {
    if (std.mem.eql(u8, text, "let")) return .keyword_let;
    if (std.mem.eql(u8, text, "if")) return .keyword_if;
    if (std.mem.eql(u8, text, "else")) return .keyword_else;
    if (std.mem.eql(u8, text, "return")) return .keyword_return;

    return null;
}

If the text matches a keyword, the token becomes a keyword token.

Otherwise it stays an identifier.

Lexing Numbers

Numbers are sequences of digits.

fn lexNumber(self: *Lexer) Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;

    while (self.peek()) |ch| {
        if (!std.ascii.isDigit(ch)) {
            break;
        }

        _ = self.advance();
    }

    return Token{
        .kind = .number,
        .text = self.input[start..self.index],
        .line = line,
        .column = column,
    };
}

This accepts:

0
123
9999

This first lexer only supports integers.

Lexing Strings

Strings begin and end with quotes.

fn lexString(self: *Lexer) !Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;

    _ = self.advance();

    while (self.peek()) |ch| {
        if (ch == '"') {
            _ = self.advance();

            return Token{
                .kind = .string,
                .text = self.input[start..self.index],
                .line = line,
                .column = column,
            };
        }

        if (ch == '\n') {
            return error.UnterminatedString;
        }

        _ = self.advance();
    }

    return error.UnterminatedString;
}

This lexer keeps the quotes inside the token text:

"hello"

Later stages can remove them if needed.

Single Character Tokens

Operators and punctuation are simpler.

'+' -> plus
'-' -> minus
'(' -> left_paren
';' -> semicolon

We can lex them directly.

Lexing Comments

Add support for line comments:

// this is a comment

Add:

fn skipComment(self: *Lexer) void {
    while (self.peek()) |ch| {
        if (ch == '\n') {
            return;
        }

        _ = self.advance();
    }
}

The Main Lex Function

Now combine everything.

fn nextToken(self: *Lexer) !Token {
    while (true) {
        self.skipWhitespace();

        const start = self.index;
        const line = self.line;
        const column = self.column;

        const ch = self.peek() orelse {
            return Token{
                .kind = .eof,
                .text = "",
                .line = line,
                .column = column,
            };
        };

        if (isIdentifierStart(ch)) {
            return self.lexIdentifier();
        }

        if (std.ascii.isDigit(ch)) {
            return self.lexNumber();
        }

        switch (ch) {
            '"' => return try self.lexString(),

            '/brain/' => {
                _ = self.advance();

                if (self.peek() == '/brain/') {
                    _ = self.advance();
                    self.skipComment();
                    continue;
                }

                return Token{
                    .kind = .slash,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '+' => {
                _ = self.advance();

                return Token{
                    .kind = .plus,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '-' => {
                _ = self.advance();

                return Token{
                    .kind = .minus,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '*' => {
                _ = self.advance();

                return Token{
                    .kind = .star,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '=' => {
                _ = self.advance();

                return Token{
                    .kind = .equal,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '(' => {
                _ = self.advance();

                return Token{
                    .kind = .left_paren,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            ')' => {
                _ = self.advance();

                return Token{
                    .kind = .right_paren,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '{' => {
                _ = self.advance();

                return Token{
                    .kind = .left_brace,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '}' => {
                _ = self.advance();

                return Token{
                    .kind = .right_brace,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            ',' => {
                _ = self.advance();

                return Token{
                    .kind = .comma,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            ';' => {
                _ = self.advance();

                return Token{
                    .kind = .semicolon,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            else => return error.InvalidCharacter,
        }
    }
}

Complete Program

Put this in src/main.zig:

const std = @import("std");

const TokenKind = enum {
    eof,

    identifier,
    number,
    string,

    keyword_let,
    keyword_if,
    keyword_else,
    keyword_return,

    plus,
    minus,
    star,
    slash,
    equal,

    left_paren,
    right_paren,
    left_brace,
    right_brace,

    comma,
    semicolon,
};

const Token = struct {
    kind: TokenKind,
    text: []const u8,
    line: usize,
    column: usize,
};

const Lexer = struct {
    input: []const u8,
    index: usize,
    line: usize,
    column: usize,

    fn init(input: []const u8) Lexer {
        return .{
            .input = input,
            .index = 0,
            .line = 1,
            .column = 1,
        };
    }

    fn peek(self: *Lexer) ?u8 {
        if (self.index >= self.input.len) {
            return null;
        }

        return self.input[self.index];
    }

    fn advance(self: *Lexer) ?u8 {
        const ch = self.peek() orelse return null;

        self.index += 1;

        if (ch == '\n') {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }

        return ch;
    }

    fn skipWhitespace(self: *Lexer) void {
        while (self.peek()) |ch| {
            switch (ch) {
                ' ', '\t', '\r', '\n' => _ = self.advance(),
                else => return,
            }
        }
    }

    fn skipComment(self: *Lexer) void {
        while (self.peek()) |ch| {
            if (ch == '\n') {
                return;
            }

            _ = self.advance();
        }
    }

    fn lexIdentifier(self: *Lexer) Token {
        const start = self.index;
        const line = self.line;
        const column = self.column;

        _ = self.advance();

        while (self.peek()) |ch| {
            if (!isIdentifierContinue(ch)) {
                break;
            }

            _ = self.advance();
        }

        const text = self.input[start..self.index];

        return Token{
            .kind = keywordKind(text) orelse .identifier,
            .text = text,
            .line = line,
            .column = column,
        };
    }

    fn lexNumber(self: *Lexer) Token {
        const start = self.index;
        const line = self.line;
        const column = self.column;

        while (self.peek()) |ch| {
            if (!std.ascii.isDigit(ch)) {
                break;
            }

            _ = self.advance();
        }

        return Token{
            .kind = .number,
            .text = self.input[start..self.index],
            .line = line,
            .column = column,
        };
    }

    fn lexString(self: *Lexer) !Token {
        const start = self.index;
        const line = self.line;
        const column = self.column;

        _ = self.advance();

        while (self.peek()) |ch| {
            if (ch == '"') {
                _ = self.advance();

                return Token{
                    .kind = .string,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            }

            if (ch == '\n') {
                return error.UnterminatedString;
            }

            _ = self.advance();
        }

        return error.UnterminatedString;
    }

    fn nextToken(self: *Lexer) !Token {
        while (true) {
            self.skipWhitespace();

            const start = self.index;
            const line = self.line;
            const column = self.column;

            const ch = self.peek() orelse {
                return Token{
                    .kind = .eof,
                    .text = "",
                    .line = line,
                    .column = column,
                };
            };

            if (isIdentifierStart(ch)) {
                return self.lexIdentifier();
            }

            if (std.ascii.isDigit(ch)) {
                return self.lexNumber();
            }

            switch (ch) {
                '"' => return try self.lexString(),

                '/brain/' => {
                    _ = self.advance();

                    if (self.peek() == '/brain/') {
                        _ = self.advance();
                        self.skipComment();
                        continue;
                    }

                    return Token{
                        .kind = .slash,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '+' => {
                    _ = self.advance();
                    return Token{
                        .kind = .plus,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '-' => {
                    _ = self.advance();
                    return Token{
                        .kind = .minus,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '*' => {
                    _ = self.advance();
                    return Token{
                        .kind = .star,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '=' => {
                    _ = self.advance();
                    return Token{
                        .kind = .equal,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '(' => {
                    _ = self.advance();
                    return Token{
                        .kind = .left_paren,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                ')' => {
                    _ = self.advance();
                    return Token{
                        .kind = .right_paren,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '{' => {
                    _ = self.advance();
                    return Token{
                        .kind = .left_brace,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '}' => {
                    _ = self.advance();
                    return Token{
                        .kind = .right_brace,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                ',' => {
                    _ = self.advance();
                    return Token{
                        .kind = .comma,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                ';' => {
                    _ = self.advance();
                    return Token{
                        .kind = .semicolon,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                else => return error.InvalidCharacter,
            }
        }
    }
};

fn isIdentifierStart(ch: u8) bool {
    return std.ascii.isAlphabetic(ch) or ch == '_';
}

fn isIdentifierContinue(ch: u8) bool {
    return std.ascii.isAlphanumeric(ch) or ch == '_';
}

fn keywordKind(text: []const u8) ?TokenKind {
    if (std.mem.eql(u8, text, "let")) return .keyword_let;
    if (std.mem.eql(u8, text, "if")) return .keyword_if;
    if (std.mem.eql(u8, text, "else")) return .keyword_else;
    if (std.mem.eql(u8, text, "return")) return .keyword_return;

    return null;
}

pub fn main() !void {
    const source =
        \\let answer = 42;
        \\print(answer);
        \\
        \\// this is a comment
        \\if answer {
        \\    return "done";
        \\}
    ;

    var lexer = Lexer.init(source);

    while (true) {
        const token = try lexer.nextToken();

        std.debug.print(
            "{s:<20} text={s:<12} line={d} column={d}\n",
            .{
                @tagName(token.kind),
                token.text,
                token.line,
                token.column,
            },
        );

        if (token.kind == .eof) {
            break;
        }
    }
}

Run:

zig build run

Example output:

keyword_let         text=let          line=1 column=1
identifier          text=answer       line=1 column=5
equal               text==            line=1 column=12
number              text=42           line=1 column=14
semicolon           text=;            line=1 column=16
identifier          text=print        line=2 column=1
left_paren          text=(            line=2 column=6
identifier          text=answer       line=2 column=7
right_paren         text=)            line=2 column=13
semicolon           text=;            line=2 column=14

Why Lexers Usually Use Slices

Notice that tokens store:

text: []const u8

The lexer does not allocate new strings for every token.

Instead, each token references the original source text.

That is efficient because lexers create many tokens.

A parser for a large file might produce hundreds of thousands of tokens.

Avoiding allocations matters.

Why Comments Are Removed Here

The lexer skips comments:

self.skipComment();
continue;

So comments never become tokens.

That is common in compilers because comments usually do not affect execution.

Some tools preserve comments:

formatters
documentation generators
IDEs

Those tools often emit comment tokens too.

Why Line and Column Tracking Matters

Without positions, compiler errors become frustrating.

This is bad:

syntax error

This is much better:

syntax error at line 12 column 7

Position tracking begins in the lexer because the lexer sees every character.

What You Learned

You built a lexer for a small programming language.

You scanned identifiers, keywords, numbers, strings, operators, and punctuation.

You skipped whitespace and comments.

You tracked line and column positions.

You returned tokens as slices into the original source.

This is the first stage of many language tools. Parsers, compilers, interpreters, linters, and syntax highlighters usually start with lexing.