Language Modeling

Language modeling is the task of predicting text sequences. A language model assigns probabilities to sequences of tokens and learns the statistical structure of language.

Given a token sequence:

$$ x = (x_1, x_2, \dots, x_T), $$

a language model estimates:

$$ P(x_1, x_2, \dots, x_T). $$

Modern language models are the foundation of many NLP systems, including text generation, dialogue systems, translation systems, summarizers, code assistants, and retrieval-augmented systems.

Autoregressive Factorization

A sequence probability can be decomposed using the chain rule:

$$ P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t}), $$

where:

$$ x_{<t} = (x_1, x_2, \dots, x_{t-1}). $$

The model predicts the next token conditioned on all previous tokens.

Example:

the cat sat on the

The model predicts the next token distribution:

| Token | Probability |
| --- | --- |
| mat | 0.42 |
| floor | 0.11 |
| chair | 0.05 |
| moon | 0.0001 |

A good language model assigns high probability to plausible continuations.
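The chain-rule factorization can be checked by hand on a toy model. The sketch below assumes a hypothetical table of conditional probabilities (`COND_PROBS`, with made-up values rather than output from a trained model) and sums log-probabilities, which is how the product is usually computed in practice to avoid numerical underflow.

```python
import math

# Hypothetical conditional probabilities P(next | context) for a toy model.
# The numbers are illustrative only, not taken from a trained model.
COND_PROBS = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.2, "dog": 0.8},
    ("the", "cat"): {"sat": 0.5, "ran": 0.5},
}

def sequence_log_prob(tokens):
    """Sum log P(x_t | x_<t) over all positions (chain rule in log space)."""
    total = 0.0
    for t, token in enumerate(tokens):
        context = tuple(tokens[:t])
        total += math.log(COND_PROBS[context][token])
    return total

log_p = sequence_log_prob(["the", "cat", "sat"])
prob = math.exp(log_p)  # 0.5 * 0.2 * 0.5 = 0.05
```

A real model replaces the lookup table with a neural network that outputs a distribution over the whole vocabulary for any context.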

Vocabulary and Tokens

Language models operate on token sequences rather than raw text.

A tokenizer converts text into token IDs:

"The cat sleeps."

may become:

[314, 892, 12011, 13]

The vocabulary size is denoted:

$$ |V|, $$

abbreviated as V in the tensor shapes below.

Each token corresponds to one row in the embedding matrix:

$$ E \in \mathbb{R}^{|V| \times D}. $$

The input token IDs have shape:

[B, T]

After embedding:

[B, T, D]

where:

| Symbol | Meaning |
| --- | --- |
| $B$ | Batch size |
| $T$ | Sequence length |
| $D$ | Embedding dimension |
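The shape change from token IDs to embeddings can be seen with `nn.Embedding`. This is a minimal sketch; the vocabulary size and embedding dimension are arbitrary assumptions chosen so that the example token IDs are valid indices.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 20000, 8  # illustrative sizes (assumptions)
embedding = nn.Embedding(vocab_size, embed_dim)  # weight has shape [|V|, D]

# One sequence of four token IDs: shape [B, T] = [1, 4]
input_ids = torch.tensor([[314, 892, 12011, 13]])

embedded = embedding(input_ids)  # shape [B, T, D] = [1, 4, 8]

# Each output vector is the row of the embedding matrix indexed by the token ID.
same_row = torch.equal(embedded[0, 2], embedding.weight[12011])
```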

Next-Token Prediction

The central training objective of autoregressive language models is next-token prediction.

Suppose the token sequence is:

the cat sat

The model receives:

| Input | Target |
| --- | --- |
| the | cat |
| the cat | sat |
| the cat sat | <eos> |

The model predicts one token at each position.

If:

logits: [B, T, V]

then the target tensor is:

targets: [B, T]

The loss compares predicted logits with the next-token targets.
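The pairing of each prefix with its next token amounts to a one-position shift of the same sequence. The following sketch uses plain Python lists and a hypothetical helper `next_token_pairs` to make that explicit.

```python
def next_token_pairs(tokens):
    """Pair each prefix with the token that follows it."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "<eos>"]
pairs = next_token_pairs(tokens)
# [(['the'], 'cat'), (['the', 'cat'], 'sat'), (['the', 'cat', 'sat'], '<eos>')]

# In sequence form the same relationship is a one-position shift.
inputs = tokens[:-1]   # ['the', 'cat', 'sat']
targets = tokens[1:]   # ['cat', 'sat', '<eos>']
```

Because of the shift, a single forward pass over `inputs` yields one training example per position.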

Causal Masking

Autoregressive models must not see future tokens during training.

For the sequence:

the cat sat

the prediction for cat must not depend on sat.

Transformers enforce this using a causal attention mask.

For sequence length $T=4$:

1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1

Position $t$ may attend only to positions:

$$ \le t. $$

In PyTorch:

import torch

T = 4

mask = torch.tril(torch.ones(T, T))
print(mask)

Output:

tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])

Without causal masking, the model could trivially copy future tokens during training.
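One common way to apply the mask, sketched here with random stand-in scores rather than real query-key products, is to set disallowed entries to negative infinity before the softmax so that they receive exactly zero attention weight.

```python
import torch

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)                   # hypothetical raw attention scores
mask = torch.tril(torch.ones(T, T)).bool()   # True where attention is allowed

# Disallowed (future) positions get -inf so softmax assigns them zero weight.
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)  # each row sums to 1
```

The first row attends only to itself, so its single allowed weight is 1.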

Cross-Entropy Training Objective

Autoregressive language models usually use cross-entropy loss.

Suppose:

logits: [B, T, V]
targets: [B, T]

We flatten the tensors:

import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

B, T, V = logits.shape

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

The target at each position is the next token.

The model minimizes the average negative log-likelihood over positions:

$$ -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t}). $$

A lower loss means the model assigns higher probability to the correct next token.
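To make the objective concrete, the per-position loss can be computed by hand as the negative log of the softmax probability of the target. This is a standard-library sketch with made-up logits over a three-token vocabulary. The helper is named `cross_entropy` only for illustration; it is not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits over a 3-token vocabulary at two positions.
logits_per_position = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]

losses = [cross_entropy(z, y) for z, y in zip(logits_per_position, targets)]
mean_loss = sum(losses) / len(losses)  # average over positions
```

The second position has uniform logits, so its loss equals log 3. The first position favors the correct token, so its loss is lower.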

Perplexity

Perplexity is a common evaluation metric for language models.

If the average negative log-likelihood per token is:

$$ L, $$

then perplexity is:

$$ \operatorname{PPL} = \exp(L). $$

Perplexity measures how uncertain the model is.

Interpretation:

| Perplexity | Interpretation |
| --- | --- |
| Low | Model predicts tokens confidently |
| High | Model is uncertain |

If a model has perplexity 10, it behaves roughly as if it chooses among 10 equally likely options per step.

Lower perplexity usually indicates better language modeling performance, though it does not perfectly correlate with downstream usefulness or factual accuracy.
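The "10 equally likely options" reading can be verified numerically. The sketch below assumes we already have the probability the model assigned to each correct token; the probability values are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

uniform = perplexity([0.1] * 5)           # about 10: like choosing among 10 options
confident = perplexity([0.9, 0.8, 0.95])  # close to 1
```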

Recurrent Language Models

Before transformers, many language models used recurrent neural networks.

An RNN language model processes tokens sequentially:

$$ h_t = f(h_{t-1}, x_t). $$

The hidden state summarizes previous tokens.

An LSTM language model:

import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
        )

        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        h, _ = self.lstm(x)
        logits = self.output(h)
        return logits

RNNs struggle to capture long-range dependencies, and their step-by-step recurrence prevents parallelizing training across time steps. Transformers have largely replaced them for large-scale language modeling.

Transformer Language Models

Transformer language models use self-attention instead of recurrence.

Advantages:

| Advantage | Description |
| --- | --- |
| Parallel training | All tokens processed simultaneously |
| Long-range interactions | Direct token-to-token attention |
| Scalable training | Efficient GPU utilization |
| Better representation learning | Rich contextual embeddings |

A decoder-only transformer computes:

input_ids: [B, T]
-> embeddings
-> transformer blocks
-> hidden_states: [B, T, D]
-> output projection
-> logits: [B, T, V]

Each position predicts the next token.

Modern large language models such as GPT-style systems use this architecture.
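The pipeline above can be sketched with PyTorch's built-in layers. The class name `TinyDecoderLM`, all sizes, and the use of learned position embeddings are assumptions made for illustration. An encoder stack run with a causal mask is used here as a stand-in for a decoder-only stack; production models typically use custom blocks.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """Hypothetical minimal decoder-only LM; all sizes are illustrative."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.0, batch_first=True,
        )
        # An encoder stack with a causal mask behaves as a decoder-only stack.
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        B, T = input_ids.shape
        positions = torch.arange(T, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)  # [B, T, D]
        # True marks positions that must NOT be attended to (the future).
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.blocks(x, mask=causal)  # [B, T, D]
        return self.output(h)            # [B, T, V]

torch.manual_seed(0)
model = TinyDecoderLM().eval()
input_ids = torch.randint(0, 100, (2, 10))
with torch.no_grad():
    logits = model(input_ids)  # [2, 10, 100]
```

Because of the causal mask, changing the last input token leaves the logits at all earlier positions unchanged.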

Weight Tying

Many language models tie input embeddings and output projection weights.

The embedding matrix:

$$ E \in \mathbb{R}^{V \times D} $$

is reused for output logits:

$$ z_t = h_t E^\top. $$

Advantages:

| Benefit | Description |
| --- | --- |
| Fewer parameters | Reduced memory usage |
| Better generalization | Shared token representations |
| Faster training | Smaller model size |

Weight tying is now common in transformer language models.
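In PyTorch, tying can be done by assigning the embedding's parameter to the output layer. This works because `nn.Linear` stores its weight with shape `[out_features, in_features]`, which matches the embedding matrix. The sizes below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64  # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)        # weight: [V, D]
output = nn.Linear(d_model, vocab_size, bias=False)  # weight: [V, D]

# Tie the weights: both modules now share one parameter tensor.
output.weight = embedding.weight

h = torch.randn(2, 5, d_model)  # hypothetical hidden states [B, T, D]
logits = output(h)              # equals h @ E^T, shape [B, T, V]
```

With tying, gradients from both the input lookup and the output projection accumulate into the same matrix.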

Positional Encoding

Transformers do not inherently know token order.

Example:

dog bites man
man bites dog

contain the same tokens but different meanings.

Positional information must therefore be added.

A positional encoding provides a vector:

$$ p_t $$

for each position $t$.

With additive positional encodings (learned or sinusoidal), the transformer input becomes:

$$ x_t = e_t + p_t, $$

where:

| Symbol | Meaning |
| --- | --- |
| $e_t$ | Token embedding |
| $p_t$ | Positional embedding |

Modern models use several positional methods:

| Method | Description |
| --- | --- |
| Learned embeddings | Trainable position vectors |
| Sinusoidal encoding | Fixed trigonometric patterns |
| Rotary embeddings | Rotate query and key vectors by position-dependent angles |
| Relative attention | Encode token distance |

Position encoding strongly affects long-context behavior.
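As one concrete instance, the fixed sinusoidal scheme can be written in a few lines. This sketch uses the commonly cited formulation (sine on even indices, cosine on odd indices, frequency base 10000). The function name and the small sizes are assumptions for illustration.

```python
import math

def sinusoidal_encoding(num_positions, dim, base=10000.0):
    """Return a [num_positions][dim] table; even indices use sin, odd use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (base ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])  # trim in case dim is odd
    return table

pe = sinusoidal_encoding(num_positions=4, dim=6)
# Position 0 has angle 0 everywhere: sin = 0 and cos = 1 alternate.
```

Each row plays the role of $p_t$ and would be added to the token embedding at the same position.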

Context Length

A transformer attends over a finite context window.

If the maximum context length is:

$$ T_{\max}, $$

then tokens more than $T_{\max}$ positions in the past cannot be attended to directly.

Longer context windows improve:

| Capability | Example |
| --- | --- |
| Long-document reasoning | Research papers |
| Multi-turn dialogue | Long conversations |
| Code understanding | Large repositories |
| Retrieval integration | Many retrieved passages |

However, self-attention cost grows approximately as:

$$ O(T^2), $$

where $T$ is sequence length.

This motivates research into sparse attention, memory systems, state-space models, and linear attention methods.
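A quick count shows what quadratic growth means in practice. The helper below is a hypothetical illustration that counts only query-key score entries for one layer and ignores all other costs.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of query-key scores computed by full self-attention in one layer."""
    return num_heads * seq_len * seq_len

for T in (1_000, 2_000, 4_000):
    print(T, attention_score_entries(T))
# Doubling the sequence length quadruples the number of scores.
```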

Training Data

Language models are trained on large corpora.

Common data sources:

| Source | Example |
| --- | --- |
| Web pages | Common Crawl |
| Books | Digitized books |
| Code repositories | GitHub |
| Scientific papers | arXiv |
| Dialogues | Chat logs |
| Documentation | Technical manuals |

Training quality depends heavily on data quality.

Problems include:

| Issue | Description |
| --- | --- |
| Duplicates | Memorization risk |
| Spam | Low-quality language |
| Toxic content | Harmful outputs |
| Imbalance | Overrepresentation of domains |
| Copyright concerns | Legal restrictions |

Data filtering and deduplication are important parts of large-scale language model training.

Scaling Laws

Large language models exhibit scaling behavior.

Performance improves predictably as:

| Variable | What increases |
| --- | --- |
| Model parameters | Larger networks |
| Training tokens | More data |
| Compute | More optimization steps |

Empirical scaling laws show approximate power-law relationships between loss and model size, dataset size, and training compute.

However, scaling eventually encounters constraints:

| Constraint | Example |
| --- | --- |
| Compute cost | GPU expense |
| Memory limits | Model size |
| Data quality | Finite high-quality text |
| Latency | Inference speed |
| Energy usage | Training power consumption |

Scaling alone does not guarantee reasoning ability, factuality, or safety.

Inference and KV Caching

Autoregressive generation repeatedly predicts one token at a time.

Naively recomputing all attention states each step is expensive.

Transformers therefore cache previous key and value tensors.

At generation step $t$:

| Cached tensor | Shape |
| --- | --- |
| Keys | [B, H, t, D_h] |
| Values | [B, H, t, D_h] |

where:

| Symbol | Meaning |
| --- | --- |
| $H$ | Number of attention heads |
| $D_h$ | Head dimension |

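KV caching avoids recomputing keys and values for the entire prefix at every step. Each new token needs only its own query, key, and value, and then attends over the cached tensors, so the attention cost per generation step drops from roughly $O(t^2)$ to $O(t)$.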
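The equivalence between cached, step-by-step attention and full causal attention can be checked on a single head. In this sketch the per-token queries, keys, and values are random stand-ins; in a real model they would come from learned projections, with each key and value computed once and then stored.

```python
import torch

torch.manual_seed(0)
T, D_h = 5, 8
# Hypothetical per-token queries, keys, and values for a single head.
q, k, v = torch.randn(T, D_h), torch.randn(T, D_h), torch.randn(T, D_h)

def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scores = keys @ query / D_h ** 0.5            # [t]
    return torch.softmax(scores, dim=0) @ values  # [D_h]

# Incremental decoding: append the new key/value to the cache at each step.
k_cache, v_cache, cached_outputs = [], [], []
for t in range(T):
    k_cache.append(k[t])
    v_cache.append(v[t])
    cached_outputs.append(attend(q[t], torch.stack(k_cache), torch.stack(v_cache)))
cached_outputs = torch.stack(cached_outputs)      # [T, D_h]

# Reference: full causal attention computed in one pass.
scores = (q @ k.T) / D_h ** 0.5
causal = torch.tril(torch.ones(T, T)).bool()
full_outputs = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1) @ v
```

Both paths produce the same outputs, but the cached loop touches only one new query per step.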

Sampling from Language Models

The model outputs logits:

$$ z_t \in \mathbb{R}^{V}. $$

A decoding algorithm converts logits into tokens.

Common methods:

| Method | Behavior |
| --- | --- |
| Greedy decoding | Deterministic highest-probability token |
| Beam search | Explore several sequences |
| Top-k sampling | Restrict to top-k tokens |
| Top-p sampling | Restrict cumulative probability mass |
| Temperature sampling | Adjust randomness |

Generation quality depends strongly on decoding configuration.

| Temperature | Effects |
| --- | --- |
| Low | More deterministic, more repetitive, less creative |
| High | More diverse, more random, less stable |
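The decoding methods can be sketched with the standard library alone. The helper names, the logits, and the order in which the filters are chained are all assumptions for illustration; libraries differ in how they combine temperature, top-k, and top-p.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over 4 tokens
greedy = max(range(len(logits)), key=lambda i: logits[i])
sharp = softmax(logits, temperature=0.5)  # concentrates mass on the top token
flat = softmax(logits, temperature=2.0)   # spreads mass more evenly
filtered = top_p_filter(top_k_filter(softmax(logits), k=3), p=0.9)
token = sample(filtered, random.Random(0))
```

Greedy decoding corresponds to the limit of temperature approaching zero, where all mass falls on the highest-logit token.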

Emergent Behaviors

Large language models sometimes exhibit capabilities not obvious in smaller models.

Examples:

| Capability | Example |
| --- | --- |
| In-context learning | Learn from prompt examples |
| Few-shot reasoning | Solve unseen tasks |
| Tool coordination | Use external APIs |
| Chain-of-thought reasoning | Multi-step explanations |
| Code synthesis | Generate programs |

The exact causes remain an active research topic.

Some behaviors appear gradually with scale. Others appear more abruptly.

Failure Modes

Language models have important limitations.

| Failure mode | Example |
| --- | --- |
| Hallucination | False factual claims |
| Memorization | Reproducing training data |
| Bias | Harmful stereotypes |
| Prompt injection | Unsafe instruction following |
| Context confusion | Losing track of dialogue |
| Arithmetic weakness | Calculation errors |

Language models optimize token prediction, not truth, reasoning correctness, or safety.

This distinction is critical when deploying systems in high-stakes settings.

Pretraining and Fine-Tuning

Most modern systems use two stages:

| Stage | Purpose |
| --- | --- |
| Pretraining | Learn general language structure |
| Fine-tuning | Adapt to downstream tasks |

Pretraining uses large-scale next-token prediction.

Fine-tuning adapts the model for:

| Task | Example |
| --- | --- |
| Dialogue | Chat systems |
| Translation | Multilingual systems |
| Coding | Code generation |
| QA | Reading comprehension |
| Summarization | Condensed outputs |

Instruction tuning and RLHF further shape model behavior.

PyTorch Training Example

A simplified transformer language model training step:

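import torch.nn as nn

def training_step(model, batch, optimizer):
    """Run one optimization step and return the scalar loss."""
    input_ids = batch["input_ids"]  # [B, T] token IDs
    targets = batch["targets"]      # [B, T] next-token IDs

    logits = model(input_ids)
    # logits: [B, T, V]

    B, T, V = logits.shape

    loss_fn = nn.CrossEntropyLoss()

    # Flatten batch and time so each position is one classification example.
    loss = loss_fn(
        logits.reshape(B * T, V),
        targets.reshape(B * T),
    )

    # Clear old gradients, backpropagate, and update the parameters.
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()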

The targets are usually shifted by one token relative to the inputs.
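A self-contained run shows how the shifted batch is built and fed to a training step. The toy model here is a hypothetical bigram predictor (an embedding followed by a linear layer), not a transformer, and the data are random token IDs, so the falling loss only reflects memorization of one fixed batch.

```python
import torch
import torch.nn as nn

def training_step(model, batch, optimizer):
    """One optimization step; same logic as the training step described above."""
    logits = model(batch["input_ids"])  # [B, T, V]
    B, T, V = logits.shape
    loss = nn.functional.cross_entropy(
        logits.reshape(B * T, V), batch["targets"].reshape(B * T)
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
vocab_size = 16
# Hypothetical toy model: predicts the next token from the current token only.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

tokens = torch.randint(0, vocab_size, (4, 9))  # [B, T + 1] token IDs
batch = {
    "input_ids": tokens[:, :-1],  # positions 0 .. T-1
    "targets": tokens[:, 1:],     # positions 1 .. T (shifted by one)
}

losses = [training_step(model, batch, optimizer) for _ in range(50)]
```

Slicing one tensor of length T + 1 into inputs and targets guarantees that the two stay aligned.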

Summary

Language modeling predicts token sequences autoregressively. Modern language models use transformer architectures with causal masking and next-token prediction objectives.

Key components include tokenization, embeddings, positional encoding, self-attention, output projections, and decoding algorithms. Training uses cross-entropy loss over large text corpora. Evaluation often uses perplexity.

Large language models extend basic language modeling into dialogue, reasoning, retrieval augmentation, tool use, and multimodal systems, but they still inherit core limitations from probabilistic next-token prediction.