Language Modeling

Language modeling is the task of predicting text sequences. A language model assigns probabilities to sequences of tokens and learns the statistical structure of language.

Given a token sequence:

$$ x = (x_1, x_2, \dots, x_T), $$

a language model estimates:

$$ P(x_1, x_2, \dots, x_T). $$

Modern language models are the foundation of many NLP systems, including text generators, dialogue systems, translation systems, summarizers, code assistants, and retrieval-augmented systems.

Autoregressive Factorization

A sequence probability can be decomposed using the chain rule:

$$ P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t}), $$

where:

$$ x_{<t} = (x_1, x_2, \dots, x_{t-1}). $$

The model predicts the next token conditioned on all previous tokens.

Example:

the cat sat on the

The model predicts the next token distribution:

Token	Probability
`mat`	0.42
`floor`	0.11
`chair`	0.05
`moon`	0.0001

A good language model assigns high probability to plausible continuations.
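As a rough illustration of the chain rule above, the sketch below multiplies per-token conditional probabilities into a joint probability. The numbers in `cond_probs` are made up for this example and do not come from any real model.

```python
import math

# Hypothetical values of P(x_t | x_<t) for the three tokens of "the cat sat".
# These probabilities are invented for illustration only.
cond_probs = [0.20, 0.05, 0.30]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(cond_probs)

# In practice log-probabilities are summed instead, which avoids
# numerical underflow on long sequences.
joint_log_prob = sum(math.log(p) for p in cond_probs)

print(joint_prob)                # about 0.003
print(math.exp(joint_log_prob))  # same value, up to rounding
```

Working in log space is the usual choice because a product of many small probabilities quickly falls below floating-point range.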

Vocabulary and Tokens

Language models operate on token sequences rather than raw text.

A tokenizer converts text into token IDs:

"The cat sleeps."

may become:

[314, 892, 12011, 13]

The vocabulary size is:

$$ |V|. $$

In tensor shapes and code, the same quantity is written as `V`.

Each token corresponds to one row in the embedding matrix:

$$ E \in \mathbb{R}^{|V| \times D}. $$

The input token IDs have shape:

[B, T]

After embedding:

[B, T, D]

where:

Symbol	Meaning
$B$	Batch size
$T$	Sequence length
$D$	Embedding dimension
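The shapes above can be checked with a small sketch using `nn.Embedding`. The vocabulary size and embedding dimension here are arbitrary values chosen for illustration, and the token IDs are the ones from the tokenizer example.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 64   # assumed |V| and D, for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

# One sequence of four token IDs: shape [B, T] = [1, 4]
input_ids = torch.tensor([[314, 892, 12011, 13]])

x = embedding(input_ids)             # lookup of one row of E per token

print(embedding.weight.shape)  # torch.Size([50000, 64])
print(x.shape)                 # torch.Size([1, 4, 64])
```

The embedding layer is a table lookup: the vector for token ID 314 is row 314 of the weight matrix.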

Next-Token Prediction

The central training objective of autoregressive language models is next-token prediction.

Suppose the token sequence is:

the cat sat

The model receives:

Input	Target
`the`	`cat`
`the cat`	`sat`
`the cat sat`	`<eos>`

The model predicts one token at each position.

If:

logits: [B, T, V]

then the target tensor is:

targets: [B, T]

The loss compares predicted logits with the next-token targets.
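One common way to build the input and target tensors is to slice a single token sequence twice, offset by one position. The sketch below assumes hypothetical token IDs for `the cat sat <eos>`; the specific ID values are placeholders.

```python
import torch

# Hypothetical token IDs for "the cat sat <eos>" (values are placeholders).
tokens = torch.tensor([[314, 892, 5021, 2]])  # [B, T + 1]

input_ids = tokens[:, :-1]  # every token except the last:  the, cat, sat
targets = tokens[:, 1:]     # every token except the first: cat, sat, <eos>

print(input_ids.tolist())  # [[314, 892, 5021]]
print(targets.tolist())    # [[892, 5021, 2]]
```

Position `i` of `targets` holds the token that follows position `i` of `input_ids`, which matches the input/target table above.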

Causal Masking

Autoregressive models must not see future tokens during training.

For the sequence:

the cat sat

the prediction for `cat` must not depend on `cat` itself or on `sat`.

Transformers enforce this using a causal attention mask.

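Position $t$ may attend only to positions $s$ with:

$$ s \le t. $$

For sequence length $T=4$, the mask is a lower-triangular matrix in which row $t$ has ones at the positions that position $t$ may attend to. In PyTorch: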

import torch

T = 4

mask = torch.tril(torch.ones(T, T))
print(mask)

Output:

tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])

Without causal masking, the model could trivially copy future tokens during training.
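A minimal sketch of how such a mask is typically applied: disallowed attention scores are set to negative infinity before the softmax, so they receive zero weight. The random `scores` tensor stands in for real query-key products.

```python
import torch

torch.manual_seed(0)

T = 4
scores = torch.randn(T, T)                   # placeholder attention scores
allowed = torch.tril(torch.ones(T, T)).bool()  # True where attention is allowed

# Future positions get -inf, so softmax assigns them exactly zero weight.
masked_scores = scores.masked_fill(~allowed, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights[0])  # tensor([1., 0., 0., 0.])
```

Each row of `weights` still sums to one, but all mass lies on the current and earlier positions.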

Cross-Entropy Training Objective

Autoregressive language models usually use cross-entropy loss.

Suppose:

logits: [B, T, V]
targets: [B, T]

We flatten the tensors:

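import torch
import torch.nn as nn

# Placeholder tensors for illustration; in practice logits come from the model
B, T, V = 2, 4, 100
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

loss_fn = nn.CrossEntropyLoss()

# CrossEntropyLoss expects [N, V] logits and [N] class indices,
# so the batch and time dimensions are merged
loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)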

The target at each position is the next token.

The model minimizes the average negative log-likelihood over positions:

$$ -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t}). $$

A lower loss means the model assigns higher probability to the correct next token.
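To connect the loss function to the formula, the sketch below computes the negative log-probability of each target token by hand and compares the mean with `F.cross_entropy`. The tensors are random placeholders with small, arbitrary sizes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, V = 2, 3, 5                       # small placeholder sizes
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# log P(token | context) for every vocabulary entry: [B, T, V]
log_probs = F.log_softmax(logits, dim=-1)

# Pick out the log-probability of the correct next token: [B, T]
target_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

manual_loss = -target_log_probs.mean()
builtin_loss = F.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T))

print(torch.allclose(manual_loss, builtin_loss))  # True
```

With the default mean reduction, the built-in loss is the average of $-\log P(x_t \mid x_{<t})$ over all positions in the batch.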

Perplexity

Perplexity is a common evaluation metric for language models.

If the average negative log-likelihood per token is:

$$ L, $$

then perplexity is:

$$ \operatorname{PPL} = \exp(L). $$

Perplexity measures how uncertain the model is about the tokens that actually occur.

Interpretation:

Perplexity	Interpretation
Low	Model assigns high probability to the observed tokens
High	Model assigns low probability to the observed tokens

If a model has perplexity 10, it behaves roughly as if it chooses among 10 equally likely options per step.

Lower perplexity usually indicates better language modeling performance, though it does not perfectly correlate with downstream usefulness or factual accuracy.
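A small worked example of the definition, using made-up probabilities that a model might assign to the correct tokens. The helper name `perplexity` is introduced here for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always gives the correct token probability 0.1 behaves
# like a uniform choice among 10 options.
uniform_ppl = perplexity([0.1] * 5)

# Hypothetical mixed case: some tokens are easier to predict than others.
mixed_ppl = perplexity([0.5, 0.25, 0.125, 0.125])

print(uniform_ppl)  # about 10.0
print(mixed_ppl)    # about 4.76
```

Perplexity is the inverse geometric mean of the per-token probabilities, which is why a single very unlikely token can raise it sharply.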

Recurrent Language Models

Before transformers, many language models used recurrent neural networks.

An RNN language model processes tokens sequentially:

$$ h_t = f(h_{t-1}, x_t). $$

The hidden state summarizes previous tokens.

An LSTM language model:

import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()

        # Token IDs -> dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # batch_first=True keeps tensors in [B, T, ...] layout
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
        )

        # Hidden state -> one logit per vocabulary entry
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)  # [B, T, embedding_dim]
        h, _ = self.lstm(x)            # [B, T, hidden_dim]
        logits = self.output(h)        # [B, T, vocab_size]
        return logits

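RNN models struggle with long-range dependencies, because information must pass through many recurrent steps, and with parallelization, because each hidden state depends on the previous one. Transformers largely replaced them for large-scale language modeling.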
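The sequential dependence can be made explicit by unrolling the recurrence by hand. This sketch uses `nn.RNNCell` as a simple stand-in for $f$; the sizes and the random input are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, D, H = 2, 5, 8, 16        # placeholder batch, length, input, hidden sizes
x = torch.randn(B, T, D)        # stands in for embedded tokens
cell = nn.RNNCell(D, H)         # one step of h_t = f(h_{t-1}, x_t)

h = torch.zeros(B, H)           # initial hidden state h_0
states = []
for t in range(T):
    # Step t needs the result of step t - 1, so the loop over time
    # cannot be run in parallel.
    h = cell(x[:, t], h)
    states.append(h)

hidden_states = torch.stack(states, dim=1)
print(hidden_states.shape)  # torch.Size([2, 5, 16])
```

`nn.LSTM` runs a more elaborate version of the same loop internally, which is why training time scales with sequence length even on parallel hardware.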

Transformer Language Models

Transformer language models use self-attention instead of recurrence.

Advantages:

Advantage	Description
Parallel training	All tokens processed simultaneously
Long-range interactions	Direct token-to-token attention
Scalable training	Efficient GPU utilization
Better representation learning	Rich contextual embeddings

A decoder-only transformer computes:

input_ids: [B, T]
-> embeddings
-> transformer blocks
-> hidden_states: [B, T, D]
-> output projection
-> logits: [B, T, V]

Each position predicts the next token.

Modern large language models such as GPT-style systems use this architecture.
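The pipeline above can be sketched in a few lines. This is not how any particular production model is implemented: the class name `TinyDecoderLM` is hypothetical, learned position embeddings are one choice among several, and `nn.TransformerEncoder` with a causal mask is used here as a stand-in for a stack of decoder-only blocks.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """Hypothetical minimal decoder-only LM, for shape illustration only."""

    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.0, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        B, T = input_ids.shape
        pos = torch.arange(T, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)      # [B, T, D]
        # Boolean mask: True marks positions that must NOT be attended to.
        causal = torch.ones(T, T, device=input_ids.device).triu(1).bool()
        h = self.blocks(x, mask=causal)                      # [B, T, D]
        return self.out(h)                                   # [B, T, V]

torch.manual_seed(0)
model = TinyDecoderLM(vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=16)
input_ids = torch.randint(0, 100, (2, 8))
with torch.no_grad():
    logits = model(input_ids)
print(logits.shape)  # torch.Size([2, 8, 100])
```

Because of the causal mask, changing the last input token leaves the logits at all earlier positions unchanged.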

Weight Tying

Many language models tie input embeddings and output projection weights.

The embedding matrix:

$$ E \in \mathbb{R}^{|V| \times D} $$

is reused for output logits:

$$ z_t = h_t E^\top. $$

Advantages:

Benefit	Description
Fewer parameters	Reduced memory usage
Better generalization	Shared token representations
Lighter training	Fewer weights, gradients, and optimizer states to store

Weight tying is common in transformer language models, although some large models keep separate input and output matrices.
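In PyTorch, one way to tie the weights is to assign the embedding parameter to a bias-free output layer. The sizes below are arbitrary, and the random `h` stands in for real hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

V, D = 1000, 64                        # placeholder vocabulary and model sizes
embedding = nn.Embedding(V, D)         # weight shape [V, D]
output = nn.Linear(D, V, bias=False)   # weight shape [V, D] as well

# Share one Parameter object between input embedding and output projection.
output.weight = embedding.weight

h = torch.randn(2, 5, D)               # stand-in hidden states [B, T, D]
z = output(h)                          # logits [B, T, V], equal to h @ E^T

print(output.weight is embedding.weight)  # True
print(z.shape)                            # torch.Size([2, 5, 1000])
```

Since `nn.Linear` stores its weight as `[out_features, in_features]`, the shapes already match and no transpose is needed when sharing.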

Positional Encoding

Transformers do not inherently know token order.

Example:

dog bites man
man bites dog

contain the same tokens but have different meanings.

Positional information must therefore be added.

A positional encoding provides a vector:

$$ p_t $$

for each position $t$.

With additive positional encodings, the transformer input becomes:

$$ x_t = e_t + p_t, $$

where:

Symbol	Meaning
$e_t$	Token embedding
$p_t$	Positional embedding

Modern models use several positional methods:

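Method	Description
Learned embeddings	Trainable position vectors added to token embeddings
Sinusoidal encoding	Fixed sine and cosine patterns added to token embeddings
Rotary embeddings	Rotate query and key vectors by position-dependent angles
Relative attention	Encode token distance inside the attention computation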

Position encoding strongly affects long-context behavior.
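As one concrete instance of $p_t$, the sketch below builds the fixed sinusoidal encoding, with sines on even dimensions and cosines on odd dimensions at geometrically spaced frequencies. The function name is introduced here for illustration, and the code assumes an even embedding dimension.

```python
import math
import torch

def sinusoidal_positions(T, D):
    """Return a [T, D] matrix of sinusoidal position vectors (D must be even)."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # [T, 1]
    # Frequencies 1 / 10000^(2i / D) for i = 0 .. D/2 - 1
    freqs = torch.exp(
        torch.arange(0, D, 2, dtype=torch.float32) * (-math.log(10000.0) / D)
    )                                                              # [D / 2]
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(pos * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * freqs)   # odd dimensions
    return pe

pe = sinusoidal_positions(T=6, D=8)
print(pe.shape)  # torch.Size([6, 8])
print(pe[0])     # tensor([0., 1., 0., 1., 0., 1., 0., 1.])
```

The result would be added to the token embeddings, so `x = token_embeddings + pe` for a sequence of length 6.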

Context Length

A transformer attends over a finite context window.

If the maximum context length is:

$$ T_{\max}, $$

then tokens more than $T_{\max}$ positions back cannot be attended to directly.

Longer context windows improve:

Capability	Example
Long-document reasoning	Research papers
Multi-turn dialogue	Long conversations
Code understanding	Large repositories
Retrieval integration	Many retrieved passages

However, self-attention cost grows approximately as:

$$ O(T^2), $$

where $T$ is sequence length.

This motivates research into sparse attention, memory systems, state-space models, and linear attention methods.
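The quadratic growth is easy to see by counting attention score entries, one per pair of positions. The helper below is a hypothetical illustration that ignores constant factors, head dimension, and implementation details.

```python
def attention_score_entries(T, H=1, B=1):
    """Number of query-key scores for full self-attention (one per position pair)."""
    return B * H * T * T

for T in (1024, 2048, 4096):
    print(T, attention_score_entries(T))
# 1024 1048576
# 2048 4194304
# 4096 16777216
```

Doubling the sequence length quadruples the number of scores, which is the cost that sparse and linear attention methods try to avoid.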

Training Data

Language models are trained on large corpora.

Common data sources:

Source	Example
Web pages	Common Crawl
Books	Digitized books
Code repositories	GitHub
Scientific papers	arXiv
Dialogues	Chat logs
Documentation	Technical manuals

Training quality depends heavily on data quality.

Problems include:

Issue	Description
Duplicates	Memorization risk
Spam	Low-quality language
Toxic content	Harmful outputs
Imbalance	Overrepresentation of domains
Copyright concerns	Legal restrictions

Data filtering and deduplication are important parts of large-scale language model training.

Scaling Laws

Large language models exhibit scaling behavior.

Performance improves predictably as the following quantities grow:

Variable	Scaled by
Model parameters	Larger networks
Training tokens	More data
Compute	More total training FLOPs

Empirical scaling laws show approximate power-law relationships between loss and model size, dataset size, and training compute.
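To illustrate what a power-law relationship implies, the sketch below uses a toy function of the form $a \cdot C^{-b}$. The constants `a` and `b` are invented for this example and are not fitted to any real model family.

```python
def power_law_loss(compute, a=10.0, b=0.05):
    """Toy power law: loss = a * compute^(-b). Constants are hypothetical."""
    return a * compute ** (-b)

# Under a power law, every 10x increase in compute multiplies the loss
# by the same constant factor, regardless of the starting point.
ratio_low = power_law_loss(1e21) / power_law_loss(1e20)
ratio_high = power_law_loss(1e23) / power_law_loss(1e22)

print(round(ratio_low, 3))   # 0.891
print(round(ratio_high, 3))  # 0.891
```

On a log-log plot this relationship is a straight line with slope `-b`, which is how such laws are usually identified in practice.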

However, scaling eventually encounters constraints:

Constraint	Example
Compute cost	GPU expense
Memory limits	Model size
Data quality	Finite high-quality text
Latency	Inference speed
Energy usage	Training power consumption

Scaling alone does not guarantee reasoning ability, factuality, or safety.

Inference and KV Caching

Autoregressive generation repeatedly predicts one token at a time.

Naively recomputing all attention states each step is expensive.

Transformers therefore cache previous key and value tensors.

At generation step $t$, the cache for each layer holds:

Cached tensor	Shape
Keys	`[B, H, t, D_h]`
Values	`[B, H, t, D_h]`

where:

Symbol	Meaning
$H$	Number of attention heads
$D_h$	Head dimension

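KV caching avoids recomputing keys and values for the entire prefix at every step. Each new token then attends to $t$ cached positions, so the attention cost per step grows as $O(t)$ rather than $O(t^2)$, at the price of cache memory that grows linearly with sequence length.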
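A minimal sketch of the caching pattern for a single attention layer. Random tensors stand in for the query, key, and value projections of each newly generated token, and the helper `attend` is introduced here for illustration.

```python
import torch

torch.manual_seed(0)

B, H, D_h = 1, 2, 4   # placeholder batch size, heads, head dimension

def attend(q, k, v):
    """Scaled dot-product attention of one new query over all cached keys."""
    scores = q @ k.transpose(-2, -1) / D_h ** 0.5   # [B, H, 1, t]
    return torch.softmax(scores, dim=-1) @ v        # [B, H, 1, D_h]

k_cache = torch.empty(B, H, 0, D_h)   # empty cache before generation starts
v_cache = torch.empty(B, H, 0, D_h)
outputs = []

for step in range(3):
    # Stand-ins for the projections of the newest token only.
    q_new = torch.randn(B, H, 1, D_h)
    k_new = torch.randn(B, H, 1, D_h)
    v_new = torch.randn(B, H, 1, D_h)

    # Append instead of recomputing keys and values for the whole prefix.
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)

    outputs.append(attend(q_new, k_cache, v_cache))
    print(step + 1, tuple(k_cache.shape))
# 1 (1, 2, 1, 4)
# 2 (1, 2, 2, 4)
# 3 (1, 2, 3, 4)
```

No causal mask is needed inside the loop, because the cache only ever contains the current and earlier positions.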

Sampling from Language Models

The model outputs logits:

$$ z_t \in \mathbb{R}^{|V|}. $$

A decoding algorithm converts logits into tokens.

Common methods:

Method	Behavior
Greedy decoding	Always pick the highest-probability token
Beam search	Keep the several highest-scoring partial sequences
Top-k sampling	Sample from the k most probable tokens
Top-p sampling	Sample from the smallest token set whose cumulative probability reaches p
Temperature sampling	Divide logits by a temperature to adjust randomness

Generation quality depends strongly on decoding configuration.

Low temperature makes generation more deterministic, more repetitive, and less creative.

High temperature makes generation more diverse, more random, and less stable.
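The sampling methods above can be combined in one function. This is a sketch for a single position, not a reference implementation: the function name and argument defaults are assumptions, and real libraries differ in details such as tie handling and the order of the filters.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token ID from a 1-D logits vector of shape [V]."""
    logits = logits / temperature               # temperature < 1 sharpens

    if top_k is not None:
        # Keep only the k largest logits (ties with the k-th are kept too).
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    probs = torch.softmax(logits, dim=-1)

    if top_p is not None:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop a token if the mass before it already reaches top_p;
        # the most probable token is therefore always kept.
        drop = (cumulative - sorted_probs) >= top_p
        sorted_probs = sorted_probs.masked_fill(drop, 0.0)
        probs = torch.zeros_like(probs).scatter(0, sorted_idx, sorted_probs)
        probs = probs / probs.sum()             # renormalize the kept tokens

    return torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(0)
example_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
print(sample_next_token(example_logits, top_k=1))  # 0
```

With `top_k=1` the function reduces to greedy decoding, and a very small temperature has nearly the same effect.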

Emergent Behaviors

Large language models sometimes exhibit capabilities not obvious in smaller models.

Examples:

Capability	Example
In-context learning	Learn from prompt examples
Few-shot reasoning	Solve unseen tasks
Tool coordination	Use external APIs
Chain-of-thought reasoning	Multi-step explanations
Code synthesis	Generate programs

The exact causes remain an active research topic.

Some behaviors appear gradually with scale. Others appear more abruptly.

Failure Modes

Language models have important limitations.

Failure mode	Example
Hallucination	False factual claims
Memorization	Reproducing training data
Bias	Harmful stereotypes
Prompt injection	Unsafe instruction following
Context confusion	Losing track of dialogue
Arithmetic weakness	Calculation errors

Language models optimize token prediction, not truth, reasoning correctness, or safety.

This distinction is critical when deploying systems in high-stakes settings.

Pretraining and Fine-Tuning

Most modern systems use two stages:

Stage	Purpose
Pretraining	Learn general language structure
Fine-tuning	Adapt to downstream tasks

Pretraining uses large-scale next-token prediction.

Fine-tuning adapts the model for:

Task	Example
Dialogue	Chat systems
Translation	Multilingual systems
Coding	Code generation
QA	Reading comprehension
Summarization	Condensed outputs

Instruction tuning and RLHF further shape model behavior.

PyTorch Training Example

A simplified transformer language model training step:

import torch.nn as nn

def training_step(model, batch, optimizer):
    # batch["targets"] holds the next-token IDs for each input position
    input_ids = batch["input_ids"]  # [B, T]
    targets = batch["targets"]      # [B, T]

    logits = model(input_ids)
    # logits: [B, T, V]

    B, T, V = logits.shape

    loss_fn = nn.CrossEntropyLoss()

    # Merge batch and time dimensions for the loss
    loss = loss_fn(
        logits.reshape(B * T, V),
        targets.reshape(B * T),
    )

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()

The targets are usually the inputs shifted by one token, so that the target at position $t$ is the input token at position $t+1$.

Summary

Language modeling predicts token sequences autoregressively. Modern language models use transformer architectures with causal masking and next-token prediction objectives.

Key components include tokenization, embeddings, positional encoding, self-attention, output projections, and decoding algorithms. Training uses cross-entropy loss over large text corpora. Evaluation often uses perplexity.

Large language models extend basic language modeling into dialogue, reasoning, retrieval augmentation, tool use, and multimodal systems, but they still inherit core limitations from probabilistic next-token prediction.