Chapter 22 sections from Deep Learning with PyTorch.
A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence
Statistical language models estimate probabilities from counts of how often token sequences (n-grams) occur in a corpus.
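As a minimal sketch of count-based estimation, the example below fits a bigram model by maximum likelihood on a three-sentence toy corpus. The corpus, the `<s>` and `</s>` boundary markers, and the function name `train_bigram` are assumptions made for illustration. No smoothing is applied, so unseen bigrams simply have no entry.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Return maximum-likelihood estimates of P(next | prev) from bigram counts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        # Boundary markers let the model learn how sentences start and end.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[prev] = {tok: c / total for tok, c in nxt_counts.items()}
    return probs

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy data
bigram_probs = train_bigram(corpus)
print(bigram_probs["the"])  # "cat" follows "the" in 2 of 3 cases, "dog" in 1 of 3
```

Each estimate is just a relative frequency: the count of a bigram divided by the count of its first token.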
Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.
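The generation loop can be sketched with a hand-written table standing in for a trained model. Everything here is hypothetical: the table `NEXT_TOKEN_PROBS`, its probabilities, and the `generate` function. For brevity the table conditions only on the most recent token, whereas a real autoregressive model conditions on the whole prefix.

```python
import random

# Hypothetical toy distribution P(next token | previous token).
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(max_len=10, greedy=True, seed=0):
    """Repeat the next-token prediction step until </s> or max_len is reached."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_len):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        if greedy:
            # Greedy decoding: always take the most probable next token.
            nxt = max(dist, key=dist.get)
        else:
            # Sampling: draw the next token in proportion to its probability.
            nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "</s>":
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start marker

print(generate())  # ['the', 'cat', 'sat']
```

Switching `greedy=False` turns the same loop into a sampler, which is the usual way to get varied output from one model.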
Masked language modeling trains a model to recover missing tokens from their surrounding context.
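A rough sketch of how masked training examples might be built: some input tokens are replaced by a mask symbol, and the original tokens at those positions become the prediction targets. The `[MASK]` string, the 15% default rate, and the function name `mask_tokens` are illustrative assumptions, not details taken from the chapter.

```python
import random

MASK = "[MASK]"  # assumed mask symbol

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels hold the original token only at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position is not scored by the loss
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.5)
print(inputs)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each gap, it can use left and right context, unlike next-token prediction.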
A language model does not read raw text directly. It reads tokens. Tokenization is the process that maps a string of text into a sequence of discrete symbols, and later maps generated symbols back into text.
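The two directions of tokenization can be illustrated with a character-level tokenizer, the simplest possible case. The class `CharTokenizer` and its method names are hypothetical. Production tokenizers usually work on subwords, but they expose the same encode and decode round trip.

```python
class CharTokenizer:
    """Toy tokenizer whose symbols are the distinct characters of a training text."""

    def __init__(self, text):
        self.vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(self.vocab)}  # string -> ID
        self.itos = {i: ch for ch, i in self.stoi.items()}      # ID -> string

    def encode(self, s):
        # Raises KeyError for characters that were not in the training text.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(tok.decode(ids))  # hello
```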
Subword methods split text into units smaller than words but usually larger than single characters.
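One widely used family of subword methods builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols, as in byte-pair encoding (BPE). The summary above does not say which method the chapter uses, so the sketch below is only an illustration of a single BPE-style merge step. The word frequencies and function names are made up for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy word frequencies; each word starts as a sequence of characters.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}
pair = most_frequent_pair(words)
words_after = merge_pair(words, pair)
print(pair)         # ('u', 'g')
print(words_after)  # "hug" is now ('h', 'ug'), "hugs" is ('h', 'ug', 's')
```

Running the two functions in a loop for a fixed number of merges would grow the vocabulary from single characters toward frequent multi-character units.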
After tokenization, text is represented as integer token IDs.
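A small sketch of the ID representation, assuming whitespace tokenization and two special symbols, `<pad>` and `<unk>`, whose names and ID assignments are illustrative choices. Sequences are padded to a common length so a batch forms a rectangular array of integers.

```python
def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Assign consecutive integer IDs, special symbols first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_batch(sentences, vocab):
    """Map tokens to IDs, using <unk> for unknown tokens and <pad> to equalize length."""
    ids = [[vocab.get(tok, vocab["<unk>"]) for tok in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in ids)
    return [seq + [vocab["<pad>"]] * (max_len - len(seq)) for seq in ids]

vocab = build_vocab(["the cat sat", "the dog"])
batch = encode_batch(["the cat sat", "the bird"], vocab)
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'dog': 5}
print(batch)  # [[2, 3, 4], [2, 1, 0]]
```

A nested list like `batch` can be passed to `torch.tensor` to obtain the integer tensor that an embedding layer expects.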
A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case.
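As one concrete instance, the next-token prediction objective can be written as the average cross-entropy between the model's predicted distribution and the token that actually follows. The sketch below uses plain Python with hand-picked logits so the arithmetic is visible. The function names and numbers are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Mean negative log-probability assigned to the correct next token."""
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Uniform scores over 4 tokens: the loss equals log(4), about 1.386.
uniform_loss = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
# A model that strongly favors the correct token gets a much lower loss.
confident_loss = next_token_loss([[0.0, 0.0, 5.0, 0.0]], [2])
print(uniform_loss, confident_loss)
```

In PyTorch the same quantity is typically computed with `torch.nn.functional.cross_entropy` applied to a logits tensor and a tensor of target IDs.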