Chapter 15 sections from Deep Learning with PyTorch.
Attention is a method for letting a model choose which parts of an input are most relevant when producing an output.
Self-attention is attention applied within a single sequence. The same input supplies the queries, keys, and values. Each position builds a new representation by reading from other positions in the same sequence.
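The following is a minimal sketch of self-attention in plain Python, assuming the scaled dot-product form (scores are dot products of queries and keys, divided by the square root of the key dimension, then passed through a softmax). The helper names `softmax`, `matmul`, `transpose`, and `self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the toy input are all hypothetical and chosen for illustration only. In practice the projections would be learned parameters.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Plain nested-list matrix product: (n x k) times (k x m) gives (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(x, w_q, w_k, w_v):
    # The same input x supplies the queries, keys, and values.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # scores[i][j] measures how strongly position i attends to position j.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, v), weights


# Toy sequence of 3 tokens with 2 features each (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output position is a convex combination of the value vectors from the same sequence.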
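Multi-head attention runs several attention operations in parallel, each with its own learned projections of the queries, keys, and values. The outputs of the heads are concatenated and projected back to the model dimension.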
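As a rough illustration of several attention operations running side by side, the sketch below splits each token vector into equal slices, applies attention to each slice independently, and concatenates the results. This is a simplification under stated assumptions: it omits the learned per-head projections and the final output projection that a real multi-head layer would have, and the names `attend` and `multi_head_self_attention` are hypothetical.

```python
import math


def attend(q, k, v):
    # Scaled dot-product attention over nested lists (assumed form).
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append(
            [sum(e / total * vj[c] for e, vj in zip(exps, v)) for c in range(len(v[0]))]
        )
    return out


def multi_head_self_attention(x, num_heads):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "feature size must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    # The heads do not depend on each other, so they could run in parallel;
    # a simple loop is used here for clarity.
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        x_h = [row[lo:hi] for row in x]  # this head's slice of every token
        head_outputs.append(attend(x_h, x_h, x_h))
    # Concatenate the per-head outputs back into one vector per token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]


# Toy sequence: 3 tokens, 4 features, 2 heads of 2 features each.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 2.0]]
y = multi_head_self_attention(x, num_heads=2)
```

The output has the same shape as the input, and each head computes its own attention weights from its own slice of the features.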
Self-attention compares tokens by content. By itself, it has no built-in notion of token order.
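One way to see the lack of order information is to reorder the input tokens and check that the outputs are simply reordered in the same way. The sketch below does this in plain Python, assuming scaled dot-product attention with no projections and no positional signal. The function name `attend` and the toy vectors are hypothetical.

```python
import math


def attend(x):
    # Self-attention with queries, keys, and values all equal to x.
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
shuffled = [tokens[2], tokens[0], tokens[1]]  # same tokens, different order

out = attend(tokens)
out_shuffled = attend(shuffled)

# If order carried no information, the shuffled output is the original
# output with its rows permuted in the same way as the input.
expected = [out[2], out[0], out[1]]
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out_shuffled, expected)
    for a, b in zip(row_a, row_b)
)
```

Here `same` evaluates to `True`: each token receives the same representation wherever it appears in the sequence, so order has to be supplied separately if it matters.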
A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation.
A transformer decoder maps a partial output sequence to predictions for the next token or next output step.
Standard self-attention compares every token with every other token. For a sequence of length $T$, this produces a $T \times T$ attention matrix, so the time and memory needed for the attention scores grow as $O(T^2)$: doubling the sequence length quadruples the cost.
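The arithmetic can be checked directly. The short sketch below counts the entries of the attention matrix for a few sequence lengths and estimates the memory they occupy, assuming 4 bytes per entry (32-bit floats) and a single attention head. The function names and the chosen lengths are illustrative only.

```python
def attention_matrix_entries(seq_len):
    # One score for every (query position, key position) pair.
    return seq_len * seq_len


def attention_matrix_bytes(seq_len, bytes_per_entry=4):
    # Assumes float32 scores and a single head; both are assumptions.
    return attention_matrix_entries(seq_len) * bytes_per_entry


for t in (512, 1024, 2048):
    entries = attention_matrix_entries(t)
    mib = attention_matrix_bytes(t) / (1024 * 1024)
    print(f"T={t}: {entries} entries, {mib:.0f} MiB")
# T=512: 262144 entries, 1 MiB
# T=1024: 1048576 entries, 4 MiB
# T=2048: 4194304 entries, 16 MiB
```

Each doubling of the sequence length multiplies the size of the attention matrix by four.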