Chapter 20 sections from Deep Learning with PyTorch.
8 items
Sequence models often need to decide which parts of an input are relevant to a particular output.
Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation to allow a decoder to selectively focus on different encoder states during generation.
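As a rough illustration of the idea, the sketch below scores each encoder state against a single decoder state with a small feed-forward network, v^T tanh(W_q q + W_k k), then normalizes the scores with a softmax. The class name, layer names, and dimensions are assumptions chosen for this example, not taken from the chapter.

```python
import torch
from torch import nn


class AdditiveAttention(nn.Module):
    """Illustrative additive scoring: v^T tanh(W_q q + W_k k)."""

    def __init__(self, query_dim: int, key_dim: int, hidden_dim: int):
        super().__init__()
        # Hypothetical layer names; each maps its input into a shared hidden space.
        self.w_q = nn.Linear(query_dim, hidden_dim, bias=False)
        self.w_k = nn.Linear(key_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, query, keys):
        # query: (batch, query_dim), one decoder state per example
        # keys:  (batch, src_len, key_dim), the encoder states
        hidden = torch.tanh(self.w_q(query).unsqueeze(1) + self.w_k(keys))
        scores = self.v(hidden).squeeze(-1)        # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)    # sums to 1 over src_len
        # Weighted sum of encoder states: (batch, key_dim)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)
        return context, weights


torch.manual_seed(0)
attn = AdditiveAttention(query_dim=16, key_dim=32, hidden_dim=8)
decoder_state = torch.randn(2, 16)
encoder_states = torch.randn(2, 5, 32)
context, weights = attn(decoder_state, encoder_states)
```

Here the encoder states serve as both keys and values, which is a common simplification; the weights show how strongly the decoder state attends to each source position.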
Dot-product attention uses an inner product to measure how well a query matches a key.
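A minimal sketch of this scoring rule follows. The function name and the optional division by the square root of the key dimension (the widely used "scaled" variant) are assumptions for illustration; the toy tensors are made up so that the query lines up with the second key.

```python
import math
import torch


def dot_product_attention(query, keys, values, scale=True):
    # query: (batch, q_len, d), keys: (batch, k_len, d), values: (batch, k_len, d_v)
    scores = query @ keys.transpose(-2, -1)      # inner products: (batch, q_len, k_len)
    if scale:
        # Assumed scaled variant: keeps scores in a moderate range as d grows.
        scores = scores / math.sqrt(query.size(-1))
    weights = torch.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ values, weights


# Toy data: the query points in the same direction as the second key.
keys = torch.tensor([[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]])
values = torch.tensor([[[10.0], [20.0], [30.0]]])
query = torch.tensor([[[0.0, 4.0]]])
output, weights = dot_product_attention(query, keys, values)
```

The largest weight lands on the second key, because its inner product with the query is the largest of the three.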
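Self-attention is attention applied within a single sequence: the queries, keys, and values are all computed from the same sequence, so every position can attend to every other position.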
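The following sketch shows one way this could look for a single head. The projection layers `to_q`, `to_k`, and `to_v` and the sizes are hypothetical; the point is only that all three projections read from the same tensor `x`.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
x = torch.randn(1, 4, d_model)   # one sequence of 4 token vectors

# Hypothetical projections; queries, keys, and values all come from x.
to_q = nn.Linear(d_model, d_model, bias=False)
to_k = nn.Linear(d_model, d_model, bias=False)
to_v = nn.Linear(d_model, d_model, bias=False)

q, k, v = to_q(x), to_k(x), to_v(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, 4, 4): every token vs. every token
weights = torch.softmax(scores, dim=-1)
out = weights @ v                                        # (1, 4, d_model)
```

The weight matrix is square because the sequence is compared with itself, and the output has one updated vector per input position.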
Cross-attention is attention between two different sequences or sources of information. The queries come from one sequence, while the keys and values come from another.
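As an illustration, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` with a single head. The tensor names `decoder_states` and `encoder_states` and their sizes are assumptions for this example; any two sources with a matching feature size would do.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
decoder_states = torch.randn(1, 3, d_model)   # queries: 3 target positions
encoder_states = torch.randn(1, 6, d_model)   # keys and values: 6 source positions

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)
out, weights = cross_attn(
    query=decoder_states,   # comes from one sequence
    key=encoder_states,     # comes from the other sequence
    value=encoder_states,
)
```

The output has one vector per query position, while the weights have one row per query position and one column per source position, so the two sequences may have different lengths.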
Multi-head attention runs several attention operations in parallel. Each head has its own query, key, and value projections. The outputs of the heads are concatenated and projected back to the model dimension.
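The steps above can be sketched from scratch as follows. This is a simplified version under stated assumptions: the class and layer names are hypothetical, the per-head projections are stored side by side in one wide linear layer per role, and masking and dropout are left out.

```python
import math
import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Each wide layer holds the separate projections of all heads.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # projects back to the model dimension

    def _split_heads(self, t):
        b, n, _ = t.shape
        # (b, n, d_model) -> (b, heads, n, d_head)
        return t.view(b, n, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, query, key, value):
        q = self._split_heads(self.w_q(query))
        k = self._split_heads(self.w_k(key))
        v = self._split_heads(self.w_v(value))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (b, heads, q_len, k_len)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                                         # (b, heads, q_len, d_head)
        b, h, n, d = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n, h * d)         # concatenate the heads
        return self.w_o(concat), weights


torch.manual_seed(0)
mha = MultiHeadAttention(d_model=16, num_heads=4)
x = torch.randn(2, 5, 16)
out, weights = mha(x, x, x)
```

Each of the four heads produces its own attention pattern over the five positions, and the final projection mixes the concatenated head outputs back into a 16-dimensional vector per position.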
Attention gives a model a direct path between any two positions in a sequence, regardless of how far apart they are.
Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model.
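The retrieval view can be made concrete with a tiny example. The tensors below are made up for illustration: three keys act as addresses, three values hold the content, and the query points mostly at the second key. Unlike a dictionary lookup, the result is a weighted mix of all values, so gradients can flow back to the query.

```python
import torch

keys = torch.eye(3)                                           # three stored "addresses"
values = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # content held at each address
query = torch.tensor([0.0, 5.0, 0.0], requires_grad=True)     # points mostly at key 1

weights = torch.softmax(keys @ query, dim=0)   # soft match between the query and each key
retrieved = weights @ values                   # weighted mix of the stored values

# Because the lookup is soft, it is differentiable with respect to the query.
retrieved.sum().backward()
```

The retrieved vector lies close to the second value, and `query.grad` is populated after the backward pass, which is what allows the model to learn what to ask for.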