Chapter 21 sections from Deep Learning with PyTorch.
7 items
A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors.
A transformer decoder is a neural network block that maps a prefix sequence to a sequence of next-token representations. It is used when the model must generate output one step at a time.
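
As a rough illustration of step-at-a-time generation, the sketch below uses a causal mask, one common way to keep each position from attending beyond its own prefix. The summary above does not name this mechanism, so treat the mask, the helper names `causal_mask` and `masked_softmax`, and the uniform scores as assumptions made for the example. It is written in plain Python to stay dependency-free.

```python
import math

def causal_mask(T):
    """Return a T x T boolean mask; entry [i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        # Subtract the max over allowed entries for numerical stability.
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
scores = [[0.0] * T for _ in range(T)]  # uniform raw scores, for illustration only
weights = masked_softmax(scores, causal_mask(T))
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and gives zero weight to later positions, so the representation at each step depends only on the prefix.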
Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order.
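
The lack of built-in order can be checked directly: if the input tokens are shuffled, the outputs of self-attention are shuffled the same way and are otherwise unchanged. The sketch below is a minimal plain-Python version under simplifying assumptions: a single head and identity projections (queries, keys, and values all equal the input). The function name `self_attention` and the toy input are hypothetical.

```python
import math

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
X_perm = [X[i] for i in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Output i of the permuted input should match output perm[i] of the original.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(Y_perm[i], Y[p])
)
print(same)
```

Because the result follows the shuffle exactly, nothing in the computation distinguishes one ordering from another, which is why order information has to be supplied separately.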
A transformer is a deep stack of layers, each combining an attention block and a feedforward block.
Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput.
Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size $T \times T$.
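
To make the quadratic growth concrete, the following sketch counts the entries of a $T \times T$ score matrix and estimates its size in memory. The 4 bytes per entry (32-bit floats), the single head, and the helper names are assumptions for illustration; real models multiply this by the number of heads and the batch size.

```python
def score_matrix_entries(T):
    """Number of pairwise scores for a sequence of length T."""
    return T * T

def score_matrix_bytes(T, bytes_per_entry=4, num_heads=1):
    """Approximate memory for the score matrices of one sequence."""
    return num_heads * T * T * bytes_per_entry

for T in (1024, 2048, 4096):
    mib = score_matrix_bytes(T) / 2**20
    print(f"T={T:5d}  entries={score_matrix_entries(T):>10,d}  fp32 size={mib:.0f} MiB")
```

Each doubling of $T$ multiplies the entry count and the memory by four: 4 MiB at 1024 tokens, 16 MiB at 2048, and 64 MiB at 4096 under these assumptions.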
Dense transformers activate every parameter for every token.