Chapter 21

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Transformer Encoders

A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors.
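The sequence-in, sequence-out behavior can be checked with PyTorch's built-in encoder modules. This is a minimal sketch: the batch size, sequence length, model width, and layer count below are arbitrary values chosen for illustration, not settings taken from the chapter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not values from the chapter).
batch, seq_len, d_model = 2, 5, 16

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=32, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # turn off dropout so the forward pass is deterministic

x = torch.randn(batch, seq_len, d_model)  # one vector per input position
with torch.no_grad():
    y = encoder(x)                        # one contextualized vector per position

print(x.shape, y.shape)  # the output has the same shape as the input
```

Each output vector depends on the whole input sequence, which is what "contextualized" means here.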

Transformer Decoders

A transformer decoder is a neural network block that maps a prefix sequence to a sequence of next-token representations. It is used when the model must generate output one step at a time.
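Step-by-step generation is usually enforced with a causal mask, so that position t can attend only to positions up to t. The sketch below assumes that standard masking scheme and uses random scores in place of real query-key products; the sequence length is arbitrary.

```python
import torch

torch.manual_seed(0)

T = 4  # illustrative sequence length (assumption)

# True above the diagonal marks future positions that must be hidden.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

scores = torch.randn(T, T)                            # stand-in for q @ k.T / sqrt(d)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)               # each row sums to 1 over the prefix

print(weights)
```

Row t of `weights` is zero for every column greater than t, so the representation at step t never sees later tokens.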

Positional Encoding

Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order.
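The lack of order information can be seen directly: without positional signals, permuting the input rows simply permutes the output rows. The following is a minimal single-head sketch with random projection matrices; `Wq`, `Wk`, `Wv`, and the sizes are hypothetical and exist only for this demonstration.

```python
import math
import torch

torch.manual_seed(0)

T, d = 5, 8  # illustrative sizes (assumptions)
x = torch.randn(T, d, dtype=torch.float64)
Wq, Wk, Wv = (torch.randn(d, d, dtype=torch.float64) for _ in range(3))

def self_attention(inputs):
    """Single-head scaled dot-product self-attention with no positional input."""
    q, k, v = inputs @ Wq, inputs @ Wk, inputs @ Wv
    weights = torch.softmax(q @ k.T / math.sqrt(d), dim=-1)
    return weights @ v

perm = torch.randperm(T)
out = self_attention(x)
out_perm = self_attention(x[perm])

# Shuffling the tokens shuffles the outputs in the same way and changes nothing else.
order_blind = torch.allclose(out[perm], out_perm)
print(order_blind)
```

Because the outputs carry no trace of the original ordering, position must be injected separately, which is the job of a positional encoding.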

Residual and Normalization Layers

Transformer layers are deep stacks of attention and feedforward blocks.

Scaling Transformers

Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput.

Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size $T \times T$.
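To make the quadratic growth concrete, the sketch below counts the bytes needed to hold attention scores, assuming one $T \times T$ matrix per head and per batch element stored as 32-bit floats. The function name and its defaults are illustrative and do not come from any library.

```python
def attention_score_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_element=4):
    """Bytes needed to materialize the attention score matrices.

    Assumes one seq_len x seq_len matrix per head and per batch element.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for T in (1024, 2048, 4096):
    mib = attention_score_bytes(T) / 2**20
    print(f"T={T}: {mib:.0f} MiB per head")
```

Doubling the sequence length quadruples the memory for the scores, which is the cost that efficient attention variants try to avoid.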

Sparse Expert Architectures

Dense transformers activate every parameter for every token.
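By contrast, a sparse expert layer sends each token to only a few expert subnetworks. The sketch below shows one common routing scheme, top-k gating over a learned linear router; the sizes, the `router` module, and the choice of `top_k = 2` are assumptions made for illustration rather than the chapter's specific design.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions).
num_tokens, d_model, num_experts, top_k = 6, 8, 4, 2

x = torch.randn(num_tokens, d_model)
router = nn.Linear(d_model, num_experts)  # hypothetical router: one logit per expert

logits = router(x)                                # (num_tokens, num_experts)
top_vals, top_idx = logits.topk(top_k, dim=-1)    # keep the k best experts per token
gates = torch.softmax(top_vals, dim=-1)           # mixing weights over chosen experts

print(top_idx)  # which experts each token would be sent to
print(gates)    # how the chosen experts' outputs would be weighted
```

With these settings each token touches 2 of the 4 experts, so only a fraction of the expert parameters is active per token.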
