Chapter 26 sections from Deep Learning with PyTorch.
8 items
Data parallelism is the simplest and most widely used form of distributed deep learning.
Distributed Data Parallel, usually abbreviated as DDP, is PyTorch’s primary system for synchronous multi-GPU training.
Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs.
Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.
Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster.
Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model.
Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks.
Training produces model parameters. Inference uses those parameters to generate predictions.