Chapter 26

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Data Parallelism

Data parallelism is the simplest and most widely used form of distributed deep learning.

Distributed Data Parallel

Distributed Data Parallel, usually abbreviated as DDP, is PyTorch’s primary system for synchronous multi-GPU training.

Model Parallelism

Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs.
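
As a rough illustration of that split, the sketch below places the first half of a small network on one device and the second half on another, moving activations between them in the forward pass. The class name `TwoDeviceMLP`, the helper `pick_devices`, and the layer sizes are hypothetical and chosen only for this example; when fewer than two GPUs are available the sketch falls back to the CPU so it still runs.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return two devices: two GPUs if present, otherwise CPU twice (fallback assumption)."""
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoDeviceMLP(nn.Module):
    """Hypothetical two-part model: part0 lives on dev0, part1 lives on dev1."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        # Each part's parameters are stored only on its own device.
        self.part0 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.part1 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        # Run the first part on dev0, then copy the activations to dev1.
        h = self.part0(x.to(self.dev0))
        return self.part1(h.to(self.dev1))


model = TwoDeviceMLP(*pick_devices())
out = model(torch.randn(8, 16))  # batch of 8 inputs with 16 features each
```

In this naive form only one device is busy at a time, because the second part has to wait for the first part's activations.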

Pipeline Parallelism

Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.
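
One common way to reduce that idle time is to cut each batch into micro-batches so that later stages can start before earlier stages have finished the whole batch. The sketch below is a simplified, forward-only simulation of such a schedule, not a description of any particular library: it assumes every stage takes one time step per micro-batch, and the function names `forward_schedule` and `idle_fraction` are hypothetical.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return schedule[t][s]: the micro-batch stage s processes at step t, or None if idle.

    Assumes each stage needs exactly one time step per micro-batch, so
    stage s can start micro-batch m at step s + m.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            m = t - s
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule


def idle_fraction(schedule):
    """Fraction of (step, stage) slots in which a stage has nothing to do."""
    slots = [slot for row in schedule for slot in row]
    return sum(slot is None for slot in slots) / len(slots)


# Four stages: one large batch versus the same batch split into 8 micro-batches.
no_split = idle_fraction(forward_schedule(num_stages=4, num_microbatches=1))
split = idle_fraction(forward_schedule(num_stages=4, num_microbatches=8))
```

Under these assumptions, a four-stage pipeline fed a single batch leaves three quarters of its slots idle, while splitting the batch into eight micro-batches brings the idle share down to 12 of 44 slots.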

Fault Tolerance

Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster.

Multi-Node Training

Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model.

Training Foundation Models

Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks.

Inference Optimization

Training produces model parameters. Inference uses those parameters to generate predictions.

Sections

Data Parallelism

Distributed Data Parallel

Model Parallelism

Pipeline Parallelism

Fault Tolerance

Multi-Node Training

Training Foundation Models

Inference Optimization