pytorch | brain

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Variational Inference

Bayesian neural networks require inference over a posterior distribution: $$ p(\theta \mid D) = \frac{p(D \mid \theta)p(\theta)}{p(D)}. $$ For modern neural networks, this posterior is usually impossible to compute exactly. The parameter space is extremely high-dimensional, the likelihood is nonlinear, and the evidence integral is intractable. Variational inference transforms this inference problem into an optimization problem. Instead of computing the true posterior directly, we approximate it with a simpler distribution...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Practical Probabilistic Modeling in PyTorch

Probabilistic deep learning adds distributions to ordinary neural networks. A model no longer predicts only a value. It predicts parameters of a probability distribution, samples from latent variables, estimates uncertainty, or defines a likelihood for observed data. In PyTorch, this usually means combining three pieces:

| Piece | Role |
| --- | --- |
| Neural network | Computes distribution parameters |
| Probability distribution | Defines likelihood or sampling rule |
| Loss function | Optimizes negative log likelihood or ELBO |

The neural network...

Writes › Book › Deep Learning with PyTorch › Part VIII ›

Chapter 25

Sections
25.1 Dimensionality Reduction
25.2 Sparse Autoencoders
25.3 Denoising Autoencoders
25.4 Variational Autoencoders
25.5 Latent Space Manipulation
25.6 Representation Learning

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Batch Normalization

Batch normalization is a layer that normalizes activations using statistics computed from a mini-batch. It was introduced to make deep networks easier to train, especially convolutional and feedforward networks. The basic idea is simple: keep intermediate activations in a controlled numerical range, then let the model learn how much scale and shift it wants. A neural network layer often produces pre-activations $$ z = Wx + b. $$ If the...
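The normalize-then-scale-and-shift idea described above can be sketched in plain Python for a single feature. This is only an illustration: the names `gamma`, `beta`, and `eps` are assumptions chosen for this sketch, and it does not reproduce PyTorch's own layer (there is no running average and no training/evaluation switch).

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased variance over the mini-batch.
    var = sum((v - mean) ** 2 for v in batch) / n
    # gamma and beta stand in for the learned scale and shift (hypothetical names).
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
rescaled = batch_norm(activations, gamma=2.0, beta=3.0)
print(normalized)
print(rescaled)
```

With the default `gamma=1.0` and `beta=0.0`, the output has mean close to zero and variance close to one. Other values move the activations to whatever range the model prefers.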

Writes › Book › Deep Learning with PyTorch ›

Part VIII

Chapters
Chapter 25
Chapter 26
Chapter 27
Chapter 28

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Tool Use and Agents

A language model becomes more useful when it can interact with external systems. Text generation alone is limited by the model’s training data, context window, arithmetic accuracy, and lack of persistent access to the world. Tool use extends the model by allowing it to call functions, search indexes, execute code, read files, query databases, use calculators, and operate APIs. An agent is a model-centered system that selects actions over time....

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 25 ›

Dimensionality Reduction

High-dimensional data often contains structure that can be described with fewer variables than the raw representation suggests. An image with $224 \times 224 \times 3$ pixels has 150,528 numerical values, but natural images occupy a much smaller part of that space. A sentence may contain many tokens, but its meaning can often be summarized by a shorter vector. A user profile may contain thousands of observed interactions, but many of...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Attention Mechanisms

Attention is a method for letting a model choose which parts of an input are most relevant when producing an output. It replaces the idea that all input positions should contribute equally. In a sequence model, attention allows one token to look at other tokens. In an image model, it allows one patch to look at other patches. In a multimodal model, it allows text tokens to look at image...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 30 ›

Attribution Methods

Attribution methods assign credit or blame to parts of an input, hidden representation, neuron, feature, or training example for a model output. Saliency maps are one family of attribution methods. This section treats attribution more broadly. The central question is: which parts of the computation were responsible for this prediction? For an image classifier, attribution may ask which pixels supported the class “dog.” For a text classifier, it may ask...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Subword Methods

Subword methods split text into units smaller than words but usually larger than single characters. They are the standard tokenization strategy for modern language models because they balance coverage, compression, and generalization. A word-level tokenizer has trouble with rare words. A character-level tokenizer creates long sequences. Subword tokenization sits between these extremes. For example:

    unhelpfulness

may be split as:

    ["un", "help", "ful", "ness"]

The model can represent the word even...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

CNN Architectures

A convolutional neural network architecture defines how convolutional layers, activation functions, normalization layers, pooling layers, residual paths, and classifier heads are arranged. The architecture determines the flow of tensors through the model. A CNN usually follows a staged design. Early layers operate on large spatial maps with few channels. Later layers operate on smaller spatial maps with more channels. This design gradually trades spatial resolution for semantic abstraction. The Basic...

Saturation and Gradient Flow

Activation functions control both the forward signal and the backward signal. In the forward pass, they transform pre-activations into activations. In the backward pass, their derivatives decide how much gradient passes to earlier layers. This second role is critical. A network can have a reasonable forward computation but still train poorly if gradients vanish, explode, or become blocked by saturated activations.

Forward Signal and Backward Signal

Consider one layer of...

ELU, GELU, and Swish

ReLU and its variants improved optimization in deep networks, but they still have limitations. ReLU is not smooth at zero, discards all negative values, and can produce dead activations. Later activation functions attempted to preserve the optimization advantages of ReLU while improving gradient behavior, smoothness, and representational flexibility. Three important modern activations are ELU, GELU, and Swish. These functions appear frequently in modern convolutional networks, transformers, and large language models....

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Pretraining Objectives

A large language model is trained in two broad phases. The first phase is pretraining. The second phase is adaptation. During pretraining, the model learns general statistical structure from a large corpus. During adaptation, the pretrained model is specialized for a task, instruction-following behavior, dialogue, tool use, or preference alignment. A pretraining objective defines the learning problem used before task-specific supervision. It tells the model what to predict, what loss...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 30 ›

Saliency Maps

A saliency map is a visualization that assigns an importance score to each part of an input. For an image model, the saliency map usually assigns a score to each pixel or image region. For a text model, it may assign a score to each token. The goal is to estimate which parts of the input most influenced the model’s prediction. Saliency methods are often used for model inspection. They...

Margin-Based Losses

Margin-based losses are used when the goal is not only to make the correct prediction, but to make it by a sufficient margin. A margin measures separation. In classification, it measures how much more strongly the model favors the correct class than an incorrect class. These losses are common in support vector machines, metric learning, ranking systems, face recognition, verification tasks, and some contrastive representation learning methods. The central idea...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Transformer Decoders

A transformer decoder is a neural network block that maps a prefix sequence to a sequence of next-token representations. It is used when the model must generate output one step at a time. Decoder-only transformers are the core architecture behind GPT-style language models. Encoder-decoder transformers also use decoder blocks, but those decoders include an additional cross-attention sublayer that reads encoder outputs.

The Decoder Problem

Suppose we have a sequence of...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

In-Context Learning

Large language models can often perform new tasks without updating their parameters. Instead of retraining the model, we provide examples or instructions directly in the prompt. The model adapts its behavior dynamically during inference. This phenomenon is called in-context learning. A model trained only with next-token prediction can learn behaviors such as:

| Capability | Example |
| --- | --- |
| Translation | English to French |
| Summarization | Compressing documents |
| Classification | Sentiment prediction |
| Reasoning | Solving math problems |

Code generation...

Writes › Book › Deep Learning with PyTorch › Part VIII ›

Chapter 26

Sections
26.1 Data Parallelism
26.2 Distributed Data Parallel
26.3 Model Parallelism
26.4 Pipeline Parallelism
26.5 Fault Tolerance
26.6 Multi-Node Training
26.7 Training Foundation Models
26.8 Inference Optimization

Leaky and Parametric ReLU

ReLU is simple and effective, but it has one sharp weakness. For negative inputs, the output is zero and the gradient is also zero. A unit that stays in this region receives no useful learning signal through that activation. Leaky ReLU and Parametric ReLU modify the negative side of ReLU so that some signal can still pass through.

Motivation

The standard ReLU is $$ \mathrm{ReLU}(x)=\max(0,x). $$ Its negative side is...
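A small scalar sketch makes the difference on the negative side concrete. It assumes a fixed slope of 0.01 for Leaky ReLU, and the name `negative_slope` is chosen for this sketch. In Parametric ReLU the slope would be a learned parameter rather than a constant.

```python
def relu(x):
    """Standard ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: zero for every negative input."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: keeps a small linear response for negative inputs."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of Leaky ReLU: small but nonzero on the negative side."""
    return 1.0 if x > 0 else negative_slope

for x in (-2.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For `x = -2.0`, ReLU returns zero with zero gradient, while Leaky ReLU returns a small negative value with gradient equal to the slope, so some learning signal still reaches earlier layers.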

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 18 ›

Latent Space Manipulation

Latent space manipulation studies how to change a learned representation $z$ in order to produce controlled changes in the decoded output. In an autoencoder, the encoder maps an input into a latent vector, $$ z = f_\theta(x), $$ and the decoder maps the latent vector back into data space, $$ \hat{x} = g_\phi(z). $$ If the latent space is well organized, small changes in $z$ produce meaningful changes in $\hat{x}$....

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Open Research Problems

Deep learning has made large empirical gains, but many scientific and engineering questions remain open. These problems matter because current systems are powerful yet incomplete. They can generalize impressively in some regimes and fail sharply in others. They can solve complex tasks while remaining difficult to interpret, verify, align, and deploy reliably. Open research problems are not only about building larger models. They concern the foundations of learning, the structure...

Writes › Book › Deep Learning with PyTorch › Part VI ›

Chapter 21

Sections
21.1 Transformer Encoders
21.2 Transformer Decoders
21.3 Positional Encoding
21.4 Residual and Normalization Layers
21.5 Scaling Transformers
21.6 Efficient Transformers
21.7 Sparse Expert Architectures

Gradient Flow in Deep Networks

Gradient flow describes how derivative information moves backward through a neural network during training. A model may have the correct architecture and loss function, yet train poorly because its gradients shrink, explode, become noisy, or fail to reach important layers. Backpropagation computes gradients. Gradient flow describes the quality of those gradients.

The Basic Idea

Consider a deep network written as a sequence of transformations: $$ h_1 = f_1(x), $$ $$...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Cross-Attention

Cross-attention is attention between two different sequences or sources of information. The queries come from one sequence, while the keys and values come from another. Self-attention asks: “Which positions in this same sequence should I use?” Cross-attention asks: “Which positions in another sequence should I use?” This distinction is central in encoder-decoder transformers, retrieval systems, image captioning models, text-to-image models, and multimodal systems.

Basic Idea

Suppose we have two sequences:...

Writes › Book › Deep Learning with PyTorch › Part IV ›

Chapter 15

Sections
15.1 Attention Mechanisms
15.2 Self-Attention
15.3 Multi-Head Attention
15.4 Positional Encoding
15.5 Transformer Encoders
15.6 Transformer Decoders
15.7 Efficient Attention Methods

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 19 ›

Encoder-Decoder Architectures

A sequence-to-sequence model maps one sequence to another sequence. The input and output may have different lengths. This setting appears in machine translation, summarization, speech recognition, dialogue, code generation, and many other tasks. A standard supervised sequence-to-sequence problem has an input sequence $$ x = (x_1, x_2, \ldots, x_S) $$ and an output sequence $$ y = (y_1, y_2, \ldots, y_T). $$ The input length $S$ and output length $T$...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Word Embeddings

Natural language models cannot operate directly on words as strings. A neural network receives numbers, performs arithmetic on those numbers, and produces numerical outputs. Before text can be processed by a model, words must be represented as vectors. A word embedding is a vector representation of a word. Instead of representing a word as a discrete symbol such as "cat" or "run", we represent it as a point in...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 17 ›

Recurrent Computation

A feedforward neural network processes inputs through a fixed sequence of layers. Once the output is produced, the computation ends. There is no memory of previous inputs. Sequential problems require a different structure. A model processing text, audio, or time series must preserve information across positions. The output at time $t$ should depend not only on the current input $x_t$, but also on earlier observations. Recurrent neural networks solve this...

Tensor Shapes, Dimensions, and Memory Layout

Deep learning systems manipulate tensors with millions or billions of numerical entries. Understanding the shape, dimensional structure, and memory organization of tensors is essential for building efficient neural networks in PyTorch. Many deep learning errors arise from incorrect tensor shapes rather than incorrect mathematics. Likewise, many performance problems arise from inefficient memory layouts or unnecessary tensor copies. A strong understanding of tensor structure therefore affects both correctness and computational efficiency....

Gradient Computation

Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters. If a model has parameters $\theta$ and loss $L$, training needs the gradient $$ \nabla_\theta L. $$ This gradient tells the optimizer how to update the parameters. If a parameter change increases the loss, the...

Writes › Book › Deep Learning with PyTorch › Part IX ›

Chapter 31

Sections
31.1 Vision-Language Models
31.2 Audio-Visual Learning
31.3 Unified Foundation Models
31.4 Retrieval Systems
31.5 Long-Horizon Agents

Linear Separability

Linear separability describes when a classification dataset can be divided perfectly by a linear decision boundary. It is one of the central geometric ideas behind linear classification. For binary classification, each example has an input vector $$ x_i \in \mathbb{R}^d $$ and a label $$ y_i \in \{-1,+1\}. $$ The dataset is linearly separable if there exists a weight vector $w\in\mathbb{R}^d$ and a bias $b\in\mathbb{R}$ such that $$ y_i(w^\top x_i...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Reinforcement Learning from Human Feedback

Instruction tuning teaches a model to imitate demonstrations. Reinforcement learning from human feedback, usually abbreviated RLHF, goes further. Instead of only copying target responses, the model learns to optimize behavior according to human preferences. The central idea is that many desirable properties of language model behavior are difficult to specify with simple supervised labels. For example:

| Desired property | Why it is difficult |
| --- | --- |
| Helpfulness | Depends on context and user intent |

Harmlessness...

Writes › Book › Deep Learning with PyTorch › Part III ›

Chapter 10

Sections
10.1 Parameter Initialization
10.2 Vanishing and Exploding Gradients
10.3 Batch Normalization
10.4 Layer Normalization
10.5 Group and Instance Normalization
10.6 Residual Connections
10.7 Stable Training in Deep Networks

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 13 ›

Transfer Learning

Transfer learning reuses a model trained on one task as the starting point for another task. In image classification, this usually means taking a convolutional network or vision transformer trained on a large image dataset, replacing its final classifier, and fine-tuning it on a smaller target dataset. The central idea is simple: early and middle layers learn reusable visual features. They detect edges, colors, textures, shapes, object parts, and higher-level...

Evaluation Metrics

Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different. For example, a classifier may train with cross-entropy loss, but report accuracy, precision, recall, F1 score, calibration error, and confusion matrices. The loss helps optimization. The metrics help judge whether the model is useful.

Loss Versus Metric

A loss function is optimized during training: $$...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Self-Attention

Self-attention is attention applied within a single sequence. The same input supplies the queries, keys, and values. Each position builds a new representation by reading from other positions in the same sequence. Given an input tensor $$ X \in \mathbb{R}^{B \times T \times D}, $$ where $B$ is batch size, $T$ is sequence length, and $D$ is model dimension, self-attention first projects $X$ into three tensors: $$ Q = XW_Q,\quad...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Forward Diffusion Processes

Diffusion models are generative models built around a simple idea: learn to reverse a gradual corruption process. The forward process starts with a clean data sample and repeatedly adds noise. After many small noise steps, the sample becomes almost indistinguishable from pure Gaussian noise. The model is then trained to invert this process, step by step, until noise becomes data. In this section, we study the forward diffusion process. This...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 25 ›

Variational Autoencoders

A variational autoencoder, or VAE, is an autoencoder with a probabilistic latent space. Instead of mapping an input $x$ to one fixed latent vector $z$, the encoder maps $x$ to a probability distribution over latent vectors. A standard autoencoder computes $$ z = f_\theta(x). $$ A variational autoencoder computes $$ q_\phi(z \mid x), $$ where $q_\phi$ is an approximate posterior distribution. The model samples $z$ from this distribution and decodes...

Symbolic Versus Dynamic Computation

Deep learning frameworks need a way to represent computation. Some systems represent computation as a graph built before execution. Some systems build the graph while ordinary program code runs. Some systems compile parts of the program into optimized graphs while preserving an imperative programming style. PyTorch began with a dynamic computation model. This means the graph is built from the operations that actually execute. This design makes PyTorch easy to...

Choosing and Combining Loss Functions

A loss function defines what the model is trained to improve. It translates a modeling goal into a scalar value that can be minimized by gradient-based optimization. The choice of loss function affects the learned representation, the gradient signal, the stability of training, and the behavior of the final model. Two models with the same architecture and data can learn different solutions if they use different losses.

Loss Functions as...

Matrix Operations

Matrix operations are the main arithmetic language of deep learning. A linear layer, an attention head, an embedding projection, and many normalization steps can be written as matrix expressions. PyTorch gives direct support for these operations through @, torch.matmul, torch.mm, torch.bmm, and functions in torch.linalg. This section introduces the matrix operations needed for neural network implementation and shape reasoning.

Matrix Shape

A matrix is a...

Likelihood-Based Objectives

Many deep learning loss functions can be understood as likelihood maximization. Instead of viewing training as minimizing an arbitrary error measure, we model the probability distribution of the data and choose parameters that make the observed data likely under that distribution. This viewpoint unifies regression, classification, sequence modeling, generative modeling, and probabilistic inference. Suppose a model with parameters $\theta$ defines a probability distribution $$ p_\theta(y \mid x). $$ Given a...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Multi-Head Attention

Multi-head attention runs several attention operations in parallel. Each head has its own query, key, and value projections. The outputs of the heads are concatenated and projected back to the model dimension. A single attention head computes one pattern of interaction. Multiple heads allow the model to learn several interaction patterns at the same time.

Motivation

A sentence contains many kinds of relationships. One token may need its subject. Another...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 28 ›

Energy-Based Models

Energy-based models, or EBMs, define probability distributions using energy functions rather than normalized output probabilities directly. Instead of predicting a probability with a softmax layer or autoregressive factorization, an energy-based model assigns a scalar energy to each configuration of variables. Low-energy configurations are treated as plausible. High-energy configurations are treated as unlikely. Energy-based modeling is one of the most general frameworks in machine learning. Boltzmann machines, restricted Boltzmann machines, Hopfield...

Linear Regression

Linear regression is the simplest supervised learning model used in deep learning. It maps an input vector to a numerical output by applying a linear transformation. Although the model is simple, it introduces the main structure of neural network training: parameters, predictions, loss functions, gradients, and optimization. A linear regression model assumes that the target value can be approximated by a weighted sum of the input features. If the input...

Gradient Descent

Gradient descent is the basic optimization method used to train neural networks. It updates model parameters in the direction that reduces the loss. A model has parameters $$ \theta $$ and a loss function $$ L(\theta). $$ The gradient of the loss is $$ \nabla_\theta L(\theta). $$ The gradient points in the direction of steepest increase. To reduce the loss, gradient descent moves in the opposite direction: $$ \theta \leftarrow...
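The idea of stepping against the gradient can be sketched on a one-dimensional quadratic. The teaser above is cut off before the full update rule, so this sketch assumes the common form with a fixed learning rate. The loss function, the starting point, and the name `learning_rate` are illustrative choices, not taken from the text.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad(theta):
    """Analytic gradient of the toy loss."""
    return 2.0 * (theta - 3.0)

theta = 0.0           # initial parameter value (arbitrary)
learning_rate = 0.1   # assumed fixed step size

for step in range(50):
    # Move opposite to the gradient, since the gradient points uphill.
    theta = theta - learning_rate * grad(theta)

print(theta, loss(theta))
```

Each step shrinks the distance to the minimum by a constant factor here, which is specific to this quadratic. Real neural network losses are not this well behaved.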

Writes › Book › Deep Learning with PyTorch ›

Part VI

Chapters
Chapter 21
Chapter 22

Multi-Task Objectives

Multi-task learning trains one model on several objectives at the same time. The model may predict several targets from the same input, share a common representation across related tasks, or combine supervised, self-supervised, and auxiliary losses. The basic idea is that related tasks can provide useful training signal to each other. A model trained only for one objective may learn narrow features. A model trained across related objectives can learn...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Inference Optimization

Training produces model parameters. Inference uses those parameters to generate predictions. Inference optimization studies how to make model execution faster, cheaper, smaller, and more memory-efficient while preserving acceptable output quality. For small models, naive inference may be sufficient. For foundation models, inference often becomes more expensive than training because deployed systems may serve millions or billions of requests. A language model trained once may perform inference continuously for years. Inference...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Instruction Tuning

Pretraining teaches a language model to predict text. It does not directly teach the model to follow user instructions, answer safely, maintain dialogue structure, or format outputs in a useful way. A pretrained model may continue text well but still behave poorly in interactive settings. For example, it may ignore instructions, generate irrelevant continuations, produce unsafe content, or imitate undesirable patterns from the training corpus. Instruction tuning adapts a pretrained...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 13 ›

Data Augmentation Strategies

Data augmentation creates modified versions of training examples without changing their labels. In image classification, common augmentations include random crops, flips, rotations, color changes, blur, noise, erasing, Mixup, and CutMix. The goal is to make the model less sensitive to irrelevant variation. A classifier should recognize a cat whether the image is slightly shifted, brighter, darker, cropped, or photographed from a different angle. Augmentation encodes these assumptions into the training...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 18 ›

Representation Learning

Representation learning is the study of how models learn useful internal descriptions of data. Instead of relying only on hand-designed features, a neural network learns features from examples. These learned features are then used for prediction, reconstruction, retrieval, generation, or control. An input may begin as raw data: $$ x \in \mathbb{R}^D. $$ A model maps it to a representation: $$ h = f_\theta(x), $$ where $$ h \in \mathbb{R}^d....

CPU and GPU Tensors

PyTorch tensors live on devices. A device is the hardware location where tensor storage exists and where tensor operations execute. The most common devices are CPU and CUDA GPU, but modern PyTorch can also target other accelerators depending on the installation and hardware. Device placement is part of tensor correctness. A tensor with the right shape and dtype can still fail if it lives on the wrong device. Training performance...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Gaussian Processes

A Gaussian process is a probabilistic model over functions. Instead of defining a probability distribution over parameters, as in Bayesian neural networks, a Gaussian process defines a probability distribution directly over functions. This gives a flexible nonparametric approach to regression, uncertainty estimation, Bayesian optimization, and probabilistic modeling. Gaussian processes are important in deep learning because they provide:

- principled uncertainty estimates
- exact Bayesian inference in small settings
- connections between kernels and...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Scaling Transformers

Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput. In practice, scaling is controlled by several coupled variables: number of parameters, number of training tokens, model dimension, depth, attention heads, sequence length, batch size, optimizer state, hardware memory, and inference latency. A transformer can be scaled in many ways, but useful scaling is constrained by compute, memory, data quality, and optimization stability. What...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Efficient Attention Methods

Standard self-attention compares every token with every other token. For a sequence of length $T$, this produces a $T \times T$ attention matrix. The cost grows quadratically with sequence length. For short and medium sequences, this is acceptable. For long documents, audio streams, videos, high-resolution images, and long-context language models, quadratic attention becomes the main bottleneck. The standard attention operation is $$ \operatorname{Attention}(Q,K,V) = \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d}} \right)V. $$ If...
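The standard attention formula above can be written out in plain Python for tiny inputs, which makes the $T \times T$ weight matrix visible. This is a sketch for illustration only: it uses nested lists instead of tensors, handles a single sequence with no batch or heads, and the toy values of `Q`, `K`, and `V` are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence stored as nested lists."""
    d = len(Q[0])
    # scores[i][j] compares query i with key j, so the matrix is T x T.
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value rows.
    out = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 tokens, d = 2
out, weights = attention(Q, K, V)
print(len(weights), len(weights[0]))  # dimensions of the attention matrix
```

The `weights` matrix has one row and one column per token, so doubling the sequence length quadruples the number of entries that must be computed and stored.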

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 28 ›

Deep Belief Networks

A deep belief network, or DBN, is a probabilistic generative model formed by stacking multiple layers of latent variables. Deep belief networks were among the first successful deep architectures capable of learning hierarchical representations from unlabeled data. DBNs were historically important because they demonstrated that deep neural networks could be trained effectively. Before residual networks, normalization layers, and modern optimizers became common, training very deep networks directly with backpropagation was...

Writes › Book › Deep Learning with PyTorch › Part IV ›

Chapter 16

Sections
16.1 Word Embeddings
16.2 Subword Tokenization
16.3 Text Classification
16.4 Named Entity Recognition
16.5 Machine Translation
16.6 Question Answering
16.7 Conversational Systems
16.8 Language Modeling

Mean Squared Error

Mean squared error is one of the simplest and most widely used loss functions in supervised learning. It measures the average squared difference between a model’s prediction and the target value. It is mainly used for regression problems, where the target is a real-valued quantity rather than a class label. Examples include predicting house prices, temperatures, distances, demand, ratings, physical measurements, or future numerical values. Suppose a model receives an...
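The definition above, the average squared difference between predictions and targets, can be checked by hand with a short plain-Python sketch. The function name and the sample numbers are made up for illustration.

```python
def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

predictions = [2.5, 0.0, 2.0, 8.0]
targets = [3.0, -0.5, 2.0, 7.0]
mse = mean_squared_error(predictions, targets)
print(mse)  # squared errors are 0.25, 0.25, 0.0, 1.0, so the mean is 0.375
```

Because the errors are squared, the single error of 1.0 contributes four times as much as each error of 0.5, which is why the loss is sensitive to large deviations.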

Loss Functions

A loss function measures how wrong a model’s predictions are. During training, the model produces predictions, the loss function converts prediction error into a scalar, and the optimizer changes the parameters to reduce that scalar. In PyTorch, a loss function usually receives two tensors:

    loss = loss_fn(prediction, target)

The result is usually a scalar tensor, whose shape is:

    torch.Size([])

This scalar is the objective used by backpropagation.

Loss as a Training Objective

A...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Scaling Laws for Language Models

Scaling laws describe how model performance changes as we increase compute, parameter count, dataset size, and training tokens. They matter because large language models are expensive to train. Before spending millions of GPU-hours, we want a principled estimate of what a given training run is likely to achieve. A scaling law usually relates training resources to loss. For language models, the main measured quantity is often cross-entropy loss on held-out...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Video Diffusion Systems

Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time. A video sample can be represented as a tensor: $$ x_0 \in \mathbb{R}^{B \times C \times F \times H \times W} $$ where $B$ is batch size, $C$ is channels, $F$ is the number of frames, $H$ is height, and...

Writes › Book › Deep Learning with PyTorch › Part III ›

Chapter 12

Sections 12.1 Search Spaces 12.2 Grid Search 12.3 Random Search 12.4 Bayesian Optimization 12.5 Population-Based Training 12.6 Neural Architecture Search 12.7 Automated Machine Learning

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 18 ›

Variational Autoencoders

A variational autoencoder, or VAE, is a generative latent variable model trained with neural networks. Like an ordinary autoencoder, it has an encoder and a decoder. Unlike an ordinary autoencoder, it treats the latent representation as a random variable. The goal is not only to compress and reconstruct data. The goal is to learn a latent probability model that can generate new samples. An ordinary autoencoder learns a deterministic code:...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Reverse Denoising Processes

The forward diffusion process gradually transforms data into noise. The reverse process attempts to invert that transformation. Starting from Gaussian noise, the model repeatedly removes noise until a structured sample emerges. The reverse process is the generative component of a diffusion model. During sampling, we begin with $$ x_T \sim \mathcal{N}(0,I) $$ and generate a sequence $$ x_T, x_{T-1}, x_{T-2}, \ldots, x_0. $$ The final tensor $x_0$ is interpreted as...

Tensor Data Types and Devices

A tensor has values, shape, data type, and device placement. Shape tells us how values are arranged. Data type tells us how each value is represented. Device placement tells us where the tensor lives: CPU, GPU, or another accelerator. These properties affect correctness, memory use, speed, and numerical stability. A model can have the right equations and still fail because a tensor has the wrong dtype or lives on the...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 17 ›

Vanishing Gradients in RNNs

Recurrent neural networks were designed to process sequential data by maintaining a hidden state over time. In principle, the hidden state can preserve information from arbitrarily distant positions. In practice, standard recurrent networks often fail to learn long-range dependencies. The main reason is the vanishing gradient problem. During training, gradients must propagate backward through many recurrent steps. Repeated multiplication by small derivatives causes the gradient magnitude to shrink exponentially. When...

Unsupervised Learning

Unsupervised learning studies data without explicit target labels. The dataset contains inputs only: $$ \mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}. $$ There is no given $y$. The model must discover useful structure in the data itself. Unsupervised learning is used for clustering, dimensionality reduction, density estimation, anomaly detection, representation learning, and generative modeling. The Goal of Unsupervised Learning In supervised learning, the target tells the model what to predict. In...

Writes › Book › Deep Learning with PyTorch › Part IX ›

Chapter 29

Sections 29.1 Bayesian Neural Networks 29.2 Variational Inference 29.3 Monte Carlo Methods 29.4 Uncertainty Estimation 29.5 Gaussian Processes 29.6 Practical Probabilistic Modeling in PyTorch 29.7 Summary and Further Reading

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size $$ T \times T. $$ This gives attention complexity $$ O(T^2D), $$ where $D$ is the model dimension. Quadratic scaling becomes expensive for long documents, code repositories, videos, audio streams, retrieval contexts, and agent memory traces. Efficient transformer methods reduce attention cost, memory usage, or latency while preserving as...
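To make the quadratic scaling concrete, the sketch below counts the entries of the $T \times T$ score matrix and the multiply-adds needed to fill it, assuming each entry is a dot product of two $D$-dimensional vectors. The function name `attention_score_cost` and the example sizes are hypothetical, and the count ignores softmax, the value product, and multiple heads.

```python
def attention_score_cost(seq_len, model_dim):
    """Return (score matrix entries, multiply-adds to compute them).

    Assumes every score is a dot product between two vectors of
    length model_dim, which gives T * T * D multiply-adds.
    """
    score_entries = seq_len * seq_len
    multiply_adds = score_entries * model_dim
    return score_entries, multiply_adds


# Hypothetical sizes: doubling the sequence length quadruples both counts.
short_entries, short_ops = attention_score_cost(1024, 64)
long_entries, long_ops = attention_score_cost(2048, 64)
print(short_entries, short_ops)
print(long_entries // short_entries, long_ops // short_ops)
```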

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 12 ›

Population-Based Training

Population-based training, or PBT, is a hyperparameter optimization method that trains many models at the same time. Each model has its own weights and hyperparameters. During training, weak models are replaced or modified using information from stronger models. Grid search, random search, and Bayesian optimization usually treat each trial as a separate run. A configuration is selected before training begins, and it usually stays fixed until the run ends. PBT...

Writes › Book › Deep Learning with PyTorch › Part IV ›

Chapter 17

Sections 17.1 Sequential Data 17.2 Recurrent Computation 17.3 Backpropagation Through Time 17.4 Vanishing Gradients in RNNs 17.5 Bidirectional Networks 17.6 Sequence Modeling Applications

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 12 ›

Random Search

Random search is a hyperparameter optimization method that samples configurations at random from a search space. Instead of evaluating every point on a fixed grid, random search chooses a fixed number of trials and draws each trial independently. This method is simple, but it is often more effective than grid search in deep learning. The reason is that only a small number of hyperparameters usually dominate performance. Random search spends...
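A minimal sketch of the procedure described above: a fixed number of trials, each drawn independently from the search space. The search space, the names `sample_configuration` and `random_search`, and the toy objective are all assumptions made for illustration. In practice the objective would be a validation metric from a real training run.

```python
import math
import random


def sample_configuration(rng):
    """Draw one configuration independently from a hypothetical search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform over [1e-5, 1e-1]
        "batch_size": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.0, 0.5),
    }


def random_search(objective, num_trials, seed=0):
    """Evaluate num_trials random configurations and keep the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_trials):
        config = sample_configuration(rng)
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_objective(config):
    """Stand-in for a validation loss; lowest near learning_rate = 1e-3."""
    return (math.log10(config["learning_rate"]) + 3) ** 2 + config["dropout"]


best_config, best_score = random_search(toy_objective, num_trials=50)
print(best_config, best_score)
```

Sampling the learning rate on a log scale is a design choice in this sketch, not part of the method's definition. It reflects the common situation where the useful range spans several orders of magnitude.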

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Score Matching

Diffusion models can be understood from multiple mathematical viewpoints. One interpretation treats them as probabilistic latent-variable models. Another treats them as iterative denoisers. A third and deeper interpretation connects them to score matching. Score matching explains why denoising diffusion models work. It connects diffusion training to density estimation, stochastic differential equations, and energy-based modeling. Many modern diffusion systems are built directly from the score-based perspective. Probability Densities and Scores Suppose...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 17 ›

Sequential Data

Many learning problems involve data whose meaning depends on order. A sentence is not just a bag of words. A speech signal is not just a collection of sound amplitudes. A stock price record is not just a set of numbers. In each case, the position of each observation matters. Sequential data is data indexed by an ordered variable, usually time or position. We write a sequence as $$ x_1,...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Scientific Deep Learning

Scientific deep learning applies neural networks and differentiable computation to scientific and engineering problems. Unlike many consumer AI systems, scientific models must often obey physical laws, quantify uncertainty, generalize under distribution shift, and produce numerically stable predictions over long time horizons. The goal is not only prediction. Scientific deep learning also aims to support discovery, simulation, optimization, control, and reasoning about complex systems. Applications include: Domain Example tasks Physics Fluid...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Motivation for Attention

Sequence models often need to decide which parts of an input are relevant to a particular output. Attention is the mechanism that makes this decision explicit. Instead of compressing an entire input into one fixed-size vector, an attention layer allows the model to look back at many input positions and form a weighted combination of them. The central problem is selective access. A model may receive a sequence of tokens,...

Datasets and DataLoaders

A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels. In PyTorch, the two central abstractions for this are `Dataset` and `DataLoader`. A `Dataset` defines how to access one example. A `DataLoader` defines how to combine many examples into batches and feed them...

Logistic Regression

Logistic regression is a linear model for classification. It predicts a probability instead of a raw numerical value. Despite its name, logistic regression is mainly used for classification, not regression. For binary classification, each target belongs to one of two classes: $$ y \in \{0, 1\}. $$ The model receives an input vector $$ x \in \mathbb{R}^{d} $$ and computes a score: $$ z = w^\top x + b. $$...
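As a small illustration of the score $z = w^\top x + b$ and how it becomes a probability, the sketch below applies the logistic sigmoid to the score in plain Python. The function name `predict_probability` and the weights are hypothetical, and the direct `math.exp(-z)` form is kept for readability even though it can overflow for very negative scores.

```python
import math


def predict_probability(weights, bias, features):
    """Compute z = w^T x + b, then map it to (0, 1) with the logistic sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters and input, chosen only for illustration.
weights = [0.8, -0.4]
bias = 0.1
features = [1.0, 2.0]
probability = predict_probability(weights, bias, features)
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```

A score of zero maps to a probability of exactly 0.5, so thresholding the probability at 0.5 is equivalent to checking the sign of $z$.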

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

Pooling Layers

Pooling is a downsampling operation used in convolutional neural networks. It reduces the spatial size of a feature map while keeping the most important local information. A pooling layer has no learned weights. It applies a fixed rule, such as taking the maximum or average value inside a local window. Pooling is commonly used after convolution and activation: $$ \text{convolution} \rightarrow \text{activation} \rightarrow \text{pooling}. $$ The purpose is to reduce...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

Efficient Convolutions

Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling. They are important when models must run on mobile devices, edge hardware, browsers, real-time systems, or large-scale training clusters. A standard convolution is powerful, but expensive. If an input has $C_{\text{in}}$ channels and an output has $C_{\text{out}}$ channels, a $k \times k$ convolution uses $$ C_{\text{out}} C_{\text{in}} k^2 $$ weights. It also performs this many multiply-add operations...
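The weight count $C_{\text{out}} C_{\text{in}} k^2$ is easy to check numerically. The helper below is a hypothetical illustration that ignores bias terms. It matches the number of entries in a weight tensor of shape `(out_channels, in_channels, k, k)`.

```python
def conv2d_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a standard k x k convolution, excluding biases."""
    return out_channels * in_channels * kernel_size ** 2


# Hypothetical layer sizes: the same channels with a 3x3 and a 1x1 kernel.
standard = conv2d_weight_count(64, 128, 3)
pointwise = conv2d_weight_count(64, 128, 1)
print(standard, pointwise, standard // pointwise)
```

The comparison shows how strongly the kernel size drives the cost: with the channel counts fixed, a $3 \times 3$ kernel uses nine times as many weights as a $1 \times 1$ kernel.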

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 12 ›

Bayesian Optimization

Bayesian optimization is a hyperparameter optimization method for expensive black-box functions. It is useful when each training run costs enough that random search wastes too much compute. The central idea is to build a probabilistic model of the relationship between hyperparameters and validation performance. This model is called a surrogate model. Instead of blindly sampling configurations, Bayesian optimization uses previous results to decide which configuration to try next. The Optimization...

Tensor Creation and Initialization

Neural networks start with tensors. Some tensors come from data. Others are created by the program: weights, biases, masks, counters, labels, random noise, and temporary buffers. PyTorch provides several ways to create tensors, and each choice controls shape, data type, device placement, and initialization. Tensor creation is simple at the surface, but it affects numerical behavior and training stability. Poor initialization can make a network train slowly, diverge, or produce...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Group and Instance Normalization

Batch normalization and layer normalization are the two most common normalization layers, but they do not cover every setting well. Batch normalization works well when batch statistics are reliable. Layer normalization works well for token and sequence models. Group normalization and instance normalization are useful when we want normalization that works independently of batch size while still respecting channel structure. These methods are most common in computer vision, especially when...

Sigmoid and Hyperbolic Tangent

Activation functions give neural networks their nonlinear structure. Without nonlinear activation functions, a feedforward network made from many linear layers would still compute only a linear transformation. Depth would add parameters, but it would not add expressive power. Two classical activation functions are the logistic sigmoid and the hyperbolic tangent. They played a central role in early neural networks and remain useful for gates, probabilities, and bounded outputs. The Logistic...

Writes › Book › Deep Learning with PyTorch › Part IV ›

Chapter 14

Sections 14.1 Convolution Operations 14.2 Pooling Layers 14.3 Feature Maps 14.4 Padding and Stride 14.5 CNN Architectures 14.6 Residual Networks 14.7 Efficient Convolutions

Tensor-Based Computation

PyTorch programs are tensor programs. A tensor stores numbers in a structured array, and most model computation is expressed as operations over tensors. This includes matrix multiplication, convolution, attention, normalization, indexing, reshaping, reduction, and elementwise arithmetic. A neural network can be understood as a composition of tensor transformations: $$ X \longrightarrow H_1 \longrightarrow H_2 \longrightarrow \cdots \longrightarrow Y. $$ The input tensor $X$ contains data. The intermediate tensors (H_1, H_2,...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Embeddings and Output Projections

After tokenization, text is represented as integer token IDs. A neural language model cannot use these IDs as numerical quantities directly. Token ID 900 is not “larger” or “closer” to token ID 899 in a semantic sense. IDs are just labels. The model first maps token IDs into vectors. These vectors are called embeddings. If the vocabulary size is $|V|$ and the embedding dimension is $d$, the embedding table is...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Subword Tokenization

A language model cannot process raw text directly. Text must first be converted into a sequence of token IDs. The procedure that performs this conversion is called tokenization. Older NLP systems often used word-level tokenization. A sentence was split into words, and each word received an integer ID. This approach is simple, but it has a serious limitation: natural language has too many possible words. Names, numbers, spelling variants,...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 31 ›

Audio-Visual Learning

Audio-visual learning studies models that jointly process sound and visual information. The goal is to learn representations that combine what is seen with what is heard. Humans naturally integrate multiple sensory streams. When watching someone speak, we combine lip motion, facial expression, and sound. When observing a moving car, we associate engine noise with visual motion. Audio-visual models attempt to learn similar correspondences. The core challenge is multimodal alignment across...

Computational Graphs

A computational graph is a graph that represents a numerical computation. The nodes represent values or operations. The edges describe how data flows from one operation to the next. In deep learning, computational graphs are important because they give a precise way to describe forward computation and gradient computation. A neural network takes input tensors, applies a sequence of operations, produces an output tensor, computes a loss, and then differentiates...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 12 ›

Automated Machine Learning

Automated machine learning, or AutoML, refers to systems that automate parts of the model development process. Hyperparameter optimization and neural architecture search are both parts of AutoML, but AutoML is broader. It may include data preprocessing, feature construction, model selection, training recipe selection, ensembling, compression, deployment, and monitoring. In deep learning, AutoML usually means a system that searches over training configurations and model structures under a compute budget. The goal...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 25 ›

Latent Space Manipulation

A latent space is the internal coordinate system learned by an encoder or generative model. In an autoencoder, the encoder maps an input $x$ to a latent representation $z$, and the decoder maps $z$ back to an output $\hat{x}$: $$ z = f_\theta(x), \qquad \hat{x} = g_\phi(z). $$ Latent space manipulation studies what happens when we edit $z$ before decoding it. Instead of changing the input directly, we change its...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 31 ›

Vision-Language Models

A vision-language model learns a joint representation of images and text. Its purpose is to connect visual information with natural language so that a model can compare, retrieve, caption, answer questions about, or generate images from text. Traditional computer vision models map an image to a fixed label, such as cat, car, or tumor. Traditional language models operate only on tokens. A vision-language model combines both modalities....

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Diffusion Transformers

Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships. However, transformers became increasingly attractive because they scale effectively with model size, support flexible conditioning, and capture long-range dependencies more naturally than convolutional architectures. Diffusion Transformers, often abbreviated DiTs, replace or augment convolutional U-Nets with transformer-based architectures. Instead of treating images as grids processed by...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 17 ›

Backpropagation Through Time

Recurrent networks reuse the same parameters at every time step. This makes them compact, but it also changes how gradients are computed. A parameter such as $W_{hh}$ affects not one layer, but every recurrent transition in the sequence. Backpropagation through time, usually abbreviated BPTT, is the method used to train recurrent neural networks. It applies ordinary backpropagation to the unrolled recurrent graph. The Unrolled Graph A recurrent network is defined...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Text Classification

Text classification is the task of assigning one or more labels to a piece of text. The input may be a sentence, a paragraph, a document, a conversation, or a search query. The output is a class label, a set of labels, or a probability distribution over labels. Examples include sentiment analysis, spam detection, topic classification, intent detection, toxicity detection, language identification, legal document tagging, product review classification, and support...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Layer Normalization

Layer normalization is a normalization method that normalizes features within each individual example. Unlike batch normalization, it does not use statistics from other examples in the same mini-batch. This makes it especially useful for sequence models, transformers, recurrent networks, and settings where batch sizes are small or variable. For an input vector $$ x \in \mathbb{R}^{D}, $$ layer normalization computes the mean and variance across the feature dimension of that...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

Early Stopping

Neural networks are usually trained iteratively. An optimizer repeatedly updates model parameters to reduce the training loss. If training continues indefinitely, the model often becomes increasingly specialized to the training set. Eventually it may begin fitting noise, accidental correlations, and sampling artifacts instead of learning general structure. Early stopping is a regularization method that halts training before the model begins to overfit. Instead of choosing the final model from the...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Positional Encoding

Self-attention compares tokens by content. By itself, it has no built-in notion of token order. If a sequence is permuted, self-attention follows the permutation. It can still compare all tokens, but it cannot tell whether a token came first, second, or last unless position information is added. This matters because many sequences are order-sensitive: "dog bites man" versus "man bites dog". These two sequences contain the same words, but they have...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 31 ›

Retrieval Systems

A retrieval system finds relevant information from an external memory source. Instead of storing all knowledge directly inside neural network parameters, the model searches a database, vector index, document collection, or memory store during inference. Retrieval systems are fundamental to modern foundation models because parametric memory is limited. A model’s weights cannot reliably store all facts, documents, conversations, codebases, or world knowledge. Retrieval provides dynamic access to external information. The...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Conversational Systems

A conversational system processes dialogue between users and machines. The system receives one or more conversational turns and generates a response. Unlike single-turn NLP tasks, dialogue systems must maintain context across multiple exchanges. Example:

User: What is PyTorch?
Assistant: PyTorch is a deep learning framework developed by Meta.
User: Does it support GPUs?
Assistant: Yes. PyTorch supports CUDA and other accelerator backends.

The second user message contains the...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Stable Training in Deep Networks

Stable training means that a model can make steady progress without numerical collapse, uncontrolled gradients, or large oscillations in the loss. Deep networks are sensitive systems. A small problem in scale, initialization, normalization, data preprocessing, optimizer settings, or precision can compound across many layers. This section combines the ideas from the previous sections into a practical view of stable PyTorch training. What Stability Means A training run is stable when...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 31 ›

Long-Horizon Agents

A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives. A single model call answers one prompt. An agent loop extends this into a process: $$ \text{observe} \rightarrow \text{plan} \rightarrow \text{act} \rightarrow \text{observe} \rightarrow \cdots $$ The word “long-horizon” means the task cannot be solved reliably in one...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Summary and Further Reading

Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model. The output is a weighted combination of values, where the weights are learned from query-key compatibility. The chapter began with the motivation for attention: fixed-vector sequence representations are too restrictive for many tasks. Attention removes this bottleneck by giving each output position direct...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 25 ›

Representation Learning

Representation learning is the study of how a model converts raw data into useful internal variables. In an autoencoder, the representation is the latent code $z$. More generally, it may be a hidden state, embedding vector, feature map, memory state, graph embedding, or sequence of token representations. The central question is simple: what information should the model keep, and in what form should it keep it? A raw input often...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Transformer Decoders

A transformer decoder maps a partial output sequence to predictions for the next token or next output step. Unlike an encoder, a decoder usually cannot see future positions. It must produce each representation using only the current and previous tokens. Decoder-only transformers are the core architecture behind modern autoregressive language models. Encoder-decoder transformers also use decoder blocks, but those decoders include cross-attention to read from encoder outputs. Given token embeddings...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 18 ›

Sparse Autoencoders

An ordinary autoencoder compresses information by forcing the latent representation to have fewer dimensions than the input. A sparse autoencoder uses a different idea. Instead of requiring a small latent dimension, it requires that only a small number of latent units be active for any given input. The latent representation may still have many dimensions: $$ z \in \mathbb{R}^d, $$ with $d$ potentially larger than the input dimension. The constraint...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 25 ›

Sparse Autoencoders

An undercomplete autoencoder constrains the representation by reducing the latent dimension. A sparse autoencoder imposes a different constraint. Instead of requiring a small latent vector, it requires that only a small fraction of latent units be active for any input. This distinction matters. A sparse autoencoder may use a large latent dimension while still forcing the representation to remain selective and structured. The model learns a distributed code in which...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Language Modeling

Language modeling is the task of predicting text sequences. A language model assigns probabilities to sequences of tokens and learns the statistical structure of language. Given a token sequence: $$ x = (x_1, x_2, \dots, x_T), $$ a language model estimates: $$ P(x_1, x_2, \dots, x_T). $$ Modern language models are the foundation of many NLP systems, including text generation, dialogue systems, translation systems, summarizers, code assistants, and retrieval-augmented systems....

Momentum and Adaptive Methods

Stochastic gradient descent uses the current minibatch gradient to update the parameters. This is simple and effective, but the update can be noisy. Momentum and adaptive optimization methods modify the basic SGD update to make training faster, smoother, or less sensitive to feature scale. The general training loop stays the same: `optimizer.zero_grad()`, then `loss.backward()`, then `optimizer.step()`. The difference is inside `optimizer.step()`. Different optimizers use different rules for converting gradients into parameter...

The Chain Rule

The chain rule is the mathematical rule that makes backpropagation possible. Neural networks are built by composing many functions. The chain rule tells us how to differentiate such compositions. A deep network may contain hundreds or thousands of operations: matrix multiplications, additions, nonlinear activations, normalization layers, attention blocks, losses, and regularization terms. PyTorch does not need a separate derivative formula for the entire network. It only needs derivative rules for...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Conclusion

Deep learning systems have progressed from small task-specific models to large multimodal foundation systems capable of perception, language understanding, reasoning, planning, generation, and interaction. This progress emerged from the combination of several forces: larger datasets, scalable architectures, efficient hardware, distributed training, self-supervised learning, improved optimization, and better software systems. Modern AI is therefore both a scientific field and a systems discipline. Progress depends not only on algorithms, but also on data...

Writes › Book › Deep Learning with PyTorch › Part VII ›

Chapter 23

Sections 23.1 Pretraining Objectives 23.2 Scaling Laws for Language Models 23.3 Instruction Tuning 23.4 Reinforcement Learning from Human Feedback 23.5 Constitutional Alignment 23.6 In-Context Learning 23.7 Tool Use and Agents 23.8 Retrieval-Augmented Generation

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

Mixup and CutMix

Mixup and CutMix are data augmentation methods that create new training examples by combining two examples and their labels. They regularize the model by discouraging overly sharp decision boundaries. Both methods are usually used for classification. They replace hard one-example training with interpolated training examples. Motivation A classifier trained only on ordinary examples may learn decision boundaries that are too sharp. It may assign extreme confidence to regions of input...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Training Foundation Models

Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks. Examples include large language models, multimodal transformers, vision foundation models, audio-language systems, and general-purpose embedding models. Training these systems requires coordinated advances in optimization, distributed systems, data engineering, numerical stability, infrastructure reliability, and hardware utilization. Foundation model training differs from ordinary deep learning mainly in scale. The underlying mathematical principles remain similar, but the operational...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Tokenization Systems

A language model does not read raw text directly. It reads tokens. Tokenization is the process that maps a string of text into a sequence of discrete symbols, and later maps generated symbols back into text. For a text string $$ s = \text{"Deep learning works."} $$ a tokenizer produces a token sequence $$ x_{1:T} = (x_1, x_2, \ldots, x_T). $$ The model then operates on token IDs, not characters...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Monte Carlo Methods

Monte Carlo methods approximate difficult mathematical quantities using random samples. In probabilistic deep learning, they are used when expectations, integrals, or posterior predictive distributions cannot be computed exactly. The basic idea is simple. If a quantity is an expectation under a distribution, we can estimate it by drawing samples from that distribution and averaging the result. Expectations as Averages Many probabilistic learning problems require computing an expectation: $$ \mathbb{E}_{p(z)}[f(z)] =...
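The idea of estimating an expectation by averaging samples can be sketched with the standard library alone. The example below assumes a standard normal distribution for $z$ and the function $f(z) = z^2$, whose exact expectation is 1. The name `monte_carlo_expectation`, the seed, and the sample count are hypothetical choices.

```python
import random


def monte_carlo_expectation(f, sampler, num_samples):
    """Estimate E[f(z)] by averaging f over independent draws from sampler."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler())
    return total / num_samples


# Assumed setup: z is standard normal and f(z) = z^2, so the true value is 1.
rng = random.Random(0)
estimate = monte_carlo_expectation(
    f=lambda z: z * z,
    sampler=lambda: rng.gauss(0.0, 1.0),
    num_samples=100_000,
)
print(estimate)
```

The estimate is itself random. Its error typically shrinks in proportion to one over the square root of the number of samples, so a tenfold gain in accuracy needs roughly a hundred times more samples.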

Writes › Book › Deep Learning with PyTorch › Part V ›

Chapter 18

Sections 18.1 Dimensionality Reduction 18.2 Sparse Autoencoders 18.3 Denoising Autoencoders 18.4 Variational Autoencoders 18.5 Latent Space Manipulation 18.6 Representation Learning

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 18 ›

Dimensionality Reduction

Deep learning often begins with data that has many coordinates. An image may contain hundreds of thousands of pixel values. A document may be represented by thousands of token counts. A biological measurement may contain expressions for tens of thousands of genes. A user profile, a graph node, or a sensor trace may also have a high-dimensional representation. Dimensionality reduction is the problem of replacing a high-dimensional representation with a...

Backpropagation

Backpropagation is the algorithm used to compute gradients in neural networks efficiently. It applies reverse-mode differentiation to a computational graph. The algorithm evaluates the network forward to compute predictions and loss, then traverses the graph backward to compute gradients with respect to parameters. Without backpropagation, training modern neural networks would be computationally infeasible. A large language model may contain billions of parameters. Backpropagation computes all parameter gradients in roughly the...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 28 ›

Probabilistic Circuits

Probabilistic circuits are tractable probabilistic models built from simple computational graphs. They represent probability distributions using a network of sum, product, and leaf nodes. The main goal is to keep probabilistic inference efficient while still allowing expressive models. They are also known under related names such as sum-product networks, arithmetic circuits, and tractable probabilistic models. The exact terminology depends on the architecture and the literature, but the central idea is...

GPUs and Accelerators

Deep learning became practical at scale because neural network computation maps well to parallel hardware. Training a model requires many repeated tensor operations, especially matrix multiplication, convolution, normalization, and attention. These operations contain large numbers of independent arithmetic computations. GPUs and specialized accelerators execute these computations in parallel. A CPU can run deep learning models, but it is usually optimized for general-purpose control flow, low-latency execution, and a smaller number...

Bias and Variance

Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model. A model with high bias makes strong simplifying assumptions. It tends to underfit. A model with high variance changes too much when the training data changes. It tends to overfit. The practical goal is to find a model that...

Installing and Configuring PyTorch

A PyTorch installation must match three things: the Python environment, the operating system, and the available hardware. A CPU-only installation is enough for small examples and early chapters. GPU support becomes important once models, datasets, and training loops grow. The goal of installation is not only to make import torch work. The goal is to produce a reproducible environment where code runs consistently, dependencies are isolated, and hardware acceleration is...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Multi-Head Attention

Multi-head attention runs several attention operations in parallel. Each attention operation is called a head. Each head has its own query, key, and value projections. The outputs of all heads are then concatenated and projected back into the model dimension. Single-head attention gives each token one attention distribution. Multi-head attention gives each token several attention distributions. This lets the model read different kinds of context at the same time. For...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Text-to-Image Systems

Text-to-image generation aims to synthesize images from natural language descriptions. A model receives a prompt such as: “A red fox sitting in snow during sunrise” and generates an image consistent with the description. Modern text-to-image systems are usually built from latent diffusion models conditioned on text embeddings. These systems combine: Component Purpose Text encoder Convert language into embeddings Diffusion model Generate latent representations Decoder Convert latents into images Guidance mechanism...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Model Parallelism

Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs. This is useful when the model is too large to fit on one device. Data parallelism replicates the model, so each device must hold a full copy. Model parallelism removes this requirement by partitioning the model itself. A simple example is a network with...

Writes › Book › Deep Learning with PyTorch ›

Part IV

Chapters Chapter 14 Chapter 15 Chapter 16 Chapter 17

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Scaling Laws

Modern deep learning systems often improve when we increase three quantities: model size, dataset size, and compute. This empirical regularity is called a scaling law. A scaling law describes how model performance changes as some resource increases. The resource may be the number of parameters, the number of training tokens, the amount of compute, the dataset size, or the inference-time budget. The performance measure may be loss, accuracy, error rate,...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 31 ›

Unified Foundation Models

A unified foundation model is a neural network trained across many modalities, tasks, and domains using a shared architecture and shared representations. Instead of building separate systems for language, vision, audio, robotics, or reasoning, a unified model attempts to learn a general computational interface that can process all of them. The central idea is that diverse forms of data can be represented as sequences of tokens and processed by a...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 30 ›

Model Editing

Model editing modifies a trained model so that it changes a specific behavior while preserving most other behaviors. The goal is to update knowledge, remove undesirable outputs, correct mistakes, or alter policies without retraining the entire model. For example, suppose a language model answers: “The capital of Australia is Sydney.” A model edit attempts to change this fact so that the model answers: “The capital of Australia is Canberra.” The...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Latent Diffusion

Early diffusion models operated directly in pixel space. A model generated images by iteratively denoising tensors such as [B, 3, 512, 512] where $B$ is the batch size and the remaining dimensions represent RGB images. Although these models produced high-quality outputs, they were computationally expensive. Every denoising step required neural network computation over large high-resolution tensors. Training and inference therefore consumed large amounts of memory, compute, and time. Latent diffusion...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

Stochastic Depth

Stochastic depth regularizes deep residual networks by randomly skipping residual branches during training. It is also commonly called drop path. Unlike ordinary dropout, which drops individual activations, stochastic depth drops a whole computation path. A standard residual block computes $$ y = x + F(x), $$ where $x$ is the input and $F(x)$ is the residual branch. With stochastic depth, the branch is randomly kept or removed: $$ y =...
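
A minimal sketch of the keep-or-drop rule, written with plain Python lists so it runs without dependencies. The function name `stochastic_depth_block` and the `keep_prob` parameter are hypothetical. The inference-time behavior shown here (scaling the branch by `keep_prob`) is one common convention and is an assumption; other implementations instead rescale the kept branch during training.

```python
import random


def stochastic_depth_block(x, residual_fn, keep_prob, training, rng):
    """Residual block y = x + F(x) whose branch F is randomly skipped in training."""
    if not training:
        # Assumed convention: at inference, weight the branch by its keep probability.
        return [xi + keep_prob * fi for xi, fi in zip(x, residual_fn(x))]
    if rng.random() < keep_prob:
        # Branch kept: ordinary residual computation.
        return [xi + fi for xi, fi in zip(x, residual_fn(x))]
    # Branch dropped: the block reduces to the identity.
    return list(x)


rng = random.Random(0)
branch = lambda v: [2.0 * vi for vi in v]  # stand-in for the residual branch F
print(stochastic_depth_block([1.0, 2.0], branch, keep_prob=0.8, training=True, rng=rng))
```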

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Question Answering

Question answering, often abbreviated QA, is the task of producing an answer to a question. The input may contain only a question, or it may contain both a question and a context passage. The output may be a span from the passage, a generated sentence, a choice from several candidates, or a structured value. Examples: Question Context Answer Who invented the World Wide Web? Tim Berners-Lee invented the World...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Parameter Initialization

A neural network begins training with parameters that have not yet been learned from data. These initial values matter. They determine the scale of activations in the forward pass and the scale of gradients in the backward pass. Poor initialization can make training slow, unstable, or impossible. Parameter initialization is the rule used to choose the starting values of weights and biases before optimization begins. A layer usually computes $$...

Overfitting and Underfitting

Overfitting and underfitting describe two common ways a model can fail. A model underfits when it learns too little from the training data. A model overfits when it learns the training data too specifically and performs poorly on new data. The goal is to find a model that captures the stable patterns in the data without memorizing accidental details. The Central Problem During training, a model minimizes loss on the...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Efficient AI Systems

Modern deep learning systems are constrained by compute, memory, bandwidth, latency, and energy. As models become larger, efficiency becomes a central engineering problem rather than a secondary optimization. An efficient AI system maximizes useful capability per unit of resource. The resource may be GPU hours, memory capacity, power consumption, inference latency, network bandwidth, storage size, or monetary cost. Efficiency matters at every scale. A mobile vision model must run under...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Data Parallelism

Data parallelism is the simplest and most widely used form of distributed deep learning. The idea is to keep a copy of the same model on several devices, feed each device a different part of the batch, compute gradients independently, and then combine the gradients before updating the parameters. Suppose we have a model with parameters $\theta$, a loss function $\ell$, and a mini-batch $$ B = \{(x_1, y_1), \ldots,...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Dialogue Systems

A dialogue system is a model or collection of models that interacts with users through natural language. The system receives a sequence of user and assistant messages and produces a response conditioned on the conversation history. Dialogue systems are used in chat assistants, customer support, tutoring systems, coding assistants, search interfaces, recommendation systems, voice assistants, collaborative agents, and multimodal systems. A dialogue system must do more than generate fluent text....

Softmax Regression

Softmax regression extends logistic regression from two classes to many classes. It is the standard linear model for multiclass classification. Suppose an input example has feature vector $$ x \in \mathbb{R}^d $$ and a label $$ y \in \{0,1,\ldots,K-1\}. $$ There are $K$ possible classes. The model predicts a probability distribution over those classes. Class Scores Softmax regression computes one score for each class. These scores are called logits. For...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Robotics and Embodied AI

Robotics and embodied AI study learning systems that act in the physical world. A robot must perceive its environment, estimate its own state, decide what to do, and execute actions through motors or actuators. Unlike a pure text or image model, an embodied system is coupled to the world through sensing and action. A robot is not only a predictor. It is an agent inside a feedback loop. $$ \text{observation}...

Writes › Book › Deep Learning with PyTorch ›

Part VII

Chapters Chapter 23 Chapter 24

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Fault Tolerance

Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster. As training runs become larger and longer, the probability of failure approaches certainty. Fault tolerance is the collection of techniques that allow training to recover from these failures without losing excessive work. A small model trained for one hour on one GPU may not need sophisticated recovery....

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 30 ›

Distribution Shift

A distribution shift occurs when the data seen at deployment differs from the data used during training. The model may still receive inputs with the same shape and type, but the statistical structure of those inputs has changed. A classifier trained on clean product photos may fail on blurry phone images. A speech model trained mostly on studio recordings may degrade in noisy rooms. A medical model trained in one...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Multi-Node Training

Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model. A node is one physical or virtual machine. A typical node may contain 4 or 8 GPUs. If we train on 4 nodes with 8 GPUs each, the job uses 32 GPUs. $$ \text{world size} = \text{number of nodes} \times \text{GPUs per...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

Padding and Stride

Padding and stride control the spatial size of convolutional feature maps. Kernel size controls how large a local window the layer sees. Padding controls what happens near the boundary. Stride controls how far the kernel moves between neighboring output positions. These parameters determine the mapping $$ [B, C_{\text{in}}, H, W] \rightarrow [B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}}]. $$ A correct CNN implementation requires careful tracking of these shapes. Why Padding Is Needed...

Limits of Linear Decision Boundaries

A linear classifier separates classes using a hyperplane. In two dimensions this boundary is a line. In three dimensions it is a plane. In higher dimensions it is a hyperplane. This simple geometry makes linear models efficient and interpretable. It also limits what they can represent. A linear classifier can only divide the input space into two half-spaces for binary classification. Many real patterns require curved, disconnected, hierarchical, or context-dependent...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 15 ›

Transformer Encoders

A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation. Each output position can contain information from every visible input position. Encoder models are useful when the full input is available before prediction. Examples include text classification, named entity recognition, sentence embedding, document ranking, image classification with vision transformers, and many multimodal encoders. Given an input tensor $$ X \in \mathbb{R}^{B \times...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Constitutional Alignment

Reinforcement learning from human feedback improves model behavior using preference data. However, collecting large amounts of human feedback is expensive, slow, and difficult to scale consistently. Constitutional alignment addresses this problem by replacing much of the direct human supervision with explicit principles and AI-generated critique. Instead of asking humans to rank every response, we define a constitution: a set of behavioral rules, norms, or objectives. The model then uses these...

Limits of Linear Models

Linear models are the first useful class of predictive models in deep learning. They introduce weighted sums, biases, logits, losses, gradients, and optimization. They are also the simplest examples of models trained by minimizing an objective function. Their limits explain why deep networks are needed. A linear model has the form $$ f(x) = w^\top x + b. $$ For regression, this value is used directly. For binary classification, it...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

Feature Maps

A feature map is the spatial output produced by a convolutional filter. In a convolutional neural network, each output channel can be read as a map of where a learned feature appears in the input. If a filter detects vertical edges, its feature map contains high values at locations where vertical edges are present. If a filter detects a texture, its feature map contains high values where that texture appears....

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Statistical Language Models

A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence $$ x_{1:T} = (x_1, x_2, \ldots, x_T), $$ where each $x_t$ belongs to a vocabulary $V$. A language model defines a probability distribution over such sequences: $$ p(x_{1:T}). $$ The central question is simple: how likely is...

Reinforcement Learning Overview

Reinforcement learning studies how an agent learns to act through interaction with an environment. Unlike supervised learning, the agent does not receive a correct action for every situation. Instead, it receives feedback through rewards. The basic interaction is: $$ \text{agent} \longrightarrow \text{action} \longrightarrow \text{environment} \longrightarrow \text{reward and next state}. $$ The agent’s goal is to choose actions that maximize total reward over time. Agents, Environments, and Rewards A reinforcement learning...

Writes › Book › Deep Learning with PyTorch › Part III ›

Chapter 13

Sections 13.1 Classification Pipelines 13.2 Transfer Learning 13.3 Fine-Tuning Pretrained Models 13.4 Data Augmentation Strategies 13.5 Large-Scale Training 13.6 Calibration and Confidence

Writes › Book › Deep Learning with PyTorch › Part IX ›

Chapter 32

Sections 32.1 Scaling Laws 32.2 Efficient AI Systems 32.3 Scientific Deep Learning 32.4 Robotics and Embodied AI 32.5 Open Research Problems 32.6 Conclusion 32.7 Exercises 32.8 Further Reading

Learning Rate Scheduling

The learning rate controls the size of each parameter update. In early training, larger updates can help the model move quickly into a useful region. Later in training, smaller updates can help the model settle into a better solution. A learning rate schedule changes the learning rate during training. The optimizer update has the form $$ \theta_t = \theta_{t-1} - \eta_t g_t, $$ where $\eta_t$ is the learning rate at...
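
The update rule with a step-dependent learning rate can be illustrated on a one-parameter problem. This is a sketch under stated assumptions: the exponential decay schedule in `lr_at`, its constants, and the toy objective $f(\theta) = \theta^2$ are all chosen for illustration and are not taken from the section.

```python
def lr_at(step, base_lr=0.4, decay=0.9):
    """Hypothetical exponential decay schedule: eta_t = base_lr * decay**t."""
    return base_lr * decay ** step


theta = 5.0
for t in range(1, 51):
    grad = 2.0 * theta                # g_t: gradient of f(theta) = theta^2
    theta = theta - lr_at(t) * grad   # theta_t = theta_{t-1} - eta_t * g_t

print(lr_at(1), lr_at(50))  # the learning rate shrinks as training proceeds
print(theta)                # theta has moved close to the minimum at 0
```

Early steps use a large learning rate and remove most of the error quickly. Later steps use a small one and make only fine adjustments.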

Scalars, Vectors, Matrices, and Tensors

Deep learning represents data and computation using arrays of numbers. These arrays may have different numbers of axes. A single number is a scalar. A one-dimensional array is a vector. A two-dimensional array is a matrix. An array with any number of axes is a tensor. This language is used throughout deep learning because neural networks operate on numerical data. Images, text, audio, graphs, and actions must all be encoded...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

Label Smoothing

Label smoothing is a regularization method for classification. It replaces hard target labels with softened target distributions. Instead of telling the model that the correct class has probability $1$ and every other class has probability $0$, label smoothing assigns most probability mass to the correct class and a small amount to the other classes. For a classification problem with $K$ classes, a one-hot target for class $y$ is $$ q_k...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Vanishing and Exploding Gradients

Deep networks train by sending information in two directions. The forward pass sends activations from the input layer to the output layer. The backward pass sends gradients from the loss back to earlier layers. Stable training requires both signals to remain numerically useful. When gradients become extremely small as they move backward through the network, we say gradients vanish. When gradients become extremely large, we say gradients explode. Both problems...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 28 ›

Flow-Based Models

Flow-based models are generative models that learn an invertible transformation between a simple probability distribution and a complex data distribution. Unlike many other generative models, flow-based systems provide exact likelihood computation, exact latent-variable inference, exact sampling, and invertible mappings. A flow model transforms data into latent variables through a sequence of reversible functions. If the transformation is invertible and differentiable, probability densities can be computed exactly using the change-of-variables formula. Flow-based...

Random Tensor Generation

Random tensors are used throughout deep learning. They initialize parameters, shuffle examples, sample noise, apply dropout, augment data, and generate outputs from probabilistic models. PyTorch provides direct tools for drawing samples from common probability distributions. A random tensor has the same structural properties as any other tensor: shape, dtype, device, layout, and gradient behavior. The difference is that its values are produced by a pseudorandom number generator. Uniform Random Tensors...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Bayesian Neural Networks

A Bayesian neural network is a neural network whose parameters are treated as random variables rather than fixed unknown constants. In an ordinary neural network, training produces one set of weights. After training, the model makes predictions using those weights. In a Bayesian neural network, training produces a probability distribution over possible weights. Prediction then averages over many plausible networks, weighted by how well each network explains the data. The...

What Is Deep Learning

Deep learning is a branch of machine learning that studies models built from many layers of learned computation. These models are usually called neural networks. A neural network receives data as input, transforms it through a sequence of mathematical operations, and produces an output such as a class label, a probability distribution, a generated image, or the next token in a sentence. The word deep refers to the use of...

Data Leakage and Experimental Design

Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are. A model with leakage may appear accurate in experiments and fail after deployment. This is one of the most common reasons machine learning systems disappoint in production. What Data Leakage Means A clean experiment separates information by role. Training data is used...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 13 ›

Calibration and Confidence

A classifier returns scores. Users often interpret those scores as confidence. This interpretation is safe only when the scores are calibrated. A calibrated model assigns probabilities that match empirical correctness. If a model predicts class “cat” with probability 0.8 on many images, then about 80 percent of those predictions should be correct. If only 60 percent are correct, the model is overconfident. If 95 percent are correct, the model is...
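
The 0.8 versus 60 percent example can be checked numerically. The sketch below compares the mean predicted confidence of a group of predictions with their empirical accuracy. The function name `calibration_gap` and the ten made-up predictions are assumptions for illustration only.

```python
def calibration_gap(confidences, correct):
    """Mean confidence minus empirical accuracy; a positive gap means overconfidence."""
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical data: ten predictions made with confidence 0.8, six of them correct.
confidences = [0.8] * 10
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

gap = calibration_gap(confidences, correct)
print(gap)  # positive, so this group of predictions is overconfident
```

In practice the same comparison is usually repeated over several confidence ranges, since a model can be well calibrated at some confidence levels and poorly calibrated at others.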

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 28 ›

Boltzmann Machines

A Boltzmann machine is a probabilistic neural network that defines a probability distribution over binary variables. It belongs to the family of energy-based models. Instead of computing an output directly from an input, it assigns an energy to each possible configuration of variables. Configurations with low energy receive high probability. Configurations with high energy receive low probability. The central idea is simple: learning means shaping an energy surface so that...

Jacobians and Hessians

Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters. Some problems require a more detailed view of derivatives. Jacobians describe first derivatives of vector-valued functions. Hessians describe second derivatives of scalar-valued functions. These objects are central to optimization theory, sensitivity analysis, uncertainty estimation, curvature-aware training, and some advanced methods in meta-learning and scientific machine learning. From Derivatives to...

The PyTorch Ecosystem

PyTorch is a deep learning platform built around tensors, automatic differentiation, and composable neural network modules. It is used for research, production training, inference, computer vision, natural language processing, audio, reinforcement learning, graph learning, and large-scale model development. The central idea is simple: PyTorch lets you write numerical programs using tensors, then automatically computes gradients through those programs. This makes it suitable for deep learning, where training requires repeated gradient...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Information Retrieval

Information retrieval is the task of finding relevant items from a collection in response to a query. The collection may contain web pages, documents, passages, emails, tickets, code files, papers, products, images, or database records. The query may be a few keywords, a natural language question, a document, or an embedding. In natural language processing systems, information retrieval is used for search, question answering, recommendation, document discovery, retrieval-augmented generation, duplicate...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 30 ›

Mechanistic Interpretability

Mechanistic interpretability studies neural networks by treating them as learned computational systems. The goal is to identify the internal mechanisms that produce model behavior: features, circuits, attention heads, neurons, residual stream directions, and layer-to-layer transformations. Attribution methods ask which input parts contributed to an output. Mechanistic interpretability asks a deeper question: what algorithm did the model implement internally? For example, a language model may answer a factual question correctly. Attribution...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 27 ›

Noise Schedules

A diffusion model needs a rule for how noise increases during the forward process. This rule is called the noise schedule. It determines how quickly clean data is corrupted, how much signal remains at each timestep, and how difficult each denoising task becomes. The forward process is $$ q(x_t\mid x_{t-1}) = \mathcal{N} \left( x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I \right). $$ The sequence $$ \beta_1,\beta_2,\ldots,\beta_T $$ is the noise schedule. Each $\beta_t$...
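
How much signal remains at each timestep follows from the forward process above: composing the Gaussian steps gives $x_t \mid x_0 \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I)$ with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$. The sketch below computes $\bar{\alpha}_t$ for a linear schedule. The linear form, the endpoints `1e-4` and `0.02`, and $T = 1000$ are assumptions chosen as a familiar example, and the helper names are hypothetical.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed schedule: beta_t rises linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every timestep t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)
print(alpha_bars[0], alpha_bars[499], alpha_bars[-1])  # signal fraction decays toward 0
```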

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Masked Language Modeling

Masked language modeling trains a model to recover missing tokens from their surrounding context. Instead of predicting only the next token, the model receives a corrupted sequence and learns to reconstruct selected hidden tokens. Given an original sequence $$ x_{1:T} = (x_1, x_2, \ldots, x_T), $$ some tokens are replaced with a special mask token. The model predicts the original tokens at the masked positions. For example: $$ \text{deep learning...

Writes › Book › Deep Learning with PyTorch ›

Part V

Chapters Chapter 18 Chapter 19 Chapter 20

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 13 ›

Large-Scale Training

Large-scale training means training models on datasets, model sizes, or hardware configurations that exceed a simple single-GPU workflow. In image classification, this often means millions of images, large backbones, long schedules, high-resolution inputs, or multi-GPU training. The goal is not only to make training faster. The goal is to keep optimization stable, data loading efficient, validation reliable, and checkpoints recoverable while the system grows. What Changes at Scale A small...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Autoregressive Modeling

Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence. Given a token sequence $$ x_{1:T} = (x_1, x_2, \ldots, x_T), $$ an autoregressive language model factorizes its probability as $$ p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}). $$ This is the same chain-rule factorization introduced in statistical language modeling. The difference is the parameterization....
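
The chain-rule factorization turns a sequence probability into a sum of conditional log probabilities. The sketch below uses a tiny hand-written table in place of a neural network. The table `COND`, its numbers, and the start marker `<s>` are hypothetical, and for brevity the toy model conditions only on the previous token rather than on the full prefix $x_{1:t-1}$.

```python
import math

# Hypothetical toy model: p(next token | previous token). "<s>" marks the start.
COND = {
    "<s>": {"deep": 0.6, "learning": 0.4},
    "deep": {"deep": 0.1, "learning": 0.9},
    "learning": {"deep": 0.5, "learning": 0.5},
}


def sequence_log_prob(tokens):
    """Sum log p(x_t | context) over the sequence, following the chain rule."""
    log_prob = 0.0
    prev = "<s>"
    for tok in tokens:
        log_prob += math.log(COND[prev][tok])
        prev = tok
    return log_prob


lp = sequence_log_prob(["deep", "learning"])
print(math.exp(lp))  # product of the two conditionals, 0.6 * 0.9
```

Working in log space avoids numerical underflow when many small conditional probabilities are multiplied over a long sequence.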

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Text Classification

Text classification assigns one or more labels to a piece of text. The input may be a sentence, paragraph, document, review, message, query, or conversation turn. The output is a class label, a probability distribution over labels, or a set of active labels. Common examples include sentiment analysis, spam detection, topic classification, intent detection, toxicity detection, language identification, document routing, and product category prediction. A text classifier has the same...

Indexing, Slicing, and Tensor Views

Indexing and slicing select parts of a tensor. These operations are used constantly in PyTorch: selecting batches, cropping images, extracting token positions, applying masks, gathering logits, and rearranging model outputs. A tensor operation may either create a view or a copy. A view shares storage with the original tensor. A copy owns separate storage. This distinction matters for memory use, performance, and mutation. Basic Indexing A tensor entry is selected...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Cross-Lingual Transfer

Cross-lingual transfer is the ability of a model trained or adapted in one language to work in another language. It is important because labeled data is unevenly distributed across languages. English has many datasets and benchmarks. Many other languages have limited annotation, limited digital text, or domain-specific data that is expensive to label. The goal is to share knowledge across languages. A model may learn sentiment classification, named entity recognition,...

Writes › Book › Deep Learning with PyTorch › Part VIII ›

Chapter 28

Sections 28.1 Boltzmann Machines 28.2 Restricted Boltzmann Machines 28.3 Deep Belief Networks 28.4 Energy-Based Models 28.5 Flow-Based Models 28.6 Probabilistic Circuits

Weight Decay and Regularization

Training loss measures how well a model fits the training data. A model with many parameters can sometimes fit the training data too closely. It may learn noise, accidental correlations, or details that do not hold for new examples. This problem is overfitting. Regularization changes training so that the model is encouraged to learn simpler or more stable solutions. Weight decay is one of the most common regularization methods in...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

L1 and L2 Regularization

A neural network is trained by minimizing a loss function. For a supervised learning problem, this loss measures how far the model predictions are from the target values. If the model has parameters $\theta$, and the training data are denoted by $\mathcal{D}$, the usual training objective has the form $$ \min_\theta \; \mathcal{L}_{\text{data}}(\theta). $$ The term $\mathcal{L}_{\text{data}}$ is the data-fitting loss. It may be mean squared error for regression, cross-entropy...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 18 ›

Denoising Autoencoders

A denoising autoencoder learns to recover a clean input from a corrupted version of that input. Instead of copying $x$ to $\hat{x}$, the model receives a noisy input $\tilde{x}$ and learns to reconstruct the original clean input $x$. $$ \tilde{x} \sim q(\tilde{x}\mid x) $$ $$ z = f_\theta(\tilde{x}) $$ $$ \hat{x} = g_\phi(z) $$ The training objective is $$ L(x,\hat{x}) = \|x - \hat{x}\|^2. $$ The corruption process $q(\tilde{x}\mid x)$...

Softmax and Output Activations

Many neural networks produce raw scores. These scores are called logits. A logit can be any real number. It may be negative, positive, small, or large. For classification, we usually need to convert logits into probabilities. Output activation functions perform this conversion. The most important output activation for multi-class classification is softmax. It maps a vector of real-valued scores into a vector of positive values that sum to 1. Logits...
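
The mapping from logits to probabilities can be written directly from the definition $\text{softmax}(z)_k = e^{z_k} / \sum_j e^{z_j}$. The sketch below is plain Python for illustration. Subtracting the largest logit before exponentiating is a standard trick that leaves the result unchanged while preventing overflow, and the example logits are made up.

```python
import math


def softmax(logits):
    """Map real-valued scores to positive values that sum to 1."""
    shift = max(logits)  # subtracting the max does not change the result
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, -1.0])
print(probs)       # larger logits receive larger probabilities
print(sum(probs))  # the values sum to 1 up to floating-point rounding
```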

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Transformer Encoders

A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors. It is used when the whole input sequence is available at once and each position may attend to every other position. Transformer encoders are common in text understanding, image understanding, speech representation learning, retrieval, classification, tagging, and multimodal systems. BERT-style language models, Vision Transformers, and many embedding models...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Uncertainty Estimation

Uncertainty estimation measures how much confidence a model should place in its own predictions. In ordinary supervised learning, a model usually returns a point prediction or a probability distribution over classes. In uncertainty-aware learning, the model also reports how reliable that prediction is. This matters because high accuracy on a test set does not guarantee safe behavior under distribution shift, noisy inputs, missing features, adversarial perturbations, or rare cases. A...

Writes › Book › Deep Learning with PyTorch › Part VII ›

Chapter 24

Sections 24.1 Text Classification 24.2 Named Entity Recognition 24.3 Question Answering 24.4 Summarization 24.5 Information Retrieval 24.6 Dialogue Systems 24.7 Cross-Lingual Transfer

Writes › Book › Deep Learning with PyTorch › Part III ›

Chapter 11

Sections 11.1 L1 and L2 Regularization 11.2 Early Stopping 11.3 Dropout 11.4 Data Augmentation 11.5 Label Smoothing 11.6 Stochastic Depth 11.7 Mixup and CutMix 11.8 Stochastic Depth

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Sparse Expert Architectures

Dense transformers activate every parameter for every token. As models become larger, this approach becomes increasingly expensive. A trillion-parameter dense model would require enormous compute for every forward pass, even if only part of the model is needed for a given token. Sparse expert architectures address this problem by activating only a subset of parameters for each token. The most common form is the Mixture-of-Experts transformer, usually abbreviated as MoE....

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 28 ›

Restricted Boltzmann Machines

A restricted Boltzmann machine, or RBM, is a simplified Boltzmann machine with a bipartite structure. The restriction removes all connections between units of the same type. Visible units do not connect to other visible units, and hidden units do not connect to other hidden units. This restriction makes inference and sampling tractable enough for practical training. RBMs were historically important in early deep learning systems. They were used for unsupervised...

Rectified Linear Units

The rectified linear unit, usually called ReLU, is the most widely used activation function in modern deep learning. ReLU transformed neural network training because it greatly reduced optimization difficulties that appeared in deep sigmoid and tanh networks. The ReLU function is simple: $$ \mathrm{ReLU}(x)=\max(0,x). $$ Unlike sigmoid and tanh, ReLU does not saturate for positive inputs. This allows gradients to propagate more effectively through deep networks. Definition of ReLU The...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 32 ›

Exercises

Conceptual Questions Explain the difference between parameter scaling, data scaling, and compute scaling. Why do scaling laws often follow approximate power-law behavior? What is compute-optimal training? Why can a smaller model trained on more data outperform a larger model trained on less data? Explain the difference between training-time scaling and inference-time scaling. Why is attention complexity quadratic in sequence length for standard transformers? What are the main bottlenecks in large-scale...

Tensor Memory Layout and Performance

A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory. Most PyTorch code can be written without thinking about memory layout. However, layout becomes important when code becomes slow, when view() fails, when a tensor is noncontiguous, or when we write performance-critical training and inference...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Additive Attention

Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation to allow a decoder to selectively focus on different encoder states during generation. The key idea is simple. Instead of measuring similarity between vectors using only a dot product, additive attention learns a small neural network that computes how well a query and a key match. This approach is sometimes called Bahdanau...

Writes › Book › Deep Learning with PyTorch › Part V ›

Chapter 20

Sections 20.1 Motivation for Attention 20.2 Additive Attention 20.3 Dot-Product Attention 20.4 Self-Attention 20.5 Cross-Attention 20.6 Multi-Head Attention 20.7 Attention Complexity 20.8 Summary and Further Reading

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Self-Attention

Self-attention is attention applied within a single sequence. The queries, keys, and values all come from the same input. Each position in the sequence computes a new representation by looking at other positions in that same sequence. This is the main operation inside transformer encoders and decoders. It lets every token exchange information with every other token in one layer. From Attention to Self-Attention In general attention, we may have...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 12 ›

Grid Search

Grid search is one of the simplest methods for hyperparameter optimization. The idea is straightforward: define a finite set of candidate values for each hyperparameter, construct every possible combination, train a model for each configuration, and select the configuration with the best validation performance. Although modern deep learning systems often use more advanced methods, grid search remains important because it is easy to implement, easy to reason about, reproducible, and...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 19 ›

Neural Machine Translation

Neural machine translation maps a sentence in one language to a sentence in another language using a neural sequence model. The model receives a source sentence and generates a target sentence. For example: source: I like cats. target: J'aime les chats. This is a conditional generation problem. The model learns the probability of a target sentence given a source sentence: $$ p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)....

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 26 ›

Distributed Data Parallel

Distributed Data Parallel, usually abbreviated as DDP, is PyTorch’s primary system for synchronous multi-GPU training. DDP extends ordinary data parallelism to distributed environments while minimizing Python overhead and communication inefficiency. The central design principle is simple: one process controls one device. Each process owns a complete replica of the model, computes gradients locally, and synchronizes gradients with other processes during backpropagation. Compared with older single-process approaches such as nn.DataParallel,...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 13 ›

Classification Pipelines

Image classification assigns one label, or a small set of labels, to an image. A model receives an image tensor as input and produces class scores as output. The class with the largest score is usually taken as the prediction. A classification pipeline is the complete path from raw image files to trained model predictions. It includes data storage, preprocessing, batching, model definition, loss computation, optimization, validation, checkpointing, and inference....

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 19 ›

Teacher Forcing

Teacher forcing is a training method for autoregressive sequence models. It is used when a model generates an output sequence one token at a time, but during training we already know the correct output sequence. In a sequence-to-sequence task, the model learns $$ p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x). $$ At step $t$, the decoder predicts $y_t$ using the input sequence $x$ and the previous target tokens...

Reverse-Mode Differentiation

Reverse-mode differentiation is the method used by backpropagation. It computes derivatives by first evaluating a function forward, then propagating gradient information backward from the output to the inputs. This method is especially useful in deep learning because neural networks usually have many parameters and one scalar loss. Reverse-mode differentiation can compute the gradient of one scalar output with respect to millions or billions of parameters efficiently. The Problem Setting Assume...

Contrastive Objectives

Contrastive objectives train a model by comparing examples. Instead of learning only from an input and its target, the model learns which examples should be close together and which examples should be far apart. These objectives are central in self-supervised learning, metric learning, retrieval, representation learning, multimodal learning, and modern embedding systems. The basic idea is: $$ \text{similar examples should have similar representations} $$ and $$ \text{dissimilar examples should have...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Named Entity Recognition

Named entity recognition, or NER, is the task of finding spans of text that refer to entities and assigning each span a type. Common entity types include people, organizations, locations, dates, products, medical terms, legal references, gene names, and monetary amounts. For example: Apple hired John Smith in California. A named entity recognizer may produce: Span Entity type Apple ORG John Smith PERSON California LOCATION NER is a sequence labeling...

Stochastic Gradient Descent

Stochastic gradient descent, usually abbreviated as SGD, is the standard form of gradient-based training used in deep learning. It updates parameters using a small random subset of the training data instead of the full dataset. The full training objective is $$ L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i). $$ Full-batch gradient descent computes $$ \nabla_\theta L(\theta) $$ using all $N$ examples. SGD estimates this gradient using one example or a minibatch....

Dynamic Computation Graphs

Deep learning models are built from sequences of mathematical operations. During training, the system must compute not only the forward result of these operations, but also derivatives with respect to model parameters. PyTorch achieves this through dynamic computation graphs. A computation graph is a directed graph where nodes represent operations or tensors, and edges represent data dependencies between them. When a tensor operation is executed, PyTorch records how the result...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

Dropout

Dropout is a regularization method that randomly removes parts of a neural network during training. More precisely, it sets selected activations to zero with some probability. The model must learn useful predictions without depending too heavily on any single hidden unit. A dropout layer takes an activation tensor $h$ and samples a binary mask $m$. Each entry of the mask is either 0 or 1. During training, the output is...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 23 ›

Retrieval-Augmented Generation

Retrieval-augmented generation, usually abbreviated RAG, combines a language model with an external information retrieval system. Instead of relying only on knowledge stored in model parameters, the system retrieves relevant documents at inference time and places them into the model’s context. The core pattern is: user question -> retrieve relevant evidence -> condition the model on evidence -> generate answer RAG is useful because pretrained models have limited, static, and imperfect...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 10 ›

Residual Connections

Residual connections allow a layer or block to add its input directly to its output. Instead of forcing a block to learn a complete transformation from scratch, the block learns a correction to the input. A residual block has the form $$ y = x + F(x), $$ where $x$ is the input, $F(x)$ is a learned transformation, and $y$ is the output. The function $F$ may be a stack...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 13 ›

Fine-Tuning Pretrained Models

Fine-tuning adapts a pretrained model to a target dataset by continuing training from learned weights instead of starting from random initialization. In transfer learning, we may train only a new classifier head. In fine-tuning, we update part or all of the pretrained backbone as well. Fine-tuning is useful when the target task is close enough to the pretraining task that learned features remain useful, but different enough that the model...

Writes › Book › Deep Learning with PyTorch ›

Part IX

Chapters Chapter 29 Chapter 30 Chapter 31 Chapter 32

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Named Entity Recognition

Named entity recognition, usually abbreviated NER, identifies spans of text that refer to named or typed entities. Typical entity types include people, organizations, locations, dates, products, events, quantities, and domain-specific terms. For example: Ada Lovelace worked with Charles Babbage in London. A named entity recognizer may produce: Span Entity type Ada Lovelace PERSON Charles Babbage PERSON London LOCATION NER is a sequence labeling problem. Unlike text classification, which assigns...

Writes › Book › Deep Learning with PyTorch ›

Part III

Chapters Chapter 10 Chapter 11 Chapter 12 Chapter 13

Cross-Entropy Loss

Cross-entropy loss is the standard loss function for classification. It measures how well a model’s predicted class distribution matches the true class label. In regression, the target is usually a real number. In classification, the target is a class. For example, an image classifier may choose one label from $$ \{\text{cat}, \text{dog}, \text{car}, \text{tree}\}. $$ A neural network does not usually output the class directly. It outputs a vector of...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Attention Complexity

Attention gives a model direct access between positions in a sequence. This direct access is powerful, but it has a cost. Standard self-attention compares every position with every other position, so its memory and compute grow quadratically with sequence length. For short and medium sequences, this cost is acceptable. For long documents, high-resolution images, long audio, video, code repositories, and retrieval contexts, attention cost becomes one of the main limits....

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Pretraining Objectives

A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case. In language modeling, the objective is usually self-supervised: labels are created directly from raw text. No human annotator needs to label each example. The corpus itself supplies the targets. For example, in next-token prediction, the input is $$ x_{1:t} $$ and the target is $$ x_{t+1}. $$ In masked...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

Residual Networks

Residual networks are convolutional networks built from blocks with skip connections. A skip connection passes the input of a block directly to its output, usually by addition. This gives the network a direct path for information and gradients. The central residual form is $$ y = F(x) + x. $$ Here $x$ is the block input, $F(x)$ is the learned residual function, and $y$ is the block output. The block...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 17 ›

Sequence Modeling Applications

Recurrent neural networks were among the first deep learning architectures capable of handling variable-length sequential data. Before transformers became dominant, recurrent models formed the foundation of modern systems for language processing, speech recognition, machine translation, handwriting recognition, time-series forecasting, and many other domains. Even today, recurrent methods remain useful when: streaming computation is required, memory must remain compact, latency is critical, or data naturally arrives sequentially. This section surveys the...

Writes › Book › Deep Learning with PyTorch › Part VIII ›

Chapter 27

Sections 27.1 Forward Diffusion Processes 27.2 Reverse Denoising Processes 27.3 Score Matching 27.4 Noise Schedules 27.5 Latent Diffusion 27.6 Text-to-Image Systems 27.7 Video Diffusion Systems 27.8 Diffusion Transformers

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 22 ›

Neural Language Models

Statistical language models estimate probabilities from discrete counts. Neural language models replace count tables with differentiable functions parameterized by neural networks. Instead of memorizing exact token sequences, the model learns continuous representations that generalize across similar contexts. A neural language model defines a conditional probability distribution $$ p_\theta(x_t \mid x_{1:t-1}), $$ where $\theta$ denotes the model parameters. These parameters are learned from data by maximizing the likelihood of observed sequences....

PyTorch Versus Other Frameworks

PyTorch is one of several major frameworks for deep learning. A framework provides the basic machinery needed to define models, run tensor operations, compute gradients, optimize parameters, load data, distribute training, and deploy trained models. The main frameworks in modern deep learning include PyTorch, TensorFlow, JAX, Keras, and specialized inference runtimes such as ONNX Runtime and TensorRT. Each framework has a different design center. Framework Design center Typical use PyTorch...

Supervised Learning

Supervised learning is the central paradigm of modern machine learning and deep learning. In supervised learning, a model learns a mapping from inputs to outputs using examples where the correct outputs are already known. A supervised learning system receives pairs of data: $$ (x, y), $$ where $x$ is the input and $y$ is the target output or label. The goal is to learn a function $$ f_\theta(x) \approx y,...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Question Answering

Question answering is the task of producing an answer to a question. The input may contain only the question, or it may contain both a question and one or more passages that may contain the answer. Examples: Question: Who wrote The Origin of Species? Answer: Charles Darwin Question: What does dropout do? Answer: It randomly disables units during training to reduce co-adaptation and improve generalization. Question answering systems are used...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 20 ›

Dot-Product Attention

Dot-product attention uses an inner product to measure how well a query matches a key. It is simpler than additive attention and maps efficiently to matrix multiplication. This efficiency is one reason it became the standard attention mechanism in transformers. The mechanism follows the same retrieval pattern introduced earlier: Compare queries with keys. Normalize the comparison scores. Use the resulting weights to combine values. The difference is the scoring function....

Self-Supervised Learning

Self-supervised learning is a form of learning where the training signal is created from the data itself. The dataset does not need human-written labels, but the model still receives a prediction task. The usual dataset contains only inputs: $$ \mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}. $$ A self-supervised method transforms each input into an input-target pair: $$ x \longrightarrow (\tilde{x}, y_{\text{pretext}}). $$ Here $\tilde{x}$ is the model input and $y_{\text{pretext}}$...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 12 ›

Neural Architecture Search

Neural architecture search, or NAS, is the process of automatically searching for model architectures. Ordinary hyperparameter optimization usually tunes values such as learning rate, batch size, dropout, or weight decay. NAS searches the structure of the network itself. Architecture choices include the number of layers, hidden width, convolution kernel sizes, attention heads, skip connections, normalization placement, activation functions, and block types. In large models, architecture search may also include mixture-of-experts...

Writes › Book › Deep Learning with PyTorch › Part VII › Chapter 24 ›

Summarization

Summarization is the task of producing a shorter version of one or more source texts while preserving the important information. The input may be a news article, a scientific paper, a legal document, a support thread, a meeting transcript, a code review, or a set of retrieved passages. The output is a compact text that should be faithful, readable, and appropriate for the user’s purpose. Examples: Input: A long news...

Automatic Differentiation Engines

An automatic differentiation engine is the system that records numerical operations and computes derivatives from them. In PyTorch, this system is called autograd. It is responsible for building the backward graph during the forward pass and executing gradient computation when backward() is called. Automatic differentiation is different from symbolic differentiation and numerical differentiation. Symbolic differentiation manipulates formulas. Numerical differentiation estimates derivatives using small perturbations. Automatic differentiation evaluates exact derivative rules...

Writes › Book › Deep Learning with PyTorch › Part IX ›

Chapter 30

Sections 30.1 Adversarial Examples 30.2 Distribution Shift 30.3 Saliency Maps 30.4 Attribution Methods 30.5 Mechanistic Interpretability 30.6 Model Editing

Writes › Book › Deep Learning with PyTorch › Part VI ›

Chapter 22

Sections 22.1 Statistical Language Models 22.2 Neural Language Models 22.3 Autoregressive Modeling 22.4 Masked Language Modeling 22.5 Tokenization Systems 22.6 Subword Methods 22.7 Embeddings and Output Projections 22.8 Pretraining Objectives

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 30 ›

Adversarial Examples

An adversarial example is an input that has been deliberately modified so that a model makes a wrong prediction, while the modification is small enough that a human observer still sees the original object. For an image classifier, an adversarial example may look like a normal image of a panda to a human, but the model may classify it as a gibbon. For a text classifier, a small character substitution...

Logistic Regression

Linear regression predicts a real number. Logistic regression predicts a probability for binary classification. A binary classification problem has two possible labels: $$ y \in \{0,1\}. $$ Examples include spam versus not spam, fraud versus legitimate, disease versus no disease, and click versus no click. The model receives an input vector $$ x \in \mathbb{R}^d $$ and predicts the probability that the label is 1. $$ \hat{p} = P(y =...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Positional Encoding

Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order. If we permute the input sequence and apply the same self-attention operation, the attention mechanism still compares all tokens in the same content-based way. A transformer therefore needs an additional signal that tells it where each token appears. Positional encoding is the mechanism that injects order information into a transformer. It gives the model...

Writes › Book › Deep Learning with PyTorch › Part VIII › Chapter 25 ›

Denoising Autoencoders

A denoising autoencoder learns to reconstruct a clean input from a corrupted version of that input. Instead of copying $x$ to $\hat{x}$, the model receives a noisy input $\tilde{x}$ and must recover the original $x$. The encoder maps the corrupted input to a latent representation: $$ z = f_\theta(\tilde{x}). $$ The decoder reconstructs the clean input: $$ \hat{x} = g_\phi(z). $$ The training objective is $$ \min_{\theta,\phi} \frac{1}{N} \sum_{i=1}^N |x_i...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 17 ›

Bidirectional Networks

Standard recurrent neural networks process sequences in one direction, usually from left to right. At time step $t$, the hidden state summarizes only the past: $$ h_t = f(h_{t-1}, x_t). $$ This is appropriate for causal prediction tasks such as language generation, where future tokens are unavailable. However, many sequence tasks are not causal. When labeling or analyzing a sequence, the entire input is already known. In such cases, future...

Writes › Book › Deep Learning with PyTorch › Part VI › Chapter 21 ›

Residual and Normalization Layers

Transformer layers are deep stacks of attention and feedforward blocks. Without additional structure, such stacks are difficult to optimize. Activations may grow or shrink across layers. Gradients may become unstable. Early layers may be overwritten by later layers. Residual connections and normalization layers are used to make deep transformers trainable. They are simple mechanisms, but they strongly affect model stability, depth, learning speed, and final quality. The Role of Residual...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 19 ›

Speech Recognition Systems

Speech recognition maps an acoustic signal to a text sequence. The input is continuous audio. The output is discrete symbols: characters, subword tokens, words, or phonemes. A speech recognition system receives a waveform $$ x = (x_1, x_2, \ldots, x_N) $$ and predicts a token sequence $$ y = (y_1, y_2, \ldots, y_T). $$ The input length $N$ is usually much larger than the output length $T$. A few seconds...

Writes › Book › Deep Learning with PyTorch › Part III › Chapter 11 ›

Data Augmentation

Data augmentation is a regularization method that creates modified versions of training examples while preserving their labels. Instead of changing the model or adding a penalty to the loss, data augmentation changes the training distribution seen by the model. For an image classifier, a cat remains a cat after small crops, flips, color changes, or mild rotations. For a speech model, the spoken word remains the same after small background...

Structure of a PyTorch Project

A PyTorch project should separate concerns. Model code should define computation. Data code should prepare examples and batches. Training code should connect models, data, losses, optimizers, logging, checkpoints, and evaluation. This separation keeps experiments readable and reduces hidden coupling. A small project can start as one file. That is acceptable for a short experiment. As soon as the project has multiple datasets, models, or training runs, a clearer structure becomes...

Training, Validation, and Test Sets

A machine learning dataset is usually divided into three parts: a training set, a validation set, and a test set. Each part has a different role. The training set is used to fit model parameters. The validation set is used to make design choices. The test set is used only for final evaluation. This separation is necessary because a model can perform well on examples it has already seen while...

Writes › Book › Deep Learning with PyTorch › Part V › Chapter 19 ›

Beam Search

Beam search is a decoding algorithm for autoregressive sequence models. It is used when a model must generate a sequence, but greedy decoding is too narrow. In greedy decoding, the model chooses the most likely token at every step: $$ \hat{y}_t = \arg\max_k p(y_t = k \mid \hat{y}_{<t}, x). $$ This can fail because a locally best token may lead to a poor full sequence. Beam search keeps...

Writes › Book › Deep Learning with PyTorch › Part V ›

Chapter 19

Sections 19.1 Encoder-Decoder Architectures 19.2 Teacher Forcing 19.3 Beam Search 19.4 Neural Machine Translation 19.5 Speech Recognition Systems

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 14 ›

Convolution Operations

Convolution is the central operation in convolutional neural networks. It gives a neural network a way to process spatial data, such as images, by applying the same small pattern detector across many locations. Instead of connecting every input value to every output value, a convolutional layer uses local connections and shared parameters. This makes convolutional networks efficient and well suited to data with spatial structure. An edge in the top-left...

Self-Supervised Objectives

Self-supervised learning trains a model using supervision constructed from the data itself. Instead of requiring human labels, the training task is derived from structure already present in the input. A language model predicts missing or future tokens. A vision model may predict whether two augmented images came from the same source image. An audio model may predict masked spectrogram frames. A multimodal model may match an image with its caption....

The Perceptron Algorithm

The perceptron is one of the earliest algorithms for binary classification. It learns a linear decision boundary by updating its weights whenever it makes a mistake. Unlike logistic regression, the perceptron does not predict calibrated probabilities. It predicts a class label directly. Its purpose is simple: find a hyperplane that separates two classes, when such a hyperplane exists. Binary Labels For the perceptron, it is convenient to use labels $$...

Writes › Book › Deep Learning with PyTorch › Part IX › Chapter 29 ›

Summary and Further Reading

Probabilistic deep learning extends neural networks with explicit probability models. Instead of producing only point estimates, a probabilistic model represents uncertainty, likelihood, latent structure, or a posterior distribution over parameters. This chapter covered five core ideas. First, Bayesian neural networks treat weights as random variables. A prior describes plausible parameters before observing data. The posterior updates this belief after seeing data. Prediction averages over plausible networks rather than relying on...

Practical Activation Selection

Activation functions should be chosen for the architecture, loss, initialization, normalization, and training scale. There is no universal best activation. The right choice depends on what the layer must do. A useful rule is: use simple activations for simple architectures, smooth activations for large transformer-style models, and bounded activations when the output must stay in a specific range. Hidden-Layer Activations Hidden layers need nonlinear functions that preserve useful gradients. In...

Writes › Book › Deep Learning with PyTorch › Part IV › Chapter 16 ›

Machine Translation

Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language. For example: Source language Target language the cat sat on the mat le chat s'est assis sur le tapis good morning buenos días where is the station? 駅はどこですか Modern neural machine translation systems are usually sequence-to-sequence models built with transformers. The core problem...

Tensor Arithmetic and Broadcasting

Tensor arithmetic is the basic computation layer of PyTorch. Neural networks are built from additions, multiplications, reductions, matrix products, reshapes, and nonlinear functions. Higher-level layers such as nn.Linear, nn.Conv2d, and nn.MultiheadAttention are composed from these lower-level tensor operations. This section studies arithmetic at the tensor level. The goal is to understand which operations are elementwise, which operations reduce dimensions, and which operations combine axes through linear algebra. Elementwise...

Writes › Book ›

Deep Learning with PyTorch

Part I. PyTorch Foundations Chapter 1. Introduction to Deep Learning and PyTorch 1.1 What Is Deep Learning 1.2 The PyTorch Ecosystem 1.3 Dynamic Computation Graphs 1.4 Tensor-Based Computation 1.5 GPUs and Accelerators 1.6 PyTorch Versus Other Frameworks 1.7 Installing and Configuring PyTorch 1.8 Structure of a PyTorch Project Chapter 2. Tensors and Tensor Operations 2.1 Creating Tensors 2.2 Tensor Shapes and Dimensions 2.3 Tensor Arithmetic 2.4 Broadcasting Rules 2.5 Indexing...

Stochastic Depth

Stochastic depth is a regularization method for deep residual networks. During training, it randomly skips entire residual blocks. Instead of dropping individual activations, as dropout does, stochastic depth drops whole computational paths. A standard residual block computes $$ y = x + F(x), $$ where $x$ is the input and $F(x)$ is the residual branch. With stochastic depth, the residual branch is randomly kept or removed during training: $$ y...
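As a rough illustration of the idea, the sketch below wraps an arbitrary residual branch and skips it at random during training. The class name `StochasticDepthBlock` and the parameter `drop_prob` are hypothetical. Rescaling the kept branch by `1 / keep_prob` is one common convention and is an assumption here; other formulations rescale at evaluation time instead.

```python
import torch
from torch import nn


class StochasticDepthBlock(nn.Module):
    """Residual block y = x + F(x) whose branch F is randomly skipped in training."""

    def __init__(self, branch: nn.Module, drop_prob: float = 0.2):
        super().__init__()
        if not 0.0 <= drop_prob < 1.0:
            raise ValueError("drop_prob must be in [0, 1)")
        self.branch = branch
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # At evaluation time the full block is always used.
        if not self.training or self.drop_prob == 0.0:
            return x + self.branch(x)
        keep_prob = 1.0 - self.drop_prob
        # One random draw per forward pass: the whole branch is kept or dropped.
        if torch.rand(()).item() >= keep_prob:
            return x  # branch dropped, only the identity path remains
        # Assumed convention: rescale the kept branch so its expected
        # contribution matches the unscaled branch used at evaluation time.
        return x + self.branch(x) / keep_prob


block = StochasticDepthBlock(nn.Linear(4, 4), drop_prob=0.5)
x = torch.randn(3, 4)
block.train()
y_train = block(x)   # either x, or x plus the rescaled branch output
block.eval()
y_eval = block(x)    # always x + branch(x)
```

This sketch drops the branch for the whole mini-batch at once; a per-sample variant would draw one random value per row instead.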

Pipeline Parallelism

Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time. Instead of sending one full batch through stage 1, then stage 2, then stage 3, pipeline parallelism divides the batch into smaller microbatches. While one microbatch is processed by a later stage, another microbatch can be processed by an earlier stage. The...
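The overlap between stages can be seen in a toy forward-only schedule. The sketch below is pure Python and runs no model. It assumes every stage takes exactly one time step per microbatch, which is an idealization, and the function name `pipeline_schedule` is hypothetical.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return schedule[t][s]: the microbatch stage s processes at step t, or None if idle."""
    num_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(num_steps):
        row = []
        for s in range(num_stages):
            # Microbatch m reaches stage s at step m + s, so stage s sees m = t - s.
            m = t - s
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule


for step, row in enumerate(pipeline_schedule(num_stages=3, num_microbatches=4)):
    print(step, row)
```

With 3 stages and 4 microbatches the schedule has 6 steps, and all three stages are busy at the same time in the middle steps. Sending the whole batch as a single unit would keep only one stage busy at any moment.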

Search Spaces

Hyperparameter optimization begins by deciding what may vary. This set of possible choices is called the search space. Before we run grid search, random search, Bayesian optimization, or any automated method, we must define the hyperparameters, their allowed values, and the rules that make some combinations valid or invalid. A hyperparameter is a value chosen outside the training process. It controls how the model is built or how training is...
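One way to write down such a space is a plain dictionary of hyperparameters plus a validity check. Everything in the sketch below is hypothetical: the hyperparameter names, their ranges, and the rule that nonzero momentum is only allowed with SGD are made up for illustration.

```python
import math
import random

# Hypothetical search space: each entry names a hyperparameter and its allowed values.
SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "optimizer": ("choice", ["sgd", "adam"]),
    "momentum": ("choice", [0.0, 0.9]),
}


def sample_value(spec, rng):
    """Draw one value according to a (kind, ...) specification."""
    kind = spec[0]
    if kind == "choice":
        return rng.choice(spec[1])
    if kind == "log_uniform":
        low, high = spec[1], spec[2]
        return 10 ** rng.uniform(math.log10(low), math.log10(high))
    raise ValueError(f"unknown spec kind: {kind}")


def is_valid(config):
    """Illustrative rule: nonzero momentum is only allowed together with SGD."""
    return config["optimizer"] == "sgd" or config["momentum"] == 0.0


def sample_config(rng, max_tries=100):
    """Rejection sampling: redraw until the combination satisfies the rules."""
    for _ in range(max_tries):
        config = {name: sample_value(spec, rng) for name, spec in SEARCH_SPACE.items()}
        if is_valid(config):
            return config
    raise RuntimeError("no valid configuration found")


rng = random.Random(0)
print(sample_config(rng))
```

Keeping the validity rule separate from the value ranges means the same space definition can be reused by grid search, random search, or another method.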
