Case Studies
This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:
Case Studies
This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:
- evaluate the primal computation forward;
- seed the output adjoint;
- propagate adjoints backward;
- collect input or parameter gradients.
The examples move from scalar expressions to neural network layers and iterative programs.
Case Study 1: Scalar Expression
Consider
$$ y = \log(1 + x^2). $$
Decompose the computation:
$$ v_1 = x^2, $$
$$ v_2 = 1 + v_1, $$
$$ v_3 = \log(v_2), $$
$$ y = v_3. $$
Forward pass:
| Variable | Expression |
|---|---|
| $v_1$ | $x^2$ |
| $v_2$ | $1 + v_1$ |
| $v_3$ | $\log(v_2)$ |
Reverse initialization:
$$ \bar v_3 = 1. $$
Backward propagation:
$$ \bar v_2 \mathrel{+}= \bar v_3 \frac{1}{v_2}, $$
$$ \bar v_1 \mathrel{+}= \bar v_2, $$
$$ \bar x \mathrel{+}= \bar v_1 2x. $$
Therefore,
$$ \bar x = \frac{2x}{1+x^2}. $$
Reverse mode recovered the derivative by local rules without symbolic simplification of the original expression.
Case Study 2: Shared Intermediate
Consider
$$ y = x^2 + \sin(x^2). $$
Use a shared intermediate:
$$ v_1 = x^2, $$
$$ v_2 = \sin(v_1), $$
$$ v_3 = v_1 + v_2, $$
$$ y = v_3. $$
The value $v_1$ contributes to the output through two paths. Reverse mode must accumulate both.
Seed:
$$ \bar v_3 = 1. $$
From addition:
$$ \bar v_1 \mathrel{+}= \bar v_3, $$
$$ \bar v_2 \mathrel{+}= \bar v_3. $$
From sine:
$$ \bar v_1 \mathrel{+}= \bar v_2 \cos(v_1). $$
Thus,
$$ \bar v_1 = 1 + \cos(v_1). $$
From square:
$$ \bar x \mathrel{+}= \bar v_1 2x. $$
Therefore,
$$ \frac{dy}{dx} = 2x(1+\cos(x^2)). $$
This example shows why adjoints are added, not assigned.
Case Study 3: Two Inputs, One Output
Let
$$ y = x_1x_2 + \sin(x_1). $$
Wengert list:
$$ v_1 = x_1x_2, $$
$$ v_2 = \sin(x_1), $$
$$ v_3 = v_1 + v_2, $$
$$ y = v_3. $$
Seed:
$$ \bar v_3 = 1. $$
From addition:
$$ \bar v_1 = 1, \qquad \bar v_2 = 1. $$
From sine:
$$ \bar x_1 \mathrel{+}= \cos(x_1). $$
From multiplication:
$$ \bar x_1 \mathrel{+}= x_2, $$
$$ \bar x_2 \mathrel{+}= x_1. $$
Final gradient:
$$ \nabla y = \begin{bmatrix} x_2 + \cos(x_1) \ x_1 \end{bmatrix}. $$
One reverse pass computes both partial derivatives.
Case Study 4: Vector-Jacobian Product
Define
$$ f(x_1,x_2) = \begin{bmatrix} x_1x_2 \ x_1 + x_2^2 \end{bmatrix}. $$
Let the output seed be
$$ u = \begin{bmatrix} u_1 \ u_2 \end{bmatrix}. $$
Reverse mode computes
$$ u^\top J_f(x). $$
Write outputs as:
$$ y_1 = x_1x_2, $$
$$ y_2 = x_1 + x_2^2. $$
Seed output adjoints:
$$ \bar y_1 = u_1, \qquad \bar y_2 = u_2. $$
From $y_1$:
$$ \bar x_1 \mathrel{+}= u_1x_2, $$
$$ \bar x_2 \mathrel{+}= u_1x_1. $$
From $y_2$:
$$ \bar x_1 \mathrel{+}= u_2, $$
$$ \bar x_2 \mathrel{+}= 2u_2x_2. $$
Final result:
$$ u^\top J_f(x) = \begin{bmatrix} u_1x_2 + u_2, \quad u_1x_1 + 2u_2x_2 \end{bmatrix}. $$
Reverse mode computes this product without constructing the full Jacobian as a stored matrix.
Case Study 5: Linear Layer
Consider a batch linear layer:
$$ Y = XW + b. $$
Shapes:
| Symbol | Shape |
|---|---|
| $X$ | $B \times d_{\text{in}}$ |
| $W$ | $d_{\text{in}} \times d_{\text{out}}$ |
| $b$ | $d_{\text{out}}$ |
| $Y$ | $B \times d_{\text{out}}$ |
Let $\bar Y$ be the adjoint arriving from later layers.
Reverse rules:
$$ \bar X = \bar Y W^\top, $$
$$ \bar W = X^\top \bar Y, $$
$$ \bar b = \sum_{i=1}^{B} \bar Y_i. $$
The backward pass has the same numerical character as the forward pass: dense matrix multiplication plus a reduction.
This is why deep learning systems implement reverse rules as tensor kernels, not scalar derivative loops.
Case Study 6: ReLU Layer
Let
$$ Y = \operatorname{ReLU}(X), $$
where
$$ Y_{ij} = \max(0, X_{ij}). $$
The derivative is
$$ \frac{\partial Y_{ij}}{\partial X_{ij}} = \begin{cases} 1 & X_{ij} > 0, \ 0 & X_{ij} < 0. \end{cases} $$
At $X_{ij}=0$, systems choose a convention, commonly zero.
Given output adjoint $\bar Y$, reverse propagation gives
$$ \bar X_{ij} = \bar Y_{ij} \mathbf{1}{X{ij} > 0}. $$
A ReLU backward pass therefore needs either:
- the original input $X$;
- the output $Y$;
- a stored Boolean mask.
The mask often costs less memory than storing the full input, depending on representation.
Case Study 7: Softmax Cross-Entropy
Softmax followed by cross-entropy is common in classification.
Given logits $z \in \mathbb{R}^K$, define
$$ p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}. $$
For one-hot target $t$, the loss is
$$ L = -\sum_i t_i \log p_i. $$
A direct graph would include exponentials, sums, division, logarithms, and multiplication. Reverse mode can differentiate this graph mechanically.
However, most systems use a fused stable backward rule:
$$ \bar z = p - t. $$
For a batch with mean reduction:
$$ \bar Z = \frac{1}{B}(P - T). $$
This custom backward rule is both simpler and more numerically stable than differentiating each primitive separately.
Case Study 8: Simple Recurrent Computation
Consider a recurrence:
$$ h_0 = x, $$
$$ h_{t+1} = \tanh(ah_t + b), \qquad t = 0,\ldots,T-1, $$
$$ L = h_T. $$
Forward pass stores or checkpoints hidden states:
$$ h_0,h_1,\ldots,h_T. $$
Seed:
$$ \bar h_T = 1. $$
Backward through time:
$$ z_t = ah_t + b, $$
$$ h_{t+1} = \tanh(z_t). $$
Reverse rules:
$$ \bar z_t = \bar h_{t+1}(1-h_{t+1}^2), $$
$$ \bar a \mathrel{+}= \bar z_t h_t, $$
$$ \bar b \mathrel{+}= \bar z_t, $$
$$ \bar h_t \mathrel{+}= \bar z_t a. $$
The same parameter $a$ is used at every time step, so its adjoint accumulates over all steps:
$$ \bar a = \sum_{t=0}^{T-1} \bar z_t h_t. $$
This is backpropagation through time.
Case Study 9: Checkpointed Chain
Consider a chain of eight layers:
$$ h_0 \to h_1 \to h_2 \to \cdots \to h_8. $$
Full reverse mode stores all activations:
$$ h_0,h_1,\ldots,h_8. $$
A checkpointed scheme might store:
$$ h_0,h_4,h_8. $$
During backward propagation through the segment $h_4 \to h_8$, the system recomputes:
$$ h_5,h_6,h_7,h_8 $$
from checkpoint $h_4$, then runs the local backward pass.
Then it repeats for $h_0 \to h_4$.
This reduces peak activation memory but adds extra forward computation.
The derivative remains mathematically the same if recomputation faithfully reproduces the original forward values.
Case Study 10: Tiny Reverse Mode Engine
A minimal scalar reverse mode system stores values, adjoints, and backward closures.
import math
class Var:
def __init__(self, value):
self.value = float(value)
self.grad = 0.0
self.parents = []
def backward(self):
topo = []
seen = set()
def visit(v):
if id(v) in seen:
return
seen.add(id(v))
for parent, _ in v.parents:
visit(parent)
topo.append(v)
visit(self)
self.grad = 1.0
for v in reversed(topo):
for parent, pullback in v.parents:
parent.grad += pullback(v.grad)
def add(a, b):
out = Var(a.value + b.value)
out.parents = [
(a, lambda g: g),
(b, lambda g: g),
]
return out
def mul(a, b):
out = Var(a.value * b.value)
out.parents = [
(a, lambda g: g * b.value),
(b, lambda g: g * a.value),
]
return out
def sin(a):
out = Var(math.sin(a.value))
out.parents = [
(a, lambda g: g * math.cos(a.value)),
]
return out
Using it:
x1 = Var(2.0)
x2 = Var(3.0)
y = add(mul(x1, x2), sin(x1))
y.backward()
print(y.value)
print(x1.grad)
print(x2.grad)
The gradients are:
$$ \bar x_1 = x_2 + \cos(x_1), $$
$$ \bar x_2 = x_1. $$
This small engine illustrates the whole mechanism:
- the forward pass builds a graph;
- each operation stores a pullback;
- backward traversal accumulates gradients.
Chapter Summary
Reverse mode automatic differentiation computes gradients by recording a primal computation and replaying its local derivative rules backward.
The main ideas from this chapter are:
| Concept | Role |
|---|---|
| adjoint | sensitivity of output to an intermediate value |
| reverse graph | dependency graph traversed backward |
| vector-Jacobian product | core operation of reverse mode |
| reverse accumulation | additive propagation of adjoints |
| tape | recorded execution trace |
| Wengert list | single-assignment linearized computation |
| checkpointing | memory reduction by recomputation |
| backpropagation | reverse mode applied to neural networks |
Reverse mode is efficient when many inputs influence a small number of outputs. This is the standard setting in optimization and deep learning: many parameters, one scalar objective.