Case Studies

This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:

evaluate the primal computation forward;
seed the output adjoint;
propagate adjoints backward;
collect input or parameter gradients.

The examples move from scalar expressions to neural network layers and iterative programs.

Case Study 1: Scalar Expression

Consider

$$ y = \log(1 + x^2). $$

Decompose the computation:

$$ v_1 = x^2, $$

$$ v_2 = 1 + v_1, $$

$$ v_3 = \log(v_2), $$

$$ y = v_3. $$

Forward pass:

Variable	Expression
$v_1$	$x^2$
$v_2$	$1 + v_1$
$v_3$	$\log(v_2)$

Reverse initialization:

$$ \bar v_3 = 1. $$

Backward propagation:

$$ \bar v_2 \mathrel{+}= \bar v_3 \frac{1}{v_2}, $$

$$ \bar v_1 \mathrel{+}= \bar v_2, $$

$$ \bar x \mathrel{+}= \bar v_1 2x. $$

Therefore,

$$ \bar x = \frac{2x}{1+x^2}. $$

Reverse mode recovered the derivative by local rules without symbolic simplification of the original expression.

Case Study 2: Shared Intermediate

Consider

$$ y = x^2 + \sin(x^2). $$

Use a shared intermediate:

$$ v_1 = x^2, $$

$$ v_2 = \sin(v_1), $$

$$ v_3 = v_1 + v_2, $$

$$ y = v_3. $$

The value $v_1$ contributes to the output through two paths. Reverse mode must accumulate both.

Seed:

$$ \bar v_3 = 1. $$

From addition:

$$ \bar v_1 \mathrel{+}= \bar v_3, $$

$$ \bar v_2 \mathrel{+}= \bar v_3. $$

From sine:

$$ \bar v_1 \mathrel{+}= \bar v_2 \cos(v_1). $$

Thus,

$$ \bar v_1 = 1 + \cos(v_1). $$

From square:

$$ \bar x \mathrel{+}= \bar v_1 2x. $$

Therefore,

$$ \frac{dy}{dx} = 2x(1+\cos(x^2)). $$

This example shows why adjoints are added, not assigned.

Case Study 3: Two Inputs, One Output

Let

$$ y = x_1x_2 + \sin(x_1). $$

Wengert list:

$$ v_1 = x_1x_2, $$

$$ v_2 = \sin(x_1), $$

$$ v_3 = v_1 + v_2, $$

$$ y = v_3. $$

Seed:

$$ \bar v_3 = 1. $$

From addition:

$$ \bar v_1 = 1, \qquad \bar v_2 = 1. $$

From sine:

$$ \bar x_1 \mathrel{+}= \cos(x_1). $$

From multiplication:

$$ \bar x_1 \mathrel{+}= x_2, $$

$$ \bar x_2 \mathrel{+}= x_1. $$

Final gradient:

$$ \nabla y = \begin{bmatrix} x_2 + \cos(x_1) \ x_1 \end{bmatrix}. $$

One reverse pass computes both partial derivatives.

Case Study 4: Vector-Jacobian Product

Define

$$ f(x_1,x_2) = \begin{bmatrix} x_1x_2 \ x_1 + x_2^2 \end{bmatrix}. $$

Let the output seed be

$$ u = \begin{bmatrix} u_1 \ u_2 \end{bmatrix}. $$

Reverse mode computes

$$ u^\top J_f(x). $$

Write outputs as:

$$ y_1 = x_1x_2, $$

$$ y_2 = x_1 + x_2^2. $$

Seed output adjoints:

$$ \bar y_1 = u_1, \qquad \bar y_2 = u_2. $$

From $y_1$:

$$ \bar x_1 \mathrel{+}= u_1x_2, $$

$$ \bar x_2 \mathrel{+}= u_1x_1. $$

From $y_2$:

$$ \bar x_1 \mathrel{+}= u_2, $$

$$ \bar x_2 \mathrel{+}= 2u_2x_2. $$

Final result:

$$ u^\top J_f(x) = \begin{bmatrix} u_1x_2 + u_2, \quad u_1x_1 + 2u_2x_2 \end{bmatrix}. $$

Reverse mode computes this product without constructing the full Jacobian as a stored matrix.

Case Study 5: Linear Layer

Consider a batch linear layer:

$$ Y = XW + b. $$

Shapes:

Symbol	Shape
$X$	$B \times d_{\text{in}}$
$W$	$d_{\text{in}} \times d_{\text{out}}$
$b$	$d_{\text{out}}$
$Y$	$B \times d_{\text{out}}$

Let $\bar Y$ be the adjoint arriving from later layers.

Reverse rules:

$$ \bar X = \bar Y W^\top, $$

$$ \bar W = X^\top \bar Y, $$

$$ \bar b = \sum_{i=1}^{B} \bar Y_i. $$

The backward pass has the same numerical character as the forward pass: dense matrix multiplication plus a reduction.

This is why deep learning systems implement reverse rules as tensor kernels, not scalar derivative loops.

Case Study 6: ReLU Layer

Let

$$ Y = \operatorname{ReLU}(X), $$

where

$$ Y_{ij} = \max(0, X_{ij}). $$

The derivative is

$$ \frac{\partial Y_{ij}}{\partial X_{ij}} = \begin{cases} 1 & X_{ij} > 0, \ 0 & X_{ij} < 0. \end{cases} $$

At $X_{ij}=0$, systems choose a convention, commonly zero.

Given output adjoint $\bar Y$, reverse propagation gives

$$ \bar X_{ij} = \bar Y_{ij} \mathbf{1}{X{ij} > 0}. $$

A ReLU backward pass therefore needs either:

the original input $X$;
the output $Y$;
a stored Boolean mask.

The mask often costs less memory than storing the full input, depending on representation.

Case Study 7: Softmax Cross-Entropy

Softmax followed by cross-entropy is common in classification.

Given logits $z \in \mathbb{R}^K$, define

$$ p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}. $$

For one-hot target $t$, the loss is

$$ L = -\sum_i t_i \log p_i. $$

A direct graph would include exponentials, sums, division, logarithms, and multiplication. Reverse mode can differentiate this graph mechanically.

However, most systems use a fused stable backward rule:

$$ \bar z = p - t. $$

For a batch with mean reduction:

$$ \bar Z = \frac{1}{B}(P - T). $$

This custom backward rule is both simpler and more numerically stable than differentiating each primitive separately.

Case Study 8: Simple Recurrent Computation

Consider a recurrence:

$$ h_0 = x, $$

$$ h_{t+1} = \tanh(ah_t + b), \qquad t = 0,\ldots,T-1, $$

$$ L = h_T. $$

Forward pass stores or checkpoints hidden states:

$$ h_0,h_1,\ldots,h_T. $$

Seed:

$$ \bar h_T = 1. $$

Backward through time:

$$ z_t = ah_t + b, $$

$$ h_{t+1} = \tanh(z_t). $$

Reverse rules:

$$ \bar z_t = \bar h_{t+1}(1-h_{t+1}^2), $$

$$ \bar a \mathrel{+}= \bar z_t h_t, $$

$$ \bar b \mathrel{+}= \bar z_t, $$

$$ \bar h_t \mathrel{+}= \bar z_t a. $$

The same parameter $a$ is used at every time step, so its adjoint accumulates over all steps:

$$ \bar a = \sum_{t=0}^{T-1} \bar z_t h_t. $$

This is backpropagation through time.

Case Study 9: Checkpointed Chain

Consider a chain of eight layers:

$$ h_0 \to h_1 \to h_2 \to \cdots \to h_8. $$

Full reverse mode stores all activations:

$$ h_0,h_1,\ldots,h_8. $$

A checkpointed scheme might store:

$$ h_0,h_4,h_8. $$

During backward propagation through the segment $h_4 \to h_8$, the system recomputes:

$$ h_5,h_6,h_7,h_8 $$

from checkpoint $h_4$, then runs the local backward pass.

Then it repeats for $h_0 \to h_4$.

This reduces peak activation memory but adds extra forward computation.

The derivative remains mathematically the same if recomputation faithfully reproduces the original forward values.

Case Study 10: Tiny Reverse Mode Engine

A minimal scalar reverse mode system stores values, adjoints, and backward closures.

import math

class Var:
    def __init__(self, value):
        self.value = float(value)
        self.grad = 0.0
        self.parents = []

    def backward(self):
        topo = []
        seen = set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            topo.append(v)

        visit(self)

        self.grad = 1.0

        for v in reversed(topo):
            for parent, pullback in v.parents:
                parent.grad += pullback(v.grad)

def add(a, b):
    out = Var(a.value + b.value)
    out.parents = [
        (a, lambda g: g),
        (b, lambda g: g),
    ]
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    out.parents = [
        (a, lambda g: g * b.value),
        (b, lambda g: g * a.value),
    ]
    return out

def sin(a):
    out = Var(math.sin(a.value))
    out.parents = [
        (a, lambda g: g * math.cos(a.value)),
    ]
    return out

Using it:

x1 = Var(2.0)
x2 = Var(3.0)

y = add(mul(x1, x2), sin(x1))
y.backward()

print(y.value)
print(x1.grad)
print(x2.grad)

The gradients are:

$$ \bar x_1 = x_2 + \cos(x_1), $$

$$ \bar x_2 = x_1. $$

This small engine illustrates the whole mechanism:

the forward pass builds a graph;
each operation stores a pullback;
backward traversal accumulates gradients.

Chapter Summary

Reverse mode automatic differentiation computes gradients by recording a primal computation and replaying its local derivative rules backward.

The main ideas from this chapter are:

Concept	Role
adjoint	sensitivity of output to an intermediate value
reverse graph	dependency graph traversed backward
vector-Jacobian product	core operation of reverse mode
reverse accumulation	additive propagation of adjoints
tape	recorded execution trace
Wengert list	single-assignment linearized computation
checkpointing	memory reduction by recomputation
backpropagation	reverse mode applied to neural networks

Reverse mode is efficient when many inputs influence a small number of outputs. This is the standard setting in optimization and deep learning: many parameters, one scalar objective.