Reverse Accumulation Algorithms
Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse...
Reverse Accumulation Algorithms
Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse pass traverses that recorded structure backward and accumulates adjoint contributions into each variable.
The word accumulation is important. A variable may influence the final output through many downstream paths. Reverse mode computes the total derivative by summing contributions from all such paths.
For a scalar-valued function
$$ f : \mathbb{R}^n \to \mathbb{R}, $$
reverse accumulation computes
$$ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right) $$
using one backward traversal.
Forward Evaluation and Recording
Consider a straight-line program:
$$ v_1 = x_1 x_2, $$
$$ v_2 = \sin(v_1), $$
$$ v_3 = v_2 + x_1, $$
$$ y = v_3. $$
The forward pass computes intermediate values in order.
| Step | Expression |
|---|---|
| 1 | $v_1 = x_1 x_2$ |
| 2 | $v_2 = \sin(v_1)$ |
| 3 | $v_3 = v_2 + x_1$ |
| 4 | $y = v_3$ |
During execution, the system records:
- operation type;
- input references;
- output reference;
- primal values needed later.
The resulting trace is a topologically ordered sequence of operations.
Reverse Accumulation Principle
For every variable $v_i$, define adjoint
$$ \bar v_i = \frac{\partial y}{\partial v_i}. $$
The reverse pass begins with
$$ \bar y = 1. $$
Then each operation propagates adjoints backward using local derivative rules.
The chain rule gives
$$ \bar v_i = \sum_j \bar v_j \frac{\partial v_j}{\partial v_i}, $$
where the sum ranges over all immediate children of $v_i$ in the computational graph.
This equation defines reverse accumulation.
Reverse Sweep Example
Start from the previous program.
Forward values:
$$ v_1 = x_1 x_2, $$
$$ v_2 = \sin(v_1), $$
$$ v_3 = v_2 + x_1, $$
$$ y = v_3. $$
Initialize all adjoints to zero:
$$ \bar x_1 = 0, \qquad \bar x_2 = 0, \qquad \bar v_1 = 0, \qquad \bar v_2 = 0, \qquad \bar v_3 = 0. $$
Seed output:
$$ \bar y = \bar v_3 = 1. $$
Now traverse backward.
Step 1: Addition Node
Since
$$ v_3 = v_2 + x_1, $$
the reverse rule is
$$ \bar v_2 \mathrel{+}= \bar v_3, $$
$$ \bar x_1 \mathrel{+}= \bar v_3. $$
Therefore,
$$ \bar v_2 = 1, \qquad \bar x_1 = 1. $$
Step 2: Sine Node
Since
$$ v_2 = \sin(v_1), $$
the reverse rule is
$$ \bar v_1 \mathrel{+}= \bar v_2 \cos(v_1). $$
Thus,
$$ \bar v_1 = \cos(v_1). $$
Step 3: Multiplication Node
Since
$$ v_1 = x_1 x_2, $$
the reverse rules are
$$ \bar x_1 \mathrel{+}= \bar v_1 x_2, $$
$$ \bar x_2 \mathrel{+}= \bar v_1 x_1. $$
Substituting:
$$ \bar x_1 = 1 + x_2\cos(v_1), $$
$$ \bar x_2 = x_1\cos(v_1). $$
Since
$$ v_1 = x_1 x_2, $$
the final derivatives are
$$ \frac{\partial y}{\partial x_1} = 1 + x_2\cos(x_1x_2), $$
$$ \frac{\partial y}{\partial x_2} = x_1\cos(x_1x_2). $$
Generic Reverse Accumulation Algorithm
The reverse accumulation procedure can be expressed abstractly.
Suppose the forward pass generated operations
$$ o_1, o_2, \ldots, o_k. $$
Each operation defines
$$ v_j = g_j(v_{p_1}, \ldots, v_{p_r}). $$
The reverse algorithm:
- initialize all adjoints to zero;
- set output adjoint to one;
- process operations in reverse order;
- apply local pullbacks;
- accumulate contributions into parent adjoints.
Pseudocode:
forward:
for op in operations:
compute output
record operation metadata
reverse:
for variable:
adjoint = 0
output.adjoint = 1
for op in reverse(operations):
for input in op.inputs:
input.adjoint += (
op.output.adjoint
* local_derivative(op, input)
)
This structure is independent of the specific primitive operations.
Local Reverse Rules
Every primitive operation defines a reverse accumulation rule.
Addition
$$ z = x + y $$
Reverse:
$$ \bar x \mathrel{+}= \bar z, $$
$$ \bar y \mathrel{+}= \bar z. $$
Subtraction
$$ z = x - y $$
Reverse:
$$ \bar x \mathrel{+}= \bar z, $$
$$ \bar y \mathrel{+}= -\bar z. $$
Multiplication
$$ z = xy $$
Reverse:
$$ \bar x \mathrel{+}= \bar z y, $$
$$ \bar y \mathrel{+}= \bar z x. $$
Division
$$ z = \frac{x}{y} $$
Reverse:
$$ \bar x \mathrel{+}= \frac{\bar z}{y}, $$
$$ \bar y \mathrel{+}= -\bar z \frac{x}{y^2}. $$
Exponential
$$ z = e^x $$
Reverse:
$$ \bar x \mathrel{+}= \bar z e^x. $$
Often implemented as
$$ \bar x \mathrel{+}= \bar z z, $$
using the stored forward output.
Sine
$$ z = \sin(x) $$
Reverse:
$$ \bar x \mathrel{+}= \bar z \cos(x). $$
Each primitive only requires local derivative knowledge.
Graph Interpretation
The reverse accumulation algorithm computes gradients by traversing the computational graph backward.
At each node:
- receive output adjoint;
- compute local contributions;
- distribute contributions into inputs.
The full gradient emerges from repeated local applications of the chain rule.
This locality is one of the most important properties of reverse mode systems.
Sparse Dependency Propagation
Many programs have sparse dependency structure.
Example:
$$ y = x_1x_2 + x_3. $$
Variable $x_4$ does not influence $y$.
Thus,
$$ \frac{\partial y}{\partial x_4} = 0. $$
Reverse accumulation naturally respects sparsity because adjoints only propagate along dependency edges.
Nodes disconnected from the output never receive nonzero adjoints.
In-Place Adjoint Updates
Most implementations use mutable adjoint buffers.
Operationally:
adjoint[input] += contribution
This avoids repeated allocation and permits efficient accumulation.
However, parallel execution becomes more difficult because several operations may attempt to update the same adjoint simultaneously.
Parallel reverse mode therefore requires:
- synchronization;
- atomic accumulation;
- graph partitioning;
- reduction strategies.
Reverse Topological Ordering
The reverse sweep requires valid dependency order.
Suppose
$$ v_i \to v_j $$
in the forward graph.
Then reverse propagation requires
$$ \bar v_j $$
before computing contributions into
$$ \bar v_i. $$
Thus the reverse pass processes nodes in reverse topological order.
Straight-line programs automatically satisfy this property because the forward execution trace already records operations in topological sequence.
Complexity of Reverse Accumulation
Suppose evaluating the original program costs
$$ C. $$
Reverse accumulation usually costs a small constant multiple of $C$.
The total cost becomes approximately
| Phase | Cost |
|---|---|
| Forward pass | $C$ |
| Reverse pass | $\alpha C$ |
| Total | $(1+\alpha)C$ |
Typical values:
$$ 2 \le \alpha \le 5. $$
The precise value depends on:
- primitive operation complexity;
- memory traffic;
- tensor kernel efficiency;
- recomputation strategy.
Reverse Accumulation and Memory
Reverse accumulation requires access to forward information.
The reverse rules for many operations depend on saved primal values:
| Operation | Required Data |
|---|---|
| multiplication | inputs |
| division | denominator |
| exponential | output or input |
| matrix factorization | decomposition state |
The system must therefore preserve enough information from the forward pass.
This creates the central tradeoff in reverse mode:
| Strategy | Cost |
|---|---|
| Save everything | high memory |
| Recompute values | high compute |
Checkpointing algorithms balance this tradeoff.
Reverse Accumulation as Backward Linear Propagation
Each local reverse rule applies a transposed Jacobian.
Suppose
$$ z = g(x). $$
The local Jacobian is
$$ J_g(x). $$
Forward mode propagates tangents:
$$ \dot z = J_g(x)\dot x. $$
Reverse accumulation propagates adjoints:
$$ \bar x = J_g(x)^\top \bar z. $$
Thus the reverse pass computes repeated transposed linear transformations.
The global reverse accumulation algorithm is therefore the composition of local transposed Jacobian actions over the computational graph.
Conceptual Summary
Reverse accumulation algorithms operationalize the multivariate chain rule.
The forward pass builds a dependency trace and computes primal values.
The reverse pass:
- starts from output sensitivities;
- traverses operations backward;
- applies local pullbacks;
- accumulates adjoints into dependencies.
Every gradient computed by reverse mode emerges from repeated local adjoint accumulation over the reverse computational graph.