Reverse Accumulation Algorithms

Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse...

Reverse Accumulation Algorithms

Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse pass traverses that recorded structure backward and accumulates adjoint contributions into each variable.

The word accumulation is important. A variable may influence the final output through many downstream paths. Reverse mode computes the total derivative by summing contributions from all such paths.

For a scalar-valued function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

reverse accumulation computes

$$ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right) $$

using one backward traversal.

Forward Evaluation and Recording

Consider a straight-line program:

$$ v_1 = x_1 x_2, $$

$$ v_2 = \sin(v_1), $$

$$ v_3 = v_2 + x_1, $$

$$ y = v_3. $$

The forward pass computes intermediate values in order.

Step Expression
1 $v_1 = x_1 x_2$
2 $v_2 = \sin(v_1)$
3 $v_3 = v_2 + x_1$
4 $y = v_3$

During execution, the system records:

  1. operation type;
  2. input references;
  3. output reference;
  4. primal values needed later.

The resulting trace is a topologically ordered sequence of operations.

Reverse Accumulation Principle

For every variable $v_i$, define adjoint

$$ \bar v_i = \frac{\partial y}{\partial v_i}. $$

The reverse pass begins with

$$ \bar y = 1. $$

Then each operation propagates adjoints backward using local derivative rules.

The chain rule gives

$$ \bar v_i = \sum_j \bar v_j \frac{\partial v_j}{\partial v_i}, $$

where the sum ranges over all immediate children of $v_i$ in the computational graph.

This equation defines reverse accumulation.

Reverse Sweep Example

Start from the previous program.

Forward values:

$$ v_1 = x_1 x_2, $$

$$ v_2 = \sin(v_1), $$

$$ v_3 = v_2 + x_1, $$

$$ y = v_3. $$

Initialize all adjoints to zero:

$$ \bar x_1 = 0, \qquad \bar x_2 = 0, \qquad \bar v_1 = 0, \qquad \bar v_2 = 0, \qquad \bar v_3 = 0. $$

Seed output:

$$ \bar y = \bar v_3 = 1. $$

Now traverse backward.

Step 1: Addition Node

Since

$$ v_3 = v_2 + x_1, $$

the reverse rule is

$$ \bar v_2 \mathrel{+}= \bar v_3, $$

$$ \bar x_1 \mathrel{+}= \bar v_3. $$

Therefore,

$$ \bar v_2 = 1, \qquad \bar x_1 = 1. $$

Step 2: Sine Node

Since

$$ v_2 = \sin(v_1), $$

the reverse rule is

$$ \bar v_1 \mathrel{+}= \bar v_2 \cos(v_1). $$

Thus,

$$ \bar v_1 = \cos(v_1). $$

Step 3: Multiplication Node

Since

$$ v_1 = x_1 x_2, $$

the reverse rules are

$$ \bar x_1 \mathrel{+}= \bar v_1 x_2, $$

$$ \bar x_2 \mathrel{+}= \bar v_1 x_1. $$

Substituting:

$$ \bar x_1 = 1 + x_2\cos(v_1), $$

$$ \bar x_2 = x_1\cos(v_1). $$

Since

$$ v_1 = x_1 x_2, $$

the final derivatives are

$$ \frac{\partial y}{\partial x_1} = 1 + x_2\cos(x_1x_2), $$

$$ \frac{\partial y}{\partial x_2} = x_1\cos(x_1x_2). $$

Generic Reverse Accumulation Algorithm

The reverse accumulation procedure can be expressed abstractly.

Suppose the forward pass generated operations

$$ o_1, o_2, \ldots, o_k. $$

Each operation defines

$$ v_j = g_j(v_{p_1}, \ldots, v_{p_r}). $$

The reverse algorithm:

  1. initialize all adjoints to zero;
  2. set output adjoint to one;
  3. process operations in reverse order;
  4. apply local pullbacks;
  5. accumulate contributions into parent adjoints.

Pseudocode:

forward:
    for op in operations:
        compute output
        record operation metadata

reverse:
    for variable:
        adjoint = 0

    output.adjoint = 1

    for op in reverse(operations):
        for input in op.inputs:
            input.adjoint += (
                op.output.adjoint
                * local_derivative(op, input)
            )

This structure is independent of the specific primitive operations.

Local Reverse Rules

Every primitive operation defines a reverse accumulation rule.

Addition

$$ z = x + y $$

Reverse:

$$ \bar x \mathrel{+}= \bar z, $$

$$ \bar y \mathrel{+}= \bar z. $$

Subtraction

$$ z = x - y $$

Reverse:

$$ \bar x \mathrel{+}= \bar z, $$

$$ \bar y \mathrel{+}= -\bar z. $$

Multiplication

$$ z = xy $$

Reverse:

$$ \bar x \mathrel{+}= \bar z y, $$

$$ \bar y \mathrel{+}= \bar z x. $$

Division

$$ z = \frac{x}{y} $$

Reverse:

$$ \bar x \mathrel{+}= \frac{\bar z}{y}, $$

$$ \bar y \mathrel{+}= -\bar z \frac{x}{y^2}. $$

Exponential

$$ z = e^x $$

Reverse:

$$ \bar x \mathrel{+}= \bar z e^x. $$

Often implemented as

$$ \bar x \mathrel{+}= \bar z z, $$

using the stored forward output.

Sine

$$ z = \sin(x) $$

Reverse:

$$ \bar x \mathrel{+}= \bar z \cos(x). $$

Each primitive only requires local derivative knowledge.

Graph Interpretation

The reverse accumulation algorithm computes gradients by traversing the computational graph backward.

At each node:

  1. receive output adjoint;
  2. compute local contributions;
  3. distribute contributions into inputs.

The full gradient emerges from repeated local applications of the chain rule.

This locality is one of the most important properties of reverse mode systems.

Sparse Dependency Propagation

Many programs have sparse dependency structure.

Example:

$$ y = x_1x_2 + x_3. $$

Variable $x_4$ does not influence $y$.

Thus,

$$ \frac{\partial y}{\partial x_4} = 0. $$

Reverse accumulation naturally respects sparsity because adjoints only propagate along dependency edges.

Nodes disconnected from the output never receive nonzero adjoints.

In-Place Adjoint Updates

Most implementations use mutable adjoint buffers.

Operationally:

adjoint[input] += contribution

This avoids repeated allocation and permits efficient accumulation.

However, parallel execution becomes more difficult because several operations may attempt to update the same adjoint simultaneously.

Parallel reverse mode therefore requires:

  1. synchronization;
  2. atomic accumulation;
  3. graph partitioning;
  4. reduction strategies.

Reverse Topological Ordering

The reverse sweep requires valid dependency order.

Suppose

$$ v_i \to v_j $$

in the forward graph.

Then reverse propagation requires

$$ \bar v_j $$

before computing contributions into

$$ \bar v_i. $$

Thus the reverse pass processes nodes in reverse topological order.

Straight-line programs automatically satisfy this property because the forward execution trace already records operations in topological sequence.

Complexity of Reverse Accumulation

Suppose evaluating the original program costs

$$ C. $$

Reverse accumulation usually costs a small constant multiple of $C$.

The total cost becomes approximately

Phase Cost
Forward pass $C$
Reverse pass $\alpha C$
Total $(1+\alpha)C$

Typical values:

$$ 2 \le \alpha \le 5. $$

The precise value depends on:

  1. primitive operation complexity;
  2. memory traffic;
  3. tensor kernel efficiency;
  4. recomputation strategy.

Reverse Accumulation and Memory

Reverse accumulation requires access to forward information.

The reverse rules for many operations depend on saved primal values:

Operation Required Data
multiplication inputs
division denominator
exponential output or input
matrix factorization decomposition state

The system must therefore preserve enough information from the forward pass.

This creates the central tradeoff in reverse mode:

Strategy Cost
Save everything high memory
Recompute values high compute

Checkpointing algorithms balance this tradeoff.

Reverse Accumulation as Backward Linear Propagation

Each local reverse rule applies a transposed Jacobian.

Suppose

$$ z = g(x). $$

The local Jacobian is

$$ J_g(x). $$

Forward mode propagates tangents:

$$ \dot z = J_g(x)\dot x. $$

Reverse accumulation propagates adjoints:

$$ \bar x = J_g(x)^\top \bar z. $$

Thus the reverse pass computes repeated transposed linear transformations.

The global reverse accumulation algorithm is therefore the composition of local transposed Jacobian actions over the computational graph.

Conceptual Summary

Reverse accumulation algorithms operationalize the multivariate chain rule.

The forward pass builds a dependency trace and computes primal values.

The reverse pass:

  1. starts from output sensitivities;
  2. traverses operations backward;
  3. applies local pullbacks;
  4. accumulates adjoints into dependencies.

Every gradient computed by reverse mode emerges from repeated local adjoint accumulation over the reverse computational graph.