Minimal Reverse Mode Engine

Reverse mode automatic differentiation computes derivatives by traversing the program backward after evaluation. Unlike forward mode, which propagates tangents alongside...

Minimal Reverse Mode Engine

Reverse mode automatic differentiation computes derivatives by traversing the program backward after evaluation. Unlike forward mode, which propagates tangents alongside values, reverse mode separates computation into two phases:

  1. Forward evaluation
  2. Backward gradient propagation

The forward pass computes values and records dependency information. The backward pass propagates sensitivities from outputs toward inputs.

For a scalar-valued function

$$ f : \mathbb{R}^n \to \mathbb{R} $$

reverse mode computes the full gradient efficiently:

$$ \nabla f(x) $$

This efficiency is the reason reverse mode dominates machine learning and deep neural network training.

The Core Idea

Suppose:

$$ z = f(x, y) $$

During evaluation we compute intermediate values:

x, y -> intermediates -> z

Reverse mode introduces an adjoint for each intermediate variable:

$$ \bar{v} = \frac{\partial z}{\partial v} $$

The adjoint measures how sensitive the final output is to that variable.

The backward pass starts from:

$$ \bar{z} = 1 $$

and propagates derivatives backward through every operation.

This is simply repeated application of the chain rule.

A Minimal Computational Graph

A reverse mode engine typically represents computations as nodes.

A minimal node structure:

type Node struct {
    Value float64
    Grad  float64

    Prev []*Node
    Backward func()
}

Value stores the primal value.

Grad stores the accumulated adjoint:

$$ \frac{\partial output}{\partial node} $$

Prev stores dependencies.

Backward defines how gradients propagate to parents.

This is enough to build a minimal reverse mode engine.

Variables and Constants

Variables are nodes with user-provided values.

func Var(x float64) *Node {
    return &Node{
        Value: x,
    }
}

Constants are identical structurally. They simply receive no meaningful gradient unless connected to outputs.

func Const(x float64) *Node {
    return &Node{
        Value: x,
    }
}

In minimal systems there is often no distinction between variables and constants at the node level.

Addition

Consider:

$$ z = x + y $$

Then:

$$ \frac{\partial z}{\partial x} = 1 $$

$$ \frac{\partial z}{\partial y} = 1 $$

The implementation:

func Add(a, b *Node) *Node {
    out := &Node{
        Value: a.Value + b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += out.Grad
        b.Grad += out.Grad
    }

    return out
}

The key idea:

Parents accumulate contributions from child gradients.

Reverse mode is fundamentally gradient accumulation.

Multiplication

For:

$$ z = xy $$

we have:

$$ \frac{\partial z}{\partial x} = y $$

$$ \frac{\partial z}{\partial y} = x $$

Implementation:

func Mul(a, b *Node) *Node {
    out := &Node{
        Value: a.Value * b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += b.Value * out.Grad
        b.Grad += a.Value * out.Grad
    }

    return out
}

Each backward rule applies the local derivative multiplied by the incoming adjoint.

This is the reverse chain rule.

Elementary Functions

For:

$$ y = \sin(x) $$

we know:

$$ \frac{dy}{dx} = \cos(x) $$

Implementation:

func Sin(x *Node) *Node {
    out := &Node{
        Value: math.Sin(x.Value),
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += math.Cos(x.Value) * out.Grad
    }

    return out
}

For exponentials:

func Exp(x *Node) *Node {
    v := math.Exp(x.Value)

    out := &Node{
        Value: v,
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += v * out.Grad
    }

    return out
}

Every primitive operation defines:

  • forward evaluation
  • local backward propagation

Nothing more is required.

Example Computation

Consider:

$$ f(x) = x^2 + 3x + 2 $$

Implementation:

func F(x *Node) *Node {
    return Add(
        Add(
            Mul(x, x),
            Mul(Const(3), x),
        ),
        Const(2),
    )
}

Evaluation:

func main() {
    x := Var(5)

    y := F(x)

    Backprop(y)

    fmt.Println("value:", y.Value)
    fmt.Println("gradient:", x.Grad)
}

Expected result:

value: 42
gradient: 13

The forward pass computes values. The backward pass computes gradients.

Topological Ordering

Backward propagation must occur in reverse dependency order.

Example:

x -> a -> b -> output

We cannot propagate gradients through a before processing b.

A minimal engine therefore performs a topological traversal.

func TopoSort(
    n *Node,
    visited map[*Node]bool,
    order *[]*Node,
) {
    if visited[n] {
        return
    }

    visited[n] = true

    for _, p := range n.Prev {
        TopoSort(p, visited, order)
    }

    *order = append(*order, n)
}

This produces nodes in forward dependency order.

The backward pass iterates in reverse:

func Backprop(root *Node) {
    var order []*Node

    TopoSort(
        root,
        map[*Node]bool{},
        &order,
    )

    root.Grad = 1

    for i := len(order)-1; i >= 0; i-- {
        if order[i].Backward != nil {
            order[i].Backward()
        }
    }
}

This is the core reverse mode algorithm.

Gradient Accumulation

A subtle but important detail:

a.Grad += ...

not:

a.Grad = ...

Gradients must accumulate because variables may influence outputs through multiple paths.

Example:

$$ f(x) = x^2 + x $$

The variable $x$ contributes through two independent branches.

Reverse mode sums all contributions:

$$ \frac{df}{dx} = \frac{\partial(x^2)}{\partial x} + \frac{\partial x}{\partial x} $$

This accumulation property is essential.

Computational Complexity

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R} $$

Forward mode:

  • computes one directional derivative per pass
  • full gradient costs roughly $O(n)$ passes

Reverse mode:

  • computes the full gradient in one backward pass

This makes reverse mode asymptotically favorable for:

  • scalar losses
  • large parameter spaces
  • neural networks

However, reverse mode must store intermediate values and graph structure from the forward pass.

Memory becomes the central systems problem.

Minimal Engine Architecture

A tiny reverse mode engine needs only:

Component Purpose
Node Stores value, gradient, dependencies
Primitive ops Define local derivative rules
Topological traversal Orders backward propagation
Gradient accumulation Combines contributions
Backward pass Executes reverse chain rule

This minimal structure already supports:

  • scalar gradients
  • arbitrary expression graphs
  • nested compositions
  • reusable primitives

Modern deep learning frameworks are vastly more complex, but their core mechanism is still this architecture.

Dynamic Graph Construction

One important property of this design:

The graph is constructed during execution.

This is called a dynamic computation graph.

Advantages:

  • ordinary language control flow
  • easy debugging
  • natural recursion
  • flexible runtime behavior

Disadvantages:

  • runtime overhead
  • graph allocation costs
  • reduced compiler optimization opportunities

Frameworks such as entity["software","PyTorch","deep learning framework"] popularized this style.

Static graph systems instead separate graph construction from execution.

Correctness Intuition

Suppose node $v$ contributes to output $y$.

The adjoint:

$$ \bar{v} = \frac{\partial y}{\partial v} $$

For every edge:

$$ v \to u $$

the chain rule gives:

$$ \frac{\partial y}{\partial v} = \sum_u \frac{\partial y}{\partial u} \frac{\partial u}{\partial v} $$

Reverse mode computes exactly this recurrence.

Each node receives contributions from downstream nodes and accumulates them.

The algorithm is therefore a graph-based implementation of the multivariable chain rule.

Memory Limitations

A minimal reverse mode engine stores every intermediate node until the backward pass finishes.

This causes memory growth proportional to computation size.

Large models introduce several problems:

Problem Cause
Activation memory explosion Stored intermediates
Long backward latency Sequential traversal
Graph allocation overhead Dynamic node creation
GPU synchronization costs Dependency ordering
Recomputation tradeoffs Checkpointing decisions

Most modern AD systems spend more engineering effort on memory management than derivative rules.

Comparison with Forward Mode

Property Forward Mode Reverse Mode
Derivative direction Inputs → outputs Outputs → inputs
Best for Few inputs Few outputs
Memory usage Low Higher
Gradient cost for scalar output O(n) passes O(1) backward pass
Graph storage Usually unnecessary Required
Jacobian-vector product Natural Indirect
Vector-Jacobian product Indirect Natural

Both modes are implementations of the chain rule. They differ only in traversal direction and accumulation strategy.

Minimal Reverse Mode Engine Summary

A minimal reverse mode engine consists of:

  • nodes storing values and gradients
  • primitive operations with local backward rules
  • topological traversal
  • gradient accumulation

The engine evaluates programs normally during the forward pass, then replays local derivative rules backward across the graph.

This architecture scales from tiny educational systems to industrial deep learning runtimes. The surrounding infrastructure changes dramatically at scale, but the mathematical core remains almost identical.