Minimal Reverse Mode Engine

Reverse mode automatic differentiation computes derivatives by traversing the program backward after evaluation. Unlike forward mode, which propagates tangents alongside values, reverse mode separates computation into two phases:

Forward evaluation
Backward gradient propagation

The forward pass computes values and records dependency information. The backward pass propagates sensitivities from outputs toward inputs.

For a scalar-valued function

$$ f : \mathbb{R}^n \to \mathbb{R} $$

reverse mode computes the full gradient efficiently:

$$ \nabla f(x) $$

This efficiency is the reason reverse mode dominates machine learning and deep neural network training.

The Core Idea

Suppose:

$$ z = f(x, y) $$

During evaluation we compute intermediate values:

x, y -> intermediates -> z

Reverse mode introduces an adjoint for each intermediate variable:

$$ \bar{v} = \frac{\partial z}{\partial v} $$

The adjoint measures how sensitive the final output is to that variable.

The backward pass starts from:

$$ \bar{z} = 1 $$

and propagates derivatives backward through every operation.

This is simply repeated application of the chain rule.

A Minimal Computational Graph

A reverse mode engine typically represents computations as nodes.

A minimal node structure:

type Node struct {
    Value float64
    Grad  float64

    Prev []*Node
    Backward func()
}

Value stores the primal value.

Grad stores the accumulated adjoint:

$$ \frac{\partial output}{\partial node} $$

Prev stores dependencies.

Backward defines how gradients propagate to parents.

This is enough to build a minimal reverse mode engine.

Variables and Constants

Variables are nodes with user-provided values.

func Var(x float64) *Node {
    return &Node{
        Value: x,
    }
}

Constants are identical structurally. They simply receive no meaningful gradient unless connected to outputs.

func Const(x float64) *Node {
    return &Node{
        Value: x,
    }
}

In minimal systems there is often no distinction between variables and constants at the node level.

Addition

Consider:

$$ z = x + y $$

Then:

$$ \frac{\partial z}{\partial x} = 1 $$

$$ \frac{\partial z}{\partial y} = 1 $$

The implementation:

func Add(a, b *Node) *Node {
    out := &Node{
        Value: a.Value + b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += out.Grad
        b.Grad += out.Grad
    }

    return out
}

The key idea:

Parents accumulate contributions from child gradients.

Reverse mode is fundamentally gradient accumulation.

Multiplication

For:

$$ z = xy $$

we have:

$$ \frac{\partial z}{\partial x} = y $$

$$ \frac{\partial z}{\partial y} = x $$

Implementation:

func Mul(a, b *Node) *Node {
    out := &Node{
        Value: a.Value * b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += b.Value * out.Grad
        b.Grad += a.Value * out.Grad
    }

    return out
}

Each backward rule applies the local derivative multiplied by the incoming adjoint.

This is the reverse chain rule.

Elementary Functions

For:

$$ y = \sin(x) $$

we know:

$$ \frac{dy}{dx} = \cos(x) $$

Implementation:

func Sin(x *Node) *Node {
    out := &Node{
        Value: math.Sin(x.Value),
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += math.Cos(x.Value) * out.Grad
    }

    return out
}

For exponentials:

func Exp(x *Node) *Node {
    v := math.Exp(x.Value)

    out := &Node{
        Value: v,
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += v * out.Grad
    }

    return out
}

Every primitive operation defines:

forward evaluation
local backward propagation

Nothing more is required.

Example Computation

Consider:

$$ f(x) = x^2 + 3x + 2 $$

Implementation:

func F(x *Node) *Node {
    return Add(
        Add(
            Mul(x, x),
            Mul(Const(3), x),
        ),
        Const(2),
    )
}

Evaluation:

func main() {
    x := Var(5)

    y := F(x)

    Backprop(y)

    fmt.Println("value:", y.Value)
    fmt.Println("gradient:", x.Grad)
}

Expected result:

value: 42
gradient: 13

The forward pass computes values. The backward pass computes gradients.

Topological Ordering

Backward propagation must occur in reverse dependency order.

Example:

x -> a -> b -> output

We cannot propagate gradients through a before processing b.

A minimal engine therefore performs a topological traversal.

func TopoSort(
    n *Node,
    visited map[*Node]bool,
    order *[]*Node,
) {
    if visited[n] {
        return
    }

    visited[n] = true

    for _, p := range n.Prev {
        TopoSort(p, visited, order)
    }

    *order = append(*order, n)
}

This produces nodes in forward dependency order.

The backward pass iterates in reverse:

func Backprop(root *Node) {
    var order []*Node

    TopoSort(
        root,
        map[*Node]bool{},
        &order,
    )

    root.Grad = 1

    for i := len(order)-1; i >= 0; i-- {
        if order[i].Backward != nil {
            order[i].Backward()
        }
    }
}

This is the core reverse mode algorithm.

Gradient Accumulation

A subtle but important detail:

a.Grad += ...

not:

a.Grad = ...

Gradients must accumulate because variables may influence outputs through multiple paths.

Example:

$$ f(x) = x^2 + x $$

The variable $x$ contributes through two independent branches.

Reverse mode sums all contributions:

$$ \frac{df}{dx} = \frac{\partial(x^2)}{\partial x} + \frac{\partial x}{\partial x} $$

This accumulation property is essential.

Computational Complexity

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R} $$

Forward mode:

computes one directional derivative per pass
full gradient costs roughly $O(n)$ passes

Reverse mode:

computes the full gradient in one backward pass

This makes reverse mode asymptotically favorable for:

scalar losses
large parameter spaces
neural networks

However, reverse mode must store intermediate values and graph structure from the forward pass.

Memory becomes the central systems problem.

Minimal Engine Architecture

A tiny reverse mode engine needs only:

Component	Purpose
Node	Stores value, gradient, dependencies
Primitive ops	Define local derivative rules
Topological traversal	Orders backward propagation
Gradient accumulation	Combines contributions
Backward pass	Executes reverse chain rule

This minimal structure already supports:

scalar gradients
arbitrary expression graphs
nested compositions
reusable primitives

Modern deep learning frameworks are vastly more complex, but their core mechanism is still this architecture.

Dynamic Graph Construction

One important property of this design:

The graph is constructed during execution.

This is called a dynamic computation graph.

Advantages:

ordinary language control flow
easy debugging
natural recursion
flexible runtime behavior

Disadvantages:

runtime overhead
graph allocation costs
reduced compiler optimization opportunities

Frameworks such as entity["software","PyTorch","deep learning framework"] popularized this style.

Static graph systems instead separate graph construction from execution.

Correctness Intuition

Suppose node $v$ contributes to output $y$.

The adjoint:

$$ \bar{v} = \frac{\partial y}{\partial v} $$

For every edge:

$$ v \to u $$

the chain rule gives:

$$ \frac{\partial y}{\partial v} = \sum_u \frac{\partial y}{\partial u} \frac{\partial u}{\partial v} $$

Reverse mode computes exactly this recurrence.

Each node receives contributions from downstream nodes and accumulates them.

The algorithm is therefore a graph-based implementation of the multivariable chain rule.

Memory Limitations

A minimal reverse mode engine stores every intermediate node until the backward pass finishes.

This causes memory growth proportional to computation size.

Large models introduce several problems:

Problem	Cause
Activation memory explosion	Stored intermediates
Long backward latency	Sequential traversal
Graph allocation overhead	Dynamic node creation
GPU synchronization costs	Dependency ordering
Recomputation tradeoffs	Checkpointing decisions

Most modern AD systems spend more engineering effort on memory management than derivative rules.

Comparison with Forward Mode

Property	Forward Mode	Reverse Mode
Derivative direction	Inputs → outputs	Outputs → inputs
Best for	Few inputs	Few outputs
Memory usage	Low	Higher
Gradient cost for scalar output	O(n) passes	O(1) backward pass
Graph storage	Usually unnecessary	Required
Jacobian-vector product	Natural	Indirect
Vector-Jacobian product	Indirect	Natural

Both modes are implementations of the chain rule. They differ only in traversal direction and accumulation strategy.

Minimal Reverse Mode Engine Summary

A minimal reverse mode engine consists of:

nodes storing values and gradients
primitive operations with local backward rules
topological traversal
gradient accumulation

The engine evaluates programs normally during the forward pass, then replays local derivative rules backward across the graph.

This architecture scales from tiny educational systems to industrial deep learning runtimes. The surrounding infrastructure changes dramatically at scale, but the mathematical core remains almost identical.