Intermediate Variables

Intermediate variables are the named values computed between a program's inputs and its outputs. They let automatic differentiation (AD) proceed mechanically, one assignment at a time.

Consider:

y = sin(x1 * x2) + exp(x2)

A straight-line version is:

v1 = x1
v2 = x2
v3 = v1 * v2
v4 = sin(v3)
v5 = exp(v2)
v6 = v4 + v5
y  = v6

The expression has been decomposed into elementary assignments. Each assignment has one local derivative rule. AD does not need to reason about the whole expression at once.

Variables as Program State

At runtime, each intermediate variable stores a primal value. The primal value is the ordinary value computed by the original program.

For example, with $x_1 = 2$ and $x_2 = 3$:

v1 = 2
v2 = 3
v3 = 6
v4 = sin(6)
v5 = exp(3)
v6 = sin(6) + exp(3)

Automatic differentiation augments this state.

In forward mode, each variable also stores a tangent:

$$ (v_i, \dot v_i) $$

In reverse mode, each variable eventually receives an adjoint:

$$ \bar v_i = \frac{\partial y}{\partial v_i} $$

The intermediate variable gives AD a place to attach derivative information.

Naming Subexpressions

Intermediate variables name subexpressions. This prevents repeated work and gives the computation a graph structure.

Without intermediate variables:

$$ y = \sin(x_1x_2) + \exp(x_2) $$

With intermediate variables:

$$ v_3 = x_1x_2 $$

$$ v_4 = \sin(v_3) $$

$$ v_5 = \exp(x_2) $$

$$ y = v_4 + v_5 $$

The variable $v_3$ is the input to $\sin$, while $v_5$ is computed independently of $v_3$. These dependencies determine how derivatives flow.

Local Derivative Rules

Each intermediate assignment defines a local map.

For

v = a * b

the local differential is:

$$ dv = b\,da + a\,db. $$

For

v = sin(a)

the local differential is:

$$ dv = \cos(a)\,da. $$

For

v = exp(a)

the local differential is:

$$ dv = \exp(a)\,da. $$

The AD engine applies these rules line by line. It does not need symbolic simplification.
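As a worked example, chaining these local rules through the straight-line program for $y = \sin(x_1x_2) + \exp(x_2)$ recovers the full differential without ever differentiating the whole expression at once:

$$ dv_3 = x_2\,dx_1 + x_1\,dx_2 $$

$$ dv_4 = \cos(v_3)\,dv_3 $$

$$ dv_5 = \exp(x_2)\,dx_2 $$

$$ dy = dv_4 + dv_5 = x_2\cos(x_1x_2)\,dx_1 + \bigl(x_1\cos(x_1x_2) + \exp(x_2)\bigr)\,dx_2 $$

The coefficients of $dx_1$ and $dx_2$ in the last line are the two partial derivatives of $y$.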

Forward Mode View

Forward mode propagates tangents with the primal computation.

For

v3 = v1 * v2

the tangent rule is:

$$ \dot v_3 = \dot v_1 v_2 + v_1 \dot v_2. $$

For

v4 = sin(v3)

the tangent rule is:

$$ \dot v_4 = \cos(v_3)\dot v_3. $$

For

v6 = v4 + v5

the tangent rule is:

$$ \dot v_6 = \dot v_4 + \dot v_5. $$

The intermediate variables carry both values and tangent values through the same evaluation order.

Reverse Mode View

Reverse mode first computes all intermediate primal values. Then it walks backward and accumulates adjoints.

For

v6 = v4 + v5

the reverse rule is:

$$ \bar v_4 += \bar v_6 $$

$$ \bar v_5 += \bar v_6 $$

For

v5 = exp(v2)

the reverse rule is:

$$ \bar v_2 += \bar v_5 \exp(v_2) $$

For

v4 = sin(v3)

the reverse rule is:

$$ \bar v_3 += \bar v_4 \cos(v_3) $$

For

v3 = v1 * v2

the reverse rule is:

$$ \bar v_1 += \bar v_3 v_2 $$

$$ \bar v_2 += \bar v_3 v_1 $$

Intermediate variables are necessary because reverse mode needs the primal values $v_1$, $v_2$, and $v_3$ during the backward pass: each local rule above is evaluated at a value computed during the forward pass.

Storage Requirements

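Forward mode can often discard an intermediate's value and tangent as soon as every operation that uses them has run. Reverse mode usually cannot, because the backward pass starts only after the forward pass has finished.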

Reverse mode needs enough information to replay local derivative rules backward. For each instruction, it may need:

| Stored item | Purpose |
| --- | --- |
| Operation code | Select the derivative rule |
| Input variable IDs | Know where adjoints flow |
| Output variable ID | Read the output adjoint |
| Input primal values | Evaluate local derivatives |
| Shape and dtype metadata | Apply tensor derivative rules |
| Alias and mutation metadata | Preserve program semantics |

This stored execution record is commonly called a tape.

Common Subexpressions

Intermediate variables also expose sharing.

Compare:

y = sin(x * x) + cos(x * x)

A naive expression tree may compute $x*x$ twice. A straight-line program can compute it once:

v1 = x
v2 = v1 * v1
v3 = sin(v2)
v4 = cos(v2)
v5 = v3 + v4
y  = v5

The adjoint of $v_2$ receives one contribution from each use:

$$ \bar v_2 += \bar v_3 \cos(v_2) $$

$$ \bar v_2 += -\bar v_4 \sin(v_2) $$

This accumulation is central to reverse mode. When one variable is used by many later operations, its adjoint is the sum of all downstream contributions.

Single Assignment Form

AD is easiest when every intermediate variable is assigned exactly once.

Good:

v1 = x
v2 = v1 * v1
v3 = v2 + 1

Harder:

v = x
v = v * v
v = v + 1

The second program overwrites `v` twice. To differentiate it cleanly, an AD system often converts it into single assignment form:

v1 = x
v2 = v1 * v1
v3 = v2 + 1

Single assignment form makes data dependencies explicit. It also prevents ambiguity in reverse mode, where the old value of a variable may be needed after the variable has been overwritten.
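To see why the overwritten value matters, the sketch below differentiates the mutating program directly. It assumes a simple alternative to renaming: each value is pushed onto a stack just before it is overwritten, and the backward pass pops it. The names `gradMutating` and `saved` are hypothetical.

```go
package main

import "fmt"

// gradMutating computes y = x*x + 1 by overwriting v, and returns dy/dx.
func gradMutating(x float64) (y, dx float64) {
	var saved []float64 // stack of values that are about to be overwritten

	v := x
	saved = append(saved, v) // the product rule will need the pre-square value
	v = v * v
	v = v + 1 // adding a constant needs no saved value

	// Backward pass, in reverse statement order.
	adj := 1.0 // adjoint of the final v

	// v = v + 1: the adjoint passes through unchanged.

	// v = v * v: pop the value v held before squaring.
	old := saved[len(saved)-1]
	saved = saved[:len(saved)-1]
	adj = adj * 2 * old

	return v, adj
}

func main() {
	y, dx := gradMutating(3)
	fmt.Println(y, dx) // prints 10 6
}
```

By the time the backward pass runs, `v` holds 10, not 3; without the saved copy the product rule would be evaluated at the wrong point. Single assignment form avoids the problem by never destroying the old value in the first place.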

Intermediates in Tensor Programs

In tensor programs, intermediate variables may be large arrays.

v1 = matmul(x, w)
v2 = add(v1, b)
v3 = relu(v2)
v4 = matmul(v3, u)
y  = loss(v4, target)

Here, each $v_i$ may contain millions of numbers. Reverse mode often stores these tensors because the backward pass needs them.

For example, ReLU requires knowing which entries were positive:

v3 = relu(v2)

The backward rule is:

$$ \bar v_2 = \bar v_3 \odot 1_{v_2 > 0}. $$

The mask depends on the primal value $v_2$. The system can either store $v_2$, store a compressed mask, or recompute $v_2$ during backward execution.
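The compressed-mask option can be sketched as follows, using flat `[]float64` slices in place of real tensors. The function names `reluForward` and `reluBackward` are illustrative, not taken from any framework.

```go
package main

import "fmt"

// reluForward returns relu(v2) together with a boolean mask that records
// which entries of v2 were strictly positive.
func reluForward(v2 []float64) (v3 []float64, mask []bool) {
	v3 = make([]float64, len(v2))
	mask = make([]bool, len(v2))
	for i, x := range v2 {
		if x > 0 {
			v3[i] = x
			mask[i] = true
		}
	}
	return v3, mask
}

// reluBackward computes the adjoint of v2 from the adjoint of v3,
// using only the stored mask and not v2 itself.
func reluBackward(adjV3 []float64, mask []bool) []float64 {
	adjV2 := make([]float64, len(adjV3))
	for i, g := range adjV3 {
		if mask[i] {
			adjV2[i] = g
		}
	}
	return adjV2
}

func main() {
	v3, mask := reluForward([]float64{-1, 0, 2})
	adjV2 := reluBackward([]float64{10, 20, 30}, mask)
	fmt.Println(v3, mask, adjV2)
}
```

The mask carries one boolean per entry instead of one float; a packed bitset would shrink it further. The entry at exactly zero is given gradient zero here, matching the strict inequality in the formula.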

Lifetime of Intermediate Variables

An intermediate variable has a lifetime.

In the forward computation, its lifetime begins when it is computed. It ends when no later operation needs it.

In reverse mode, the lifetime may extend much longer because the backward pass may need the primal value.

This creates a memory problem. Large AD systems must decide which intermediates to store, which to recompute, and which to discard. Checkpointing is the technique that manages this trade-off between memory and recomputation.

Minimal Implementation Model

A small AD engine can represent intermediate variables as integer IDs.

type VarID int

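// Value is the forward-mode state of one intermediate variable.
type Value struct {
    Primal  float64 // ordinary value computed by the original program
    Tangent float64 // derivative of this value along the seeded input direction
}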

A forward-mode multiplication rule can be written as:

func mul(a, b Value) Value {
    return Value{
        Primal:  a.Primal * b.Primal,
        Tangent: a.Tangent*b.Primal + a.Primal*b.Tangent,
    }
}

For reverse mode, the variable needs an adjoint slot:

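// Node is the reverse-mode state of one intermediate variable on the tape.
type Node struct {
    Primal   float64              // value computed in the forward pass
    Adj      float64              // adjoint, accumulated in the backward pass
    Prev     []VarID              // IDs of the inputs this node depends on
    Backward func(outAdj float64) // pushes outAdj into the inputs' adjoints
}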

A multiplication node records enough information to propagate gradients backward:

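// mul appends a multiplication node to the tape and returns its ID.
func mul(tape *[]Node, a, b VarID) VarID {
    // Copy the input primals now; the backward closure needs them later.
    av := (*tape)[a].Primal
    bv := (*tape)[b].Primal

    out := VarID(len(*tape))

    *tape = append(*tape, Node{
        Primal: av * bv,
        Prev:   []VarID{a, b},
        Backward: func(outAdj float64) {
            // Index through the tape pointer, not a captured slice:
            // a later append may move the nodes to a new backing array.
            (*tape)[a].Adj += outAdj * bv
            (*tape)[b].Adj += outAdj * av
        },
    })

    return out
}

The closure reads the tape through the pointer so that adjoint updates reach the live nodes even after a later append reallocates the slice. This simplified code shows the idea but still omits important engineering details, especially mutation safety, tensor storage, and concurrency.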
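One way to drive such a tape end to end is sketched below as a complete program. The `input` and `backward` helpers are hypothetical additions for illustration, and the closure in `mul` indexes the tape through its pointer so that it stays valid when `append` reallocates.

```go
package main

import "fmt"

type VarID int

type Node struct {
	Primal   float64
	Adj      float64
	Prev     []VarID
	Backward func(outAdj float64)
}

// input appends a leaf node holding x and returns its ID.
func input(tape *[]Node, x float64) VarID {
	*tape = append(*tape, Node{Primal: x})
	return VarID(len(*tape) - 1)
}

// mul appends a product node whose closure applies the product rule.
func mul(tape *[]Node, a, b VarID) VarID {
	av := (*tape)[a].Primal
	bv := (*tape)[b].Primal
	*tape = append(*tape, Node{
		Primal: av * bv,
		Prev:   []VarID{a, b},
		Backward: func(outAdj float64) {
			(*tape)[a].Adj += outAdj * bv
			(*tape)[b].Adj += outAdj * av
		},
	})
	return VarID(len(*tape) - 1)
}

// backward seeds the output adjoint and walks the tape in reverse.
// Nodes are appended in evaluation order, so every user of node i
// has a larger ID and has already run by the time node i is visited.
func backward(tape *[]Node, out VarID) {
	(*tape)[out].Adj = 1
	for i := int(out); i >= 0; i-- {
		n := (*tape)[i]
		if n.Backward != nil {
			n.Backward(n.Adj)
		}
	}
}

func main() {
	var tape []Node
	a := input(&tape, 2)
	b := input(&tape, 3)
	c := mul(&tape, a, b) // c = a*b
	d := mul(&tape, c, a) // d = a*a*b
	backward(&tape, d)
	fmt.Println(tape[a].Adj, tape[b].Adj) // prints 12 4
}
```

Because `a` is used by both products, its adjoint is the sum of two contributions, $6 + 6 = 12$, which matches $\partial(a^2 b)/\partial a = 2ab$.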

Design Rule

Intermediate variables should make dependencies explicit.

A good AD representation answers four questions for each value:

| Question | Example |
| --- | --- |
| How was this value computed? | `v3 = mul(v1, v2)` |
| Which values does it depend on? | `v1`, `v2` |
| Which later values use it? | `v4`, `v7` |
| What derivative rule applies? | product rule |

Once these questions are explicit, automatic differentiation becomes an execution discipline rather than a symbolic manipulation problem.

Core Idea

Intermediate variables are the handles by which AD controls a computation. They store primal values, expose dependencies, carry tangents in forward mode, receive adjoints in reverse mode, and define the storage requirements of the derivative computation.

A program without explicit intermediates may look compact to a human. A program with explicit intermediates is easier for an AD system to evaluate, transform, store, and differentiate.