Intermediate Variables
Intermediate variables are the named values created between program inputs and program outputs. They make automatic differentiation mechanical.
Consider:
y = sin(x1 * x2) + exp(x2)
A straight-line version is:
v1 = x1
v2 = x2
v3 = v1 * v2
v4 = sin(v3)
v5 = exp(v2)
v6 = v4 + v5
y = v6
The expression has been decomposed into elementary assignments. Each assignment has one local derivative rule. AD does not need to reason about the whole expression at once.
Variables as Program State
At runtime, each intermediate variable stores a primal value. The primal value is the ordinary value computed by the original program.
For example, with $x_1 = 2$ and $x_2 = 3$:
v1 = 2
v2 = 3
v3 = 6
v4 = sin(6)
v5 = exp(3)
v6 = sin(6) + exp(3)
Automatic differentiation augments this state.
In forward mode, each variable also stores a tangent:
$$ (v_i, \dot v_i) $$
In reverse mode, each variable eventually receives an adjoint:
$$ \bar v_i = \frac{\partial y}{\partial v_i} $$
The intermediate variable gives AD a place to attach derivative information.
Naming Subexpressions
Intermediate variables name subexpressions. This prevents repeated work and gives the computation a graph structure.
Without intermediate variables:
$$ y = \sin(x_1x_2) + \exp(x_2) $$
With intermediate variables:
$$ v_3 = x_1x_2 $$
$$ v_4 = \sin(v_3) $$
$$ v_5 = \exp(x_2) $$
$$ y = v_4 + v_5 $$
The variable $v_3$ is used as the input to $\sin$. The variable $v_5$ is computed independently from $v_3$. These dependencies determine the derivative flow.
Local Derivative Rules
Each intermediate assignment defines a local map.
For
v = a * b
the local differential is:
$$ dv = b\,da + a\,db. $$
For
v = sin(a)
the local differential is:
$$ dv = \cos(a)\,da. $$
For
v = exp(a)
the local differential is:
$$ dv = \exp(a)\,da. $$
The AD engine applies these rules line by line. It does not need symbolic simplification.
Forward Mode View
Forward mode propagates tangents with the primal computation.
For
v3 = v1 * v2
the tangent rule is:
$$ \dot v_3 = \dot v_1 v_2 + v_1 \dot v_2. $$
For
v4 = sin(v3)
the tangent rule is:
$$ \dot v_4 = \cos(v_3)\dot v_3. $$
For
v6 = v4 + v5
the tangent rule is:
$$ \dot v_6 = \dot v_4 + \dot v_5. $$
The intermediate variables carry both values and tangent values through the same evaluation order.
Reverse Mode View
Reverse mode first computes all intermediate primal values. Then it walks backward and accumulates adjoints.
For
v6 = v4 + v5
the reverse rule is:
$$ \bar v_4 += \bar v_6 $$
$$ \bar v_5 += \bar v_6 $$
For
v5 = exp(v2)
the reverse rule is:
$$ \bar v_2 += \bar v_5 \exp(v_2) $$
For
v4 = sin(v3)
the reverse rule is:
$$ \bar v_3 += \bar v_4 \cos(v_3) $$
For
v3 = v1 * v2
the reverse rule is:
$$ \bar v_1 += \bar v_3 v_2 $$
$$ \bar v_2 += \bar v_3 v_1 $$
Intermediate variables are necessary because the backward pass reads primal values computed during the forward pass. The rules above use $v_1$, $v_2$, and $v_3$.
Storage Requirements
Forward mode can often discard intermediate derivative state once it has been consumed. Reverse mode usually cannot.
Reverse mode needs enough information to replay local derivative rules backward. For each instruction, it may need:
| Stored item | Purpose |
|---|---|
| Operation code | Select the derivative rule |
| Input variable IDs | Know where adjoints flow |
| Output variable ID | Read the output adjoint |
| Input primal values | Evaluate local derivatives |
| Shape and dtype metadata | Apply tensor derivative rules |
| Alias and mutation metadata | Preserve program semantics |
This stored execution record is commonly called a tape.
Common Subexpressions
Intermediate variables also expose sharing.
Compare:
y = sin(x * x) + cos(x * x)
A naive expression tree may compute $x*x$ twice. A straight-line program can compute it once:
v1 = x
v2 = v1 * v1
v3 = sin(v2)
v4 = cos(v2)
v5 = v3 + v4
y = v5
The adjoint of $v_2$ receives contributions from both uses:
$$ \bar v_2 += \bar v_3 \cos(v_2) $$
$$ \bar v_2 += -\bar v_4 \sin(v_2) $$
This accumulation is central to reverse mode. When one variable is used by many later operations, its adjoint is the sum of all downstream contributions.
Single Assignment Form
AD is easiest when every intermediate variable is assigned exactly once.
Good:
v1 = x
v2 = v1 * v1
v3 = v2 + 1
Harder:
v = x
v = v * v
v = v + 1
The second program mutates v. To differentiate it cleanly, an AD system often converts it into single assignment form:
v1 = x
v2 = v1 * v1
v3 = v2 + 1
Single assignment form makes data dependencies explicit. It also prevents ambiguity in reverse mode, where the old value of a variable may be needed after the variable has been overwritten.
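To make the overwriting problem concrete, the sketch below differentiates both versions of the program. The mutating version has to save the old value of `v` before it is overwritten; the single assignment version simply reads `v1`. The function names and the explicit `saved` stack are assumptions for illustration, not a description of how any specific AD system handles mutation.

```go
package main

import "fmt"

// gradMutating differentiates v = x; v = v*v; v = v+1 in reverse mode.
// The old value of v is pushed onto a stack before it is overwritten.
func gradMutating(x float64) (y, dx float64) {
	var saved []float64
	v := x
	saved = append(saved, v) // the product rule will need this value
	v = v * v
	v = v + 1 // adding a constant needs no saved value

	adj := 1.0 // adjoint of the final v
	// Reverse of v = v + 1: the adjoint passes through unchanged.
	// Reverse of v = v * v: pop the value v held before the write.
	old := saved[len(saved)-1]
	saved = saved[:len(saved)-1]
	adj = adj*old + adj*old
	return v, adj
}

// gradSSA differentiates the single assignment version of the same program.
func gradSSA(x float64) (y, dx float64) {
	v1 := x
	v2 := v1 * v1
	v3 := v2 + 1

	b3 := 1.0
	b2 := b3            // v3 = v2 + 1
	b1 := b2*v1 + b2*v1 // v2 = v1 * v1, and v1 is still available
	return v3, b1
}

func main() {
	y1, d1 := gradMutating(3)
	y2, d2 := gradSSA(3)
	fmt.Println(y1, d1, y2, d2)
}
```

Both functions return the same value and derivative, but only the second can be differentiated without extra bookkeeping.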
Intermediates in Tensor Programs
In tensor programs, intermediate variables may be large arrays.
v1 = matmul(x, w)
v2 = add(v1, b)
v3 = relu(v2)
v4 = matmul(v3, u)
y = loss(v4, target)
Here, each $v_i$ may contain millions of numbers. Reverse mode often stores these tensors because the backward pass needs them.
For example, ReLU requires knowing which entries were positive:
v3 = relu(v2)
The backward rule is:
$$ \bar v_2 = \bar v_3 \odot 1_{v_2 > 0}. $$
The mask depends on the primal value $v_2$. The system can either store $v_2$, store a compressed mask, or recompute $v_2$ during backward execution.
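The compressed-mask option can be sketched with plain slices. The names `reluForward` and `reluBackward` are hypothetical, and a `[]bool` stands in for what a real system would likely pack into bits.

```go
package main

import "fmt"

// reluForward returns relu(v2) and a mask recording which entries were positive.
// Only the mask has to be kept for the backward pass, not v2 itself.
func reluForward(v2 []float64) (v3 []float64, mask []bool) {
	v3 = make([]float64, len(v2))
	mask = make([]bool, len(v2))
	for i, x := range v2 {
		if x > 0 {
			v3[i] = x
			mask[i] = true
		}
	}
	return v3, mask
}

// reluBackward passes the output adjoint through wherever the mask is set
// and leaves zero elsewhere.
func reluBackward(v3Bar []float64, mask []bool) []float64 {
	v2Bar := make([]float64, len(v3Bar))
	for i, g := range v3Bar {
		if mask[i] {
			v2Bar[i] = g
		}
	}
	return v2Bar
}

func main() {
	v3, mask := reluForward([]float64{-1.5, 0, 2, 0.5})
	v2Bar := reluBackward([]float64{10, 20, 30, 40}, mask)
	fmt.Println(v3, v2Bar) // prints: [0 0 2 0.5] [0 0 30 40]
}
```

Storing one boolean per element instead of one float is the kind of saving that matters when $v_2$ holds millions of entries.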
Lifetime of Intermediate Variables
An intermediate variable has a lifetime.
In the forward computation, its lifetime begins when it is computed. It ends when no later operation needs it.
In reverse mode, the lifetime may extend much longer because the backward pass may need the primal value.
This creates a memory problem. Large AD systems must decide which intermediates to store, which to recompute, and which to discard. This is the basis of checkpointing.
Minimal Implementation Model
A small AD engine can represent intermediate variables as integer IDs.
type VarID int

type Value struct {
	Primal  float64
	Tangent float64
}
A forward-mode multiplication rule can be written as:
func mul(a, b Value) Value {
	return Value{
		Primal:  a.Primal * b.Primal,
		Tangent: a.Tangent*b.Primal + a.Primal*b.Tangent,
	}
}
For reverse mode, the variable needs an adjoint slot:
type Node struct {
	Primal   float64
	Adj      float64
	Prev     []VarID
	Backward func(outAdj float64)
}
A multiplication node records enough information to propagate gradients backward:
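// mul appends a product node to the tape and returns its ID.
func mul(tape *[]Node, a, b VarID) VarID {
	nodes := *tape
	// Copy the operand primals now; the backward rule needs them later.
	av := nodes[a].Primal
	bv := nodes[b].Primal
	out := VarID(len(nodes))
	*tape = append(nodes, Node{
		Primal: av * bv,
		Prev:   []VarID{a, b},
		Backward: func(outAdj float64) {
			// Index through the tape pointer, not the local slice: a later
			// append may reallocate the backing array and leave a captured
			// slice pointing at stale nodes.
			(*tape)[a].Adj += outAdj * bv
			(*tape)[b].Adj += outAdj * av
		},
	})
	return out
}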
This simplified code shows the idea but omits important engineering details, especially closure capture, mutation safety, tensor storage, and concurrency.
Design Rule
Intermediate variables should make dependencies explicit.
A good AD representation answers four questions for each value:
| Question | Example |
|---|---|
| How was this value computed? | v3 = mul(v1, v2) |
| Which values does it depend on? | v1, v2 |
| Which later values use it? | v4, v7 |
| What derivative rule applies? | product rule |
Once these questions are explicit, automatic differentiation becomes an execution discipline rather than a symbolic manipulation problem.
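As a rough illustration, three of the four questions map directly onto fields of a record, and the third ("which later values use it?") can be derived by inverting the input lists. The `Value` layout and the `users` helper below are hypothetical; IDs here start at 0 rather than 1.

```go
package main

import "fmt"

// Value records how one intermediate was computed.
type Value struct {
	Op     string // names the operation and therefore the derivative rule
	Inputs []int  // IDs of the values this one depends on
}

// users inverts the Inputs lists: users(prog)[i] lists the IDs of the
// values that consume value i. A value used twice by the same operation
// is listed twice.
func users(prog []Value) [][]int {
	out := make([][]int, len(prog))
	for id, v := range prog {
		for _, in := range v.Inputs {
			out[in] = append(out[in], id)
		}
	}
	return out
}

func main() {
	// y = sin(x1*x2) + exp(x2) with IDs 0..5.
	prog := []Value{
		{Op: "input"},                   // 0: x1
		{Op: "input"},                   // 1: x2
		{Op: "mul", Inputs: []int{0, 1}}, // 2
		{Op: "sin", Inputs: []int{2}},    // 3
		{Op: "exp", Inputs: []int{1}},    // 4
		{Op: "add", Inputs: []int{3, 4}}, // 5
	}
	for id, u := range users(prog) {
		fmt.Printf("value %d (%s) is used by %v\n", id, prog[id].Op, u)
	}
}
```

A value with more than one user, such as `x2` here, is exactly the case where reverse mode must sum several adjoint contributions.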
Core Idea
Intermediate variables are the handles by which AD controls a computation. They store primal values, expose dependencies, carry tangents in forward mode, receive adjoints in reverse mode, and define the storage requirements of the derivative computation.
A program without explicit intermediates may look compact to a human. A program with explicit intermediates is easier for an AD system to evaluate, transform, store, and differentiate.