Dependency Graphs

A dependency graph describes how values in a computation depend on earlier values. Automatic differentiation operates on these dependencies.

Dependency Graphs

A dependency graph describes how values in a computation depend on earlier values. Automatic differentiation operates on these dependencies.

For a straight-line program,

v1 = x1
v2 = x2
v3 = v1 * v2
v4 = sin(v3)
v5 = v4 + v1
y  = v5

the dependency structure is:

v1 ─┬─> v3 ─> v4 ─┐
    │             ├─> v5
v2 ─┘             │
v1 ───────────────┘

Each node represents a variable or operation result. Each edge represents data flow.

The graph determines how derivatives propagate.

Directed Acyclic Structure

A dependency graph for a straight-line program is a directed acyclic graph.

The graph is directed because dependencies have orientation:

v3 = v1 * v2

means $v_3$ depends on $v_1$ and $v_2$, not the reverse.

The graph is acyclic because each variable depends only on variables computed earlier. No variable can depend on itself through a cycle.

This ordering gives a valid evaluation sequence.

If the nodes are ordered:

$$ v_1, v_2, v_3, v_4, v_5 $$

then every node appears after its dependencies.

This ordering is called a topological order.

Dependency Graph as Function Composition

A dependency graph encodes a decomposition of a function into local maps.

For

v3 = v1 * v2
v4 = sin(v3)
v5 = v4 + v1

the overall function is:

$$ y = g_3(g_2(g_1(x))) $$

where:

$$ g_1(x_1, x_2) = x_1x_2 $$

$$ g_2(v_3) = \sin(v_3) $$

$$ g_3(v_4, x_1) = v_4 + x_1 $$

The graph stores the composition structure explicitly.

Automatic differentiation applies the chain rule over this graph.

Nodes and Edges

Different AD systems represent graphs differently, but the conceptual structure is similar.

Element Meaning
Node A computed value
Edge A dependency relation
Source node Input variable
Sink node Output variable
Parent A node used to compute another node
Child A node computed from another node

For

v3 = v1 * v2

the graph contains edges:

$$ v_1 \to v_3 $$

$$ v_2 \to v_3 $$

because $v_3$ depends on both inputs.

Operation Nodes vs Value Nodes

Some systems represent operations explicitly.

Instead of:

v1 ─┐
    ├─> v3
v2 ─┘

they use:

v1 ─┐
    ├─> (*) ─> v3
v2 ─┘

The operation node stores:

Field Example
Operation type multiply
Input references v1, v2
Output reference v3
Derivative rule product rule

Operation-centered graphs are common in compiler IRs and tensor frameworks.

Value-centered graphs are common in minimal AD engines.

Forward Traversal

Forward evaluation follows graph direction.

For

v3 = v1 * v2
v4 = sin(v3)

evaluation order is:

$$ v_1, v_2 \to v_3 \to v_4 $$

Forward mode AD propagates tangent information in this direction.

If:

$$ (\dot v_1, \dot v_2) $$

are known, then:

$$ \dot v_3 $$

can be computed, followed by:

$$ \dot v_4. $$

The graph structure guarantees all required inputs are available.

Reverse Traversal

Reverse mode traverses the graph backward.

Starting from:

$$ \bar y = 1 $$

the adjoint propagates from outputs to inputs.

For:

v4 = sin(v3)

reverse propagation is:

$$ \bar v_3 += \bar v_4 \cos(v_3). $$

Then for:

v3 = v1 * v2

reverse propagation is:

$$ \bar v_1 += \bar v_3 v_2 $$

$$ \bar v_2 += \bar v_3 v_1. $$

The backward traversal accumulates sensitivity information along reversed edges.

Fan-Out and Gradient Accumulation

A node may feed multiple later computations.

Example:

v1 = x
v2 = sin(v1)
v3 = cos(v1)
v4 = v2 + v3

Dependency graph:

        ┌─> v2 ─┐
v1 ─────┤       ├─> v4
        └─> v3 ─┘

The value $v_1$ influences both $v_2$ and $v_3$.

During reverse mode:

$$ \bar v_1 $$

receives contributions from both paths:

$$ \bar v_1 += \bar v_2 \cos(v_1) $$

$$ \bar v_1 += -\bar v_3 \sin(v_1). $$

Thus:

$$ \bar v_1 = \bar v_2 \cos(v_1) - \bar v_3 \sin(v_1). $$

Gradient accumulation is fundamentally graph accumulation.

Shared Subgraphs

Dependency graphs naturally represent shared computation.

For:

v2 = x * x
v3 = sin(v2)
v4 = cos(v2)

the subexpression $x*x$ appears once in the graph.

This prevents redundant work.

Expression trees duplicate repeated subexpressions. Dependency graphs preserve sharing.

This distinction becomes important in large tensor programs where recomputation is expensive.

Tensor Dependency Graphs

In tensor systems, nodes may represent entire tensor operations.

Example:

v1 = matmul(x, w1)
v2 = relu(v1)
v3 = matmul(v2, w2)
v4 = softmax(v3)
y  = cross_entropy(v4, target)

The graph represents tensor dependencies, not scalar dependencies.

Each node may contain:

Field Example
Shape (1024, 4096)
Dtype float32
Device GPU 0
Layout row-major
Operation matmul

Reverse mode traverses the same dependency graph but applies tensor derivative rules.

Dynamic Graphs

Some systems construct dependency graphs dynamically during execution.

Example in a dynamic language:

if x > 0:
    y = x * x
else:
    y = sin(x)

The executed branch determines the graph.

If $x > 0$:

x ─> (*) ─> y

If $x \le 0$:

x ─> sin ─> y

The graph exists only for the executed computation.

Frameworks such as entity["software","PyTorch","deep learning framework"] commonly use this approach.

Static Graphs

Other systems build a graph representation before execution.

Example:

Input -> MatMul -> ReLU -> MatMul -> Output

The graph becomes a compilable program representation.

Advantages include:

Benefit Reason
Optimization Whole-graph analysis
Fusion Combine operations
Scheduling Global execution planning
Memory reuse Lifetime analysis
Device partitioning Multi-device placement

Frameworks such as entity["software","XLA","machine learning compiler"] and entity["software","TensorFlow","machine learning framework"] heavily use static graph compilation.

Graph Representation in Memory

A minimal graph representation might store:

type Node struct {
    ID      int
    Op      Op
    Inputs  []int
    Users   []int
}

Where:

Field Meaning
ID Node identifier
Op Operation type
Inputs Dependencies
Users Consumers of this node

A reverse traversal often iterates over nodes in reverse topological order.

Reverse Topological Order

Suppose the graph order is:

$$ v_1, v_2, v_3, v_4, v_5 $$

Then reverse mode processes:

$$ v_5, v_4, v_3, v_2, v_1. $$

This guarantees that when a node is processed, all downstream adjoint contributions have already been accumulated.

Reverse topological traversal is one of the central execution rules of reverse-mode AD.

Graph Transformations

Modern AD systems frequently transform dependency graphs.

Typical transformations include:

Transformation Purpose
Constant folding Precompute constants
Dead node elimination Remove unused computations
Fusion Combine kernels
Common subexpression elimination Reuse repeated computation
Rematerialization Recompute instead of storing
Shape propagation Infer tensor shapes
Differentiation transform Generate backward graph

In compiler-based AD systems, differentiation itself is often implemented as a graph rewrite.

Graphs and the Chain Rule

The dependency graph is a concrete representation of the chain rule.

For a path:

$$ x \to v_1 \to v_2 \to y $$

the contribution to the derivative is:

$$ \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_1} \frac{\partial v_1}{\partial x}. $$

If multiple paths connect $x$ to $y$, the total derivative is the sum over all paths.

This explains why reverse mode accumulates contributions at merge points.

The graph structure directly determines derivative propagation.

Cycles and Recurrence

Straight-line dependency graphs are acyclic, but real programs may contain loops.

Example:

h_t = tanh(W h_{t-1} + U x_t)

This creates a recurrence relation.

AD systems usually handle recurrence by unrolling execution into a larger acyclic graph:

h0 -> h1 -> h2 -> h3 -> ...

The resulting unrolled graph is again a DAG.

Backpropagation through time is reverse-mode AD on this expanded graph.

Core Idea

A dependency graph makes computation structure explicit. Nodes represent computed values. Edges represent data dependencies. Forward mode moves with graph direction. Reverse mode moves against it.

The graph is the operational form of the chain rule. Automatic differentiation works because derivative propagation follows the same dependency structure as the original computation.