Deep learning models are built from sequences of mathematical operations.
Deep learning models are built from sequences of mathematical operations. During training, the system must compute not only the forward result of these operations, but also derivatives with respect to model parameters. PyTorch achieves this through dynamic computation graphs.
A computation graph is a directed graph where nodes represent operations or tensors, and edges represent data dependencies between them. When a tensor operation is executed, PyTorch records how the result was produced. This recorded structure allows gradients to be computed automatically during backpropagation.
The defining feature of PyTorch is that this graph is dynamic. The graph is created at runtime as Python code executes.
Static Versus Dynamic Graphs
Deep learning frameworks historically followed two major approaches.
| Approach | Graph construction | Example frameworks |
|---|---|---|
| Static graph | Define graph before execution | Early TensorFlow |
| Dynamic graph | Build graph during execution | PyTorch |
In a static graph system, the computation graph is declared first and executed later. The graph behaves like a compiled program.
In a dynamic graph system, the graph is built operation by operation as Python executes. Each forward pass constructs a new graph.
For example:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x * x + 3 * x + 1When this code runs, PyTorch dynamically records the operations:
x
├── multiply(x, x)
├── multiply(3, x)
└── add(...)The graph exists only because the operations were executed.
This gives PyTorch several advantages:
| Advantage | Meaning |
|---|---|
| Natural Python control flow | Use loops, branches, recursion |
| Easier debugging | Inspect intermediate tensors directly |
| Interactive experimentation | Works naturally in notebooks |
| Flexible architectures | Graph structure may vary per input |
Computation as a Graph
Suppose we define
The computation can be decomposed into simpler operations:
This forms a graph:
x
├── square ── a
├── multiply by 3 ── b
└──────────────┐
▼
add operations
▼
yEach node stores enough information for differentiation. During the backward pass, gradients propagate in reverse order through this graph.
In PyTorch:
x = torch.tensor(2.0, requires_grad=True)
a = x * x
b = 3 * x
y = a + b + 1
print(y)The tensor y remembers how it was constructed.
print(y.grad_fn)The field grad_fn references the operation that produced the tensor.
The Forward Pass
The forward pass computes outputs from inputs.
A neural network is a composition of functions:
For a simple multilayer network:
Each operation creates new tensors and graph nodes.
In PyTorch:
from torch import nn
model = nn.Sequential(
nn.Linear(4, 8),
nn.ReLU(),
nn.Linear(8, 2),
)
x = torch.randn(3, 4)
y = model(x)During execution, PyTorch dynamically constructs the graph associated with this forward computation.
The graph records:
- Tensor operations
- Parent tensors
- Operation types
- Information needed for differentiation
The Backward Pass
Training requires gradients of the loss with respect to parameters.
Suppose
We need
PyTorch computes these gradients using reverse-mode automatic differentiation, commonly called backpropagation.
The backward pass traverses the computation graph in reverse order.
Example:
x = torch.tensor(2.0, requires_grad=True)
y = x * x + 3 * x + 1
y.backward()
print(x.grad)The derivative is
At :
PyTorch computes this automatically.
The backward pass applies the chain rule repeatedly across the graph.
Reverse-Mode Automatic Differentiation
Deep learning models usually have many parameters and a scalar loss.
Suppose:
The input dimension may be millions or billions. The output is usually a scalar loss.
Reverse-mode differentiation is efficient in this setting because it computes gradients of the scalar output with respect to all parameters in one backward traversal.
The chain rule is the core principle.
If
then
PyTorch applies this rule automatically across the graph.
Leaf Tensors and Gradients
A tensor with requires_grad=True becomes part of gradient tracking.
x = torch.tensor([1.0, 2.0], requires_grad=True)Such tensors are called leaf tensors if they are created directly by the user rather than by another operation.
Gradients accumulate in .grad only for leaf tensors.
x = torch.tensor(2.0, requires_grad=True)
y = x * x
y.backward()
print(x.grad)Intermediate tensors usually do not store gradients unless explicitly requested.
Dynamic Graph Construction
The graph is rebuilt every forward pass.
This is important because the executed operations may depend on input data or control flow.
Example:
def f(x):
if x.sum() > 0:
return x * 2
else:
return x * xDifferent inputs may produce different computation graphs.
This flexibility makes PyTorch natural for:
- Recursive models
- Variable-length sequences
- Tree structures
- Dynamic routing
- Conditional computation
- Reinforcement learning environments
Static graph systems historically struggled with these patterns.
Control Flow in PyTorch
Because graphs are built dynamically, ordinary Python control flow works naturally.
Loops:
x = torch.tensor(1.0, requires_grad=True)
y = x
for _ in range(5):
y = y * 2
y.backward()
print(x.grad)Conditionals:
x = torch.tensor(-2.0, requires_grad=True)
if x.item() > 0:
y = x
else:
y = x * x
y.backward()Recursion also works naturally.
This is one reason PyTorch became popular in research environments.
Graph Lifetime
The computation graph exists only as long as needed for gradient computation.
After backward(), PyTorch frees most graph information to save memory.
Example:
x = torch.tensor(2.0, requires_grad=True)
y = x * x
y.backward()Calling backward() a second time fails because the graph has been freed:
# y.backward() # RuntimeErrorTo preserve the graph:
y.backward(retain_graph=True)Retaining graphs increases memory usage and should only be used when necessary.
Detaching Tensors
Sometimes a tensor should stop participating in gradient computation.
The detach() operation creates a tensor without gradient history.
x = torch.tensor(2.0, requires_grad=True)
y = x * x
z = y.detach()
print(z.requires_grad)This is useful when:
- Storing intermediate results
- Logging metrics
- Performing inference
- Preventing gradient flow
- Implementing target networks
Detached tensors share underlying storage but do not participate in autograd.
Disabling Gradient Tracking
Inference does not require gradients. Tracking them wastes memory and compute.
PyTorch provides torch.no_grad():
model.eval()
with torch.no_grad():
outputs = model(x)This disables graph construction inside the block.
Benefits:
| Benefit | Effect |
|---|---|
| Lower memory use | No graph storage |
| Faster execution | No autograd overhead |
| Cleaner inference | No accidental gradient tracking |
Modern PyTorch also provides torch.inference_mode() for further optimization.
In-Place Operations
PyTorch supports in-place tensor modification:
x.add_(1)The underscore convention indicates in-place modification.
In-place operations can reduce memory usage, but they are dangerous when tensors participate in autograd graphs.
Example:
x = torch.tensor(2.0, requires_grad=True)
y = x * x
# x.add_(1) # may break autogradChanging values needed for gradient computation can invalidate the graph.
For this reason, in-place operations should be used carefully.
Computational Graphs in Neural Networks
Consider a linear layer:
Suppose:
The forward pass computes:
The graph includes:
- Matrix multiplication
- Broadcasting
- Addition
- Activation functions
- Loss computation
For example:
model = nn.Linear(4, 2)
x = torch.randn(3, 4)
target = torch.randint(0, 2, (3,))
loss_fn = nn.CrossEntropyLoss()
logits = model(x)
loss = loss_fn(logits, target)
loss.backward()The graph links every parameter in model to the scalar loss.
After backpropagation:
print(model.weight.grad.shape)
print(model.bias.grad.shape)Gradients are tensors with the same shapes as the parameters.
Graph Depth and Memory
Deep models produce deep graphs.
Transformers with hundreds of layers may create enormous computation graphs during training. Activations from the forward pass must often be stored because backward computation needs them.
This leads to high memory usage.
Several techniques reduce memory cost:
| Technique | Idea |
|---|---|
| Mixed precision | Smaller tensor formats |
| Gradient checkpointing | Recompute activations |
| Activation offloading | Move tensors between devices |
| Tensor parallelism | Split tensors across GPUs |
Graph memory is one of the central constraints in large-scale deep learning.
Eager Execution
PyTorch originally emphasized eager execution. Operations execute immediately.
Example:
x = torch.tensor([1.0, 2.0])
y = x + 1
print(y)This feels like ordinary Python programming.
Advantages:
- Easy debugging
- Direct inspection
- Interactive development
- Natural control flow
However, eager execution may introduce overhead because operations are dispatched one at a time from Python.
Modern PyTorch combines eager programming with graph compilation through torch.compile.
Graph Compilation
PyTorch can trace and optimize execution graphs.
compiled_model = torch.compile(model)Compilation may:
- Fuse operations
- Remove Python overhead
- Optimize memory access
- Improve kernel scheduling
This bridges the gap between dynamic flexibility and static optimization.
The programmer still writes ordinary PyTorch code, but execution becomes more optimized internally.
Dynamic Graphs and Modern AI Systems
Dynamic computation graphs are especially important for modern AI systems because many tasks involve variable structure.
Examples include:
| Domain | Dynamic behavior |
|---|---|
| NLP | Variable sequence lengths |
| Reinforcement learning | Environment-dependent trajectories |
| Graph learning | Different graph sizes |
| Agents | Conditional tool use |
| Mixture-of-experts models | Input-dependent routing |
| Reasoning systems | Branching computation |
A static graph fixed in advance may not naturally express these behaviors.
Dynamic graph systems allow the executed computation to depend on the data itself.
Summary
A computation graph represents tensor operations and their dependencies. PyTorch builds this graph dynamically during execution. The graph records how outputs were computed from inputs and parameters.
During training, PyTorch traverses the graph backward to compute gradients using reverse-mode automatic differentiation. This enables efficient optimization of neural networks with millions or billions of parameters.
Dynamic graphs provide flexibility, natural Python control flow, and easy debugging. They are one of the defining design choices of PyTorch and a major reason for its adoption in modern deep learning research and production systems.