Comparative Architecture Analysis

The systems in this chapter show that automatic differentiation is not one implementation technique. It is a family of program transformations. Each system chooses a different representation of the program, a different execution model, and a different boundary between user code, compiler code, and runtime code.

ADIFOR and Tapenade treat AD as source transformation. TensorFlow and PyTorch treat AD as graph or tape execution over tensor primitives. JAX treats AD as a composable transformation over functional array programs. Zygote treats AD as a transformation over Julia IR. Enzyme lowers the problem further and transforms LLVM IR or MLIR. Tinygrad strips the design down to a minimal dynamic tensor graph.

The same mathematics appears in every system:

$$ \text{local derivative rules} + \text{chain rule} + \text{program dependency order}. $$

The architectural differences come from where the system captures that dependency order.
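The three ingredients can be seen in a hand-written derivative of y = sin(x * x). The sketch below is illustrative only: the function name `f_with_derivative` and the `_bar` naming for adjoints are assumptions made here, not taken from any of the systems discussed.

```python
import math

def f_with_derivative(x):
    # Program dependency order: a depends on x, y depends on a.
    a = x * x
    y = math.sin(a)

    # Local derivative rules, one per primitive operation.
    dy_da = math.cos(a)   # d sin(a) / da
    da_dx = 2.0 * x       # d (x * x) / dx

    # Chain rule, applied in reverse dependency order.
    y_bar = 1.0
    a_bar = dy_da * y_bar
    x_bar = da_dx * a_bar
    return y, x_bar
```

Every system in this chapter automates some version of these steps. What changes is whether the dependency order is read from source text, from an IR, or from a record of executed operations.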

Program Representation

The most important distinction is the representation being differentiated.

| System | Main representation | Typical user code |
|---|---|---|
| ADIFOR | Fortran source | legacy scientific programs |
| Tapenade | Fortran and C source | simulation and HPC codes |
| TensorFlow | tensor operation graph or tape | neural networks and tensor programs |
| PyTorch | dynamic tensor graph | interactive ML research |
| JAX | traced functional program and JAXPR | functional array programs |
| Zygote | Julia SSA IR | generic Julia numerical code |
| Enzyme | LLVM IR and MLIR | compiled multi-language programs |
| Tinygrad | small dynamic tensor graph | minimal deep learning programs |

A source transformer sees the program near the form written by the user. A compiler IR transformer sees a lower-level but more regular form. A tensor graph system sees only operations from its tensor library. A dynamic tape sees the actual executed path, but only for operations it records.

This choice determines what kinds of programs feel natural.

Mode Support

Most major systems support reverse mode because modern machine learning and optimization usually need gradients of scalar losses with respect to many parameters. Forward mode remains important for directional derivatives, Jacobian-vector products, sensitivity analysis, and higher-order constructions.

| System | Forward mode | Reverse mode | Primary emphasis |
|---|---|---|---|
| ADIFOR | yes | limited or secondary | forward source transformation |
| Tapenade | yes | yes | scientific tangent and adjoint code |
| TensorFlow | limited through APIs | yes | tensor reverse mode |
| PyTorch | increasing support | yes | dynamic reverse mode |
| JAX | yes | yes | composable JVP and VJP transformations |
| Zygote | mostly reverse | yes | Julia pullbacks |
| Enzyme | yes in some paths | yes | compiler-level reverse mode |
| Tinygrad | minimal | yes | small reverse-mode engine |

Forward mode propagates tangents with the computation. Reverse mode records or reconstructs enough of the computation to propagate adjoints backward. Systems differ mainly in how they store, reconstruct, or transform that information.
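Forward mode can be sketched with dual numbers, where each value carries its tangent through the computation. This is a minimal illustration under stated assumptions: the names `Dual`, `sin_dual`, and `jvp` are hypothetical and only addition, multiplication, and sine are covered.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    val: float        # primal value
    tan: float = 0.0  # tangent carried alongside the value

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule applied to the tangent component.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__

def _lift(x):
    # Plain numbers become constants with zero tangent.
    return x if isinstance(x, Dual) else Dual(float(x))

def sin_dual(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def jvp(f, x, v):
    # Jacobian-vector product of f at x in direction v.
    out = f(Dual(x, v))
    return out.val, out.tan
```

For example, `jvp(lambda x: sin_dual(x * x) + 3 * x, 2.0, 1.0)` returns the value together with the directional derivative. Nothing is stored for later, which is why forward mode has no tape or adjoint storage problem.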

Source Transformation vs Runtime Tracing

Source transformation produces new code before execution. Runtime tracing or tape systems record what happens during execution.

Source transformation has advantages:

| Advantage | Reason |
|---|---|
| compiler visibility | derivative code can be optimized ahead of time |
| auditability | generated code can be inspected |
| whole-program analysis | call graphs and data flow can be transformed |
| HPC integration | works with existing compilers and build systems |

Runtime tracing has different advantages:

| Advantage | Reason |
|---|---|
| flexibility | follows actual execution path |
| easier interaction | works naturally in notebooks and REPLs |
| dynamic models | handles data-dependent structure naturally |
| lower upfront compilation burden | graph built while running |

The cost of source transformation is compiler complexity. The cost of runtime tracing is runtime overhead and weaker whole-program optimization.
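The runtime side of this tradeoff can be illustrated with a very small tape. The sketch below is an assumption-laden toy, not the design of any system named here: `Tape`, its methods, and the function `f` are hypothetical names, and only scalar add and multiply are supported.

```python
class Tape:
    def __init__(self):
        self.values = []  # primal value of every variable, by index
        self.ops = []     # (output index, [(input index, local derivative), ...])

    def var(self, value):
        self.values.append(float(value))
        return len(self.values) - 1

    def add(self, a, b):
        out = self.var(self.values[a] + self.values[b])
        self.ops.append((out, [(a, 1.0), (b, 1.0)]))
        return out

    def mul(self, a, b):
        out = self.var(self.values[a] * self.values[b])
        self.ops.append((out, [(a, self.values[b]), (b, self.values[a])]))
        return out

    def grad(self, out):
        # Walk the recorded operations backward, accumulating adjoints.
        adj = [0.0] * len(self.values)
        adj[out] = 1.0
        for o, inputs in reversed(self.ops):
            for i, local in inputs:
                adj[i] += local * adj[o]
        return adj

def f(tape, x):
    # Data-dependent branch: only the executed path reaches the tape.
    if tape.values[x] > 0:
        return tape.mul(x, x)
    return tape.add(x, x)
```

With `x = 3.0` the tape holds a single multiply and the gradient is 6.0. With `x = -1.0` it holds a single add and the gradient is 2.0. The tape never sees the branch itself, which is the flexibility and also the limit on whole-program optimization.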

Tensor Graphs vs General Programs

TensorFlow, PyTorch, JAX, and Tinygrad mostly differentiate tensor programs. ADIFOR, Tapenade, Zygote, and Enzyme aim closer to general program differentiation.

Tensor graphs are easier to differentiate because the primitive set is controlled. Operations such as matrix multiplication, convolution, reduction, broadcasting, and normalization have known derivative rules.

General programs are harder because they contain:

| Feature | Difficulty for AD |
|---|---|
| mutation | old values may be needed in reverse mode |
| aliasing | multiple names may refer to the same memory |
| external calls | derivative semantics may be unknown |
| I/O | usually nondifferentiable |
| dynamic allocation | adjoint storage becomes complex |
| recursion | reverse execution needs stack structure |
| low-level pointers | activity analysis becomes harder |

A tensor graph system avoids many of these problems by restricting the differentiable world. A general AD system accepts more programs but must solve more compiler and runtime problems.

Dynamic vs Staged Execution

PyTorch and Tinygrad are dynamic by default. The graph is built as operations execute. JAX and TensorFlow can stage computations for compilation. Enzyme, Tapenade, and ADIFOR work ahead of execution.

| Execution style | Examples | Main benefit | Main cost |
|---|---|---|---|
| dynamic | PyTorch, Tinygrad | flexibility and debugging | overhead and fewer global optimizations |
| staged graph | TensorFlow, JAX | compilation and deployment | tracing semantics |
| source transformation | ADIFOR, Tapenade | explicit generated code | complex tooling |
| compiler IR transformation | Enzyme, Zygote | deep compiler integration | harder debugging |

Dynamic execution feels close to ordinary programming. Staged execution gives the system a larger region to optimize. Compiler transformation gives the system even more structural information, but users may need to understand compilation artifacts when something fails.
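The "tracing semantics" cost of staging can be shown with a toy tracer. This is a sketch under heavy assumptions: `Sym` and `stage` are hypothetical names, the tracer only handles multiplying or adding a Python constant, and it is not the mechanism of any real framework.

```python
class Sym:
    """Symbolic stand-in for an input; records operations instead of computing."""

    def __init__(self, ops):
        self.ops = ops  # list shared by every Sym produced during one trace

    def __mul__(self, c):
        self.ops.append(("mul", float(c)))
        return Sym(self.ops)

    def __add__(self, c):
        self.ops.append(("add", float(c)))
        return Sym(self.ops)

def stage(f):
    ops = []
    f(Sym(ops))  # the Python body of f runs exactly once, at trace time

    def run(x):
        # Replay the recorded operations; f itself is not called again.
        for kind, c in ops:
            x = x * c if kind == "mul" else x + c
        return x

    return run, ops

calls = []

def scaled(x):
    calls.append("python body executed")  # side effect visible only while tracing
    return x * 3.0 + 1.0

run, recorded = stage(scaled)
```

Calling `run(2.0)` and `run(5.0)` gives 7.0 and 16.0, yet `calls` has a single entry. The staged program is the recorded operation list, so anything in the Python body that is not a recorded operation happens at trace time only.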

Treatment of State

State is one of the sharpest dividing lines.

JAX pushes users toward explicit state passing. PyTorch permits mutation but guards many unsafe cases with version counters. Zygote prefers functional code and has historically struggled with mutation. Enzyme must reason about memory at the IR level. Tapenade and ADIFOR transform imperative programs directly, so they must handle state through data-flow and activity analysis.

| System | State model |
|---|---|
| JAX | explicit, functional state |
| PyTorch | mutable tensors with autograd checks |
| TensorFlow | variables plus graph semantics |
| Zygote | functional style preferred, mutation difficult |
| Enzyme | compiler memory analysis |
| Tapenade | procedural source analysis |
| ADIFOR | procedural source analysis |
| Tinygrad | simple tensor object state |

Reverse mode over mutation requires one of three things: saving the old values, recomputing them, or proving they are not needed. This is a core systems problem, not a syntactic inconvenience.
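The "saving old values" option can be sketched as a stack that is pushed before each overwrite and popped during the reverse sweep. The function names and the one-element list standing in for mutable memory are assumptions made for illustration.

```python
def forward_with_stack(x, steps):
    state = [float(x)]  # one-element list models a memory cell updated in place
    stack = []
    for _ in range(steps):
        stack.append(state[0])            # save the value about to be overwritten
        state[0] = state[0] * state[0]    # in-place update destroys the old value
    return state[0], stack

def reverse_with_stack(stack, out_bar=1.0):
    x_bar = out_bar
    while stack:
        old = stack.pop()                 # restore values in reverse order
        x_bar = 2.0 * old * x_bar         # local rule for x * x needs the old x
    return x_bar
```

After three squarings of 1.5 the result is 1.5 ** 8, and the reverse sweep returns 8 * 1.5 ** 7. Without the stack, the only surviving value would be the final one, and the local derivatives of the earlier steps could not be evaluated.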

Memory Strategy

Reverse mode needs forward-pass information during the backward pass. Every system must choose a memory strategy.

| Strategy | Used by | Tradeoff |
|---|---|---|
| tape storage | PyTorch, TensorFlow eager, Tinygrad | simple but memory-heavy |
| checkpointing | Tapenade, TensorFlow, PyTorch, JAX | saves memory, adds recomputation |
| compiler recomputation | Enzyme, JAX, Zygote | can reduce storage, needs analysis |
| explicit derivative arrays | ADIFOR forward mode | predictable but can be large |
| generated adjoint storage | Tapenade, Enzyme | efficient when analysis succeeds |

Memory is often the limiting factor in reverse-mode AD. A derivative program can be mathematically correct but unusable if it stores too much intermediate state.
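The storage versus recomputation tradeoff can be made concrete on a chain of repeated `sin` steps. The sketch below uses hypothetical function names and a simple fixed-interval checkpoint schedule, which is only one of many possible schedules.

```python
import math

def step(h):
    return math.sin(h)

def step_vjp(h_in, h_bar):
    # Adjoint of one step; needs the step's input value.
    return math.cos(h_in) * h_bar

def grad_store_all(x, n):
    saved, h = [], x
    for _ in range(n):
        saved.append(h)  # keep every intermediate input
        h = step(h)
    h_bar = 1.0
    for h_in in reversed(saved):
        h_bar = step_vjp(h_in, h_bar)
    return h_bar, len(saved)

def grad_checkpointed(x, n, every):
    checkpoints, h = {}, x
    for i in range(n):
        if i % every == 0:
            checkpoints[i] = h  # keep only every `every`-th input
        h = step(h)
    h_bar, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        # Recompute one segment forward from its checkpoint.
        segment, h = [], checkpoints[start]
        for _ in range(start, min(start + every, n)):
            segment.append(h)
            h = step(h)
        peak = max(peak, len(checkpoints) + len(segment))
        for h_in in reversed(segment):
            h_bar = step_vjp(h_in, h_bar)
    return h_bar, peak
```

For 100 steps with a checkpoint every 10 steps, both functions return the same gradient, but the checkpointed version holds at most 20 values at a time instead of 100, at the price of running each forward step a second time.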

Custom Derivatives

All practical AD systems need custom derivative rules.

They are needed when:

| Situation | Example |
|---|---|
| primitive is opaque | external C or Fortran function |
| default derivative is unstable | log-sum-exp, softmax, normalization |
| default derivative is inefficient | linear solvers, eigendecompositions |
| operation is approximate | iterative solver, projection, quantization |
| desired gradient differs from mathematical derivative | straight-through estimator |

The mechanism differs by system.

| System | Custom rule mechanism |
|---|---|
| TensorFlow | tf.custom_gradient |
| PyTorch | torch.autograd.Function |
| JAX | custom_jvp, custom_vjp |
| Zygote | ChainRules |
| Enzyme | rules and annotations for external functions |
| Tapenade | user-supplied derivative routines |
| ADIFOR | derivative specifications and transformed subroutines |
| Tinygrad | operation-level backward definitions |

Custom gradients are powerful but dangerous. They create a trusted boundary. Once a user supplies a rule, the AD system usually assumes it is correct.
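The unstable-derivative case and the trusted boundary can both be seen with softplus, log(1 + exp(x)). The sketch below is framework-free: the registry `CUSTOM_RULES` and the helper `value_and_grad` are hypothetical stand-ins for the mechanisms listed above, not the API of any of them.

```python
import math

def softplus(x):
    # Numerically stable evaluation of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_naive_grad(x):
    # Mathematically correct, but math.exp overflows for large x.
    return math.exp(x) / (1.0 + math.exp(x))

def softplus_custom_grad(x):
    # Same derivative (the sigmoid), arranged so exp never sees a large argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical registry: maps a primitive to its user-supplied derivative rule.
CUSTOM_RULES = {softplus: softplus_custom_grad}

def value_and_grad(f, x):
    # The rule is used as given; nothing checks it against f.
    return f(x), CUSTOM_RULES[f](x)
```

At `x = 1000.0` the naive rule raises `OverflowError`, while the registered rule returns 1.0. The helper would return a wrong gradient just as readily if the registered rule were wrong, which is the sense in which a custom rule is a trusted boundary.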

Performance Model

Performance depends on representation, mode, compiler access, and memory behavior.

| System family | Performance strength | Performance risk |
|---|---|---|
| source transformation | compiler-optimized generated code | code size and build complexity |
| tensor tape | simple reverse execution | memory and Python overhead |
| staged tensor compiler | fusion and accelerator optimization | tracing and recompilation cost |
| compiler IR AD | low-level optimization and multi-language support | aliasing and IR complexity |
| minimal dynamic engine | clarity and low conceptual overhead | limited kernel and distributed optimization |

There is no universally best architecture. A small neural network experiment, a production training and inference pipeline, a Fortran climate model, and a differentiable C++ simulator impose different constraints.

Architectural Lessons

The comparison suggests several durable design lessons.

First, AD systems are compiler systems even when they present themselves as libraries. They must analyze dependencies, transform programs, manage memory, and generate backward computations.

Second, reverse mode is a storage problem as much as a calculus problem. The derivative formulas are local and simple. The hard part is preserving exactly the values needed in the backward pass.

Third, purity helps. Functional programs are easier to differentiate, batch, compile, and parallelize. Mutation can be supported, but it raises the cost of analysis.

Fourth, restricting the primitive set improves reliability. Tensor frameworks succeed partly because they differentiate a controlled operation vocabulary rather than the whole host language.

Fifth, interoperability matters. Enzyme’s IR-level approach and Tapenade’s source-level approach address the same practical need: differentiating code that already exists outside machine learning frameworks.

Choosing an AD System

A practical selection can be framed by the program being differentiated.

| Program type | Suitable systems |
|---|---|
| legacy Fortran simulation | ADIFOR, Tapenade, Enzyme |
| C or C++ numerical kernel | Tapenade, Enzyme |
| Python deep learning model | PyTorch, TensorFlow, JAX |
| functional array program | JAX |
| Julia scientific code | Zygote, Enzyme, other Julia AD systems |
| educational autograd engine | Tinygrad |
| accelerator-heavy tensor workload | TensorFlow, JAX, PyTorch |
| compiler research | Enzyme, Zygote, JAX internals |

The choice is architectural. It depends less on the derivative rule for multiplication and more on program representation, state model, compiler access, and deployment environment.

Summary

The major AD systems form a spectrum.

At one end, Tinygrad and PyTorch expose dynamic graph reverse mode in a direct user-facing style. In between, TensorFlow and JAX stage tensor programs for optimization and accelerator execution, Zygote moves AD into a high-level language IR, and Tapenade and ADIFOR represent the classical source-transformation tradition for scientific codes. At the other end, Enzyme lowers AD into the compiler backend.

All of them implement the chain rule. Their differences show where each system chooses to locate the chain rule: in source code, in a runtime tape, in a tensor graph, in a functional IR, or in compiler IR.