Comparative Architecture Analysis

The systems in this chapter show that automatic differentiation is not one implementation technique. It is a family of program transformations. Each system chooses a different representation of the program, a different execution model, and a different boundary between user code, compiler code, and runtime code.

ADIFOR and Tapenade treat AD as source transformation. TensorFlow and PyTorch treat AD as graph or tape execution over tensor primitives. JAX treats AD as a composable transformation over functional array programs. Zygote treats AD as a transformation over Julia IR. Enzyme lowers the problem further and transforms LLVM IR or MLIR. Tinygrad strips the design down to a minimal dynamic tensor graph.

The same mathematics appears in every system:

$$ \text{local derivative rules} + \text{chain rule} + \text{program dependency order}. $$

The architectural differences come from where the system captures that dependency order.
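As a small illustration of the three ingredients above, the following sketch differentiates a hypothetical function, y = sin(x * x), by hand. The function and the names `forward_with_locals` and `chain` are assumptions made for this example and do not come from any of the systems discussed.

```python
import math

def forward_with_locals(x):
    """Evaluate y = sin(x * x) and collect the local derivative of each step."""
    u = x * x          # local rule: du/dx = 2x
    y = math.sin(u)    # local rule: dy/du = cos(u)
    # Local derivatives listed in dependency order: x -> u -> y.
    locals_in_order = [2.0 * x, math.cos(u)]
    return y, locals_in_order

def chain(locals_in_order):
    """Chain rule along a single path: multiply the local derivatives."""
    total = 1.0
    for d in locals_in_order:
        total *= d
    return total

y, local_derivs = forward_with_locals(1.5)
dy_dx = chain(local_derivs)
print(y, dy_dx)
```

Every system in this chapter automates some version of this bookkeeping. What differs is whether the list of steps is taken from source code, from an IR, or from a record of what ran.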

Program Representation

The most important distinction is the representation being differentiated.

System	Main representation	Typical user code
ADIFOR	Fortran source	legacy scientific programs
Tapenade	Fortran and C source	simulation and HPC codes
TensorFlow	tensor operation graph or tape	neural networks and tensor programs
PyTorch	dynamic tensor graph	interactive ML research
JAX	traced functional program and JAXPR	functional array programs
Zygote	Julia SSA IR	generic Julia numerical code
Enzyme	LLVM IR and MLIR	compiled multi-language programs
Tinygrad	small dynamic tensor graph	minimal deep learning programs

A source transformer sees the program near the form written by the user. A compiler IR transformer sees a lower-level but more regular form. A tensor graph system sees only operations from its tensor library. A dynamic tape sees the actual executed path, but only for operations it records.

This choice determines what kinds of programs feel natural.

Mode Support

Most major systems support reverse mode because modern machine learning and optimization usually need gradients of scalar losses with respect to many parameters. Forward mode remains important for directional derivatives, Jacobian-vector products, sensitivity analysis, and higher-order constructions.

System	Forward mode	Reverse mode	Primary emphasis
ADIFOR	yes	limited or secondary	forward source transformation
Tapenade	yes	yes	scientific tangent and adjoint code
TensorFlow	limited, through `tf.autodiff.ForwardAccumulator`	yes	tensor reverse mode
PyTorch	yes, but newer and less complete than reverse mode	yes	dynamic reverse mode
JAX	yes	yes	composable JVP and VJP transformations
Zygote	limited	yes	Julia pullbacks
Enzyme	yes in some paths	yes	compiler-level reverse mode
Tinygrad	minimal	yes	small reverse-mode engine

Forward mode propagates tangents with the computation. Reverse mode records or reconstructs enough of the computation to propagate adjoints backward. Systems differ mainly in how they store, reconstruct, or transform that information.
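Forward-mode tangent propagation can be sketched with dual numbers. The `Dual` class, the `_lift` helper, and the test function `f` below are hypothetical names for this illustration, and only addition, multiplication, and sine are covered.

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    value: float    # primal value
    tangent: float  # derivative carried alongside the value

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule applied to the tangents.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)
    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)

def f(x):
    return x * x + 3 * sin(x)

# Seed the input tangent with 1.0 to obtain df/dx at x = 2.0.
out = f(Dual(2.0, 1.0))
print(out.value, out.tangent)
```

Nothing is stored for later here: the tangent travels with the value, which is why forward mode has no tape but costs one pass per input direction.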

Source Transformation vs Runtime Tracing

Source transformation produces new code before execution. Runtime tracing or tape systems record what happens during execution.

Source transformation has advantages:

Advantage	Reason
compiler visibility	derivative code can be optimized ahead of time
auditability	generated code can be inspected
whole-program analysis	call graphs and data flow can be transformed
HPC integration	works with existing compilers and build systems

Runtime tracing has different advantages:

Advantage	Reason
flexibility	follows actual execution path
easier interaction	works naturally in notebooks and REPLs
dynamic models	handles data-dependent structure naturally
lower upfront compilation burden	graph built while running

The cost of source transformation is compiler complexity. The cost of runtime tracing is runtime overhead and weaker whole-program optimization.
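The runtime-tape side of this tradeoff can be sketched in a few lines. The following is a minimal illustration under simplifying assumptions (scalars only, two operations, one input), with hypothetical names `Var`, `f`, and `grad`. It is not how PyTorch or Tinygrad is implemented.

```python
class Var:
    """A scalar that appends a backward closure to a shared tape per operation."""
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        def backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        self.tape.append(backward)
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        self.tape.append(backward)
        return out

def f(x):
    # Data-dependent branch: only the path actually executed is recorded.
    if x.value > 0:
        return x * x
    return x + x

def grad(fn, x0):
    tape = []
    x = Var(x0, tape)
    y = fn(x)
    y.grad = 1.0
    for backward in reversed(tape):  # replay the tape in reverse order
        backward()
    return x.grad, len(tape)

print(grad(f, 3.0))   # (6.0, 1)
print(grad(f, -2.0))  # (2.0, 1)
```

The branch costs nothing to support because the tape never sees it. The price is that the tape is rebuilt on every call and the system never observes the branch that was not taken.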

Tensor Graphs vs General Programs

TensorFlow, PyTorch, JAX, and Tinygrad mostly differentiate tensor programs. ADIFOR, Tapenade, Zygote, and Enzyme aim at something closer to general-program differentiation.

Tensor graphs are easier to differentiate because the primitive set is controlled. Operations such as matrix multiplication, convolution, reduction, broadcasting, and normalization have known derivative rules.

General programs are harder because they contain:

Feature	Difficulty for AD
mutation	old values may be needed in reverse mode
aliasing	multiple names may refer to the same memory
external calls	derivative semantics may be unknown
I/O	usually nondifferentiable
dynamic allocation	adjoint storage becomes complex
recursion	reverse execution needs stack structure
low-level pointers	activity analysis becomes harder

A tensor graph system avoids many of these problems by restricting the differentiable world. A general AD system accepts more programs but must solve more compiler and runtime problems.
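One way to picture a restricted differentiable world is a closed table of primitives, each paired with a derivative rule. The sketch below uses scalars rather than tensors. The table `VJP_RULES` and the function `vjp_of` are hypothetical names chosen for this example.

```python
# Each rule maps (inputs..., upstream gradient) to gradients for the inputs.
VJP_RULES = {
    "add": lambda a, b, g: (g, g),
    "mul": lambda a, b, g: (b * g, a * g),
}

def vjp_of(op_name):
    """Look up the derivative rule for a primitive, rejecting anything else."""
    try:
        return VJP_RULES[op_name]
    except KeyError:
        raise NotImplementedError(
            f"no derivative rule registered for {op_name!r}")

print(vjp_of("mul")(2.0, 5.0, 1.0))  # (5.0, 2.0)
```

Anything outside the table fails loudly instead of being differentiated incorrectly. A general-program system cannot rely on that closed boundary and has to analyze whatever the host language allows.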

Dynamic vs Staged Execution

PyTorch and Tinygrad are dynamic by default. The graph is built as operations execute. JAX and TensorFlow can stage computations for compilation. Enzyme, Tapenade, and ADIFOR work ahead of execution.

Execution style	Examples	Main benefit	Main cost
dynamic	PyTorch, Tinygrad	flexibility and debugging	overhead and fewer global optimizations
staged graph	TensorFlow, JAX	compilation and deployment	tracing semantics
source transformation	ADIFOR, Tapenade	explicit generated code	complex tooling
compiler IR transformation	Enzyme, Zygote	deep compiler integration	harder debugging

Dynamic execution feels close to ordinary programming. Staged execution gives the system a larger region to optimize. Compiler transformation gives the system even more structural information, but users may need to understand compilation artifacts when something fails.
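Staging can be illustrated by running a function once on symbolic placeholders, recording a small straight-line program, and then evaluating that program on real inputs. This is a simplified sketch with hypothetical names (`Tracer`, `trace`, `run`). It is loosely in the spirit of a traced IR and is not the actual JAXPR or TensorFlow graph format.

```python
class Tracer:
    """Symbolic stand-in for a value; operations emit instructions."""
    def __init__(self, name, program):
        self.name = name
        self.program = program

    def _emit(self, op, other):
        out = Tracer(f"v{len(self.program)}", self.program)
        self.program.append((out.name, op, self.name, other.name))
        return out

    def __add__(self, other):
        return self._emit("add", other)

    def __mul__(self, other):
        return self._emit("mul", other)

def trace(fn):
    """Run fn once on a tracer and return the recorded program."""
    program = []
    out = fn(Tracer("x", program))
    return program, out.name

def run(program, out_name, x_value):
    """Interpret the recorded program on a concrete input."""
    env = {"x": x_value}
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for name, op, a, b in program:
        env[name] = ops[op](env[a], env[b])
    return env[out_name]

def f(x):
    return x * x + x

program, out_name = trace(f)
print(program)                      # [('v0', 'mul', 'x', 'x'), ('v1', 'add', 'v0', 'x')]
print(run(program, out_name, 3.0))  # 12.0
```

Once the whole program is available as data, a system can optimize or differentiate it as a unit. The cost is the tracing semantics listed in the table: in this sketch the tracer has no concrete value, so a Python `if` that compares `x` to a number would fail while tracing.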

Treatment of State

State is one of the sharpest dividing lines.

JAX pushes users toward explicit state passing. PyTorch permits mutation but guards many unsafe cases with version counters. Zygote prefers functional code and has historically struggled with mutation. Enzyme must reason about memory at the IR level. Tapenade and ADIFOR transform imperative programs directly, so they must handle state through data-flow and activity analysis.

System	State model
JAX	explicit, functional state
PyTorch	mutable tensors with autograd checks
TensorFlow	variables plus graph semantics
Zygote	functional style preferred, mutation difficult
Enzyme	compiler memory analysis
Tapenade	procedural source analysis
ADIFOR	procedural source analysis
Tinygrad	simple tensor object state

Reverse mode over mutation requires either saving old values, recomputing them, or proving they are not needed. This is a core systems problem, not a syntactic inconvenience.
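The hazard can be shown with a backward rule that reads its input after that input has been overwritten. The sketch below adds a version check in the general spirit of the version counters mentioned above. `Buffer`, `set_`, and `square` are hypothetical names, and the code is not PyTorch's mechanism.

```python
class Buffer:
    """Mutable scalar storage with a counter bumped on every in-place write."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def set_(self, value):
        self.value = value
        self.version += 1

def square(buf, tape):
    saved_version = buf.version
    out = buf.value * buf.value
    def backward(grad_out):
        # The rule d(x*x)/dx = 2x needs the value x had during the forward pass.
        if buf.version != saved_version:
            raise RuntimeError("saved input was modified in place")
        return 2.0 * buf.value * grad_out
    tape.append(backward)
    return out

tape = []
x = Buffer(3.0)
y = square(x, tape)
ok_grad = tape[0](1.0)   # 6.0, computed before any mutation

x.set_(10.0)             # in-place update after the forward pass
try:
    tape[0](1.0)
    detected = False
except RuntimeError:
    detected = True
print(ok_grad, detected)
```

Without the check, the second backward call would silently return 20.0. Copying the old value into the closure would make that call correct at the cost of extra memory, which is the save-versus-prove choice described above.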

Memory Strategy

Reverse mode needs forward-pass information during the backward pass. Every system must choose a memory strategy.

Strategy	Used by	Tradeoff
tape storage	PyTorch, TensorFlow eager, Tinygrad	simple but memory-heavy
checkpointing	Tapenade, TensorFlow, PyTorch, JAX	saves memory, adds recomputation
compiler recomputation	Enzyme, JAX, Zygote	can reduce storage, needs analysis
explicit derivative arrays	ADIFOR forward mode	predictable but can be large
generated adjoint storage	Tapenade, Enzyme	efficient when analysis succeeds

Memory is often the limiting factor in reverse-mode AD. A derivative program can be mathematically correct but unusable if it stores too much intermediate state.
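The storage versus recomputation tradeoff can be made concrete on a chain of `n` identical steps. The sketch below compares storing every intermediate with storing one checkpoint every `k` steps and recomputing inside each segment. The step function (sine) and all names are assumptions for this example, and real systems choose checkpoints with more sophisticated schedules.

```python
import math

def step(h):
    return math.sin(h)

def step_vjp(h, g):
    """Backward rule for one step; needs the step's input h."""
    return math.cos(h) * g

def grad_store_all(x, n):
    hs = [x]
    for _ in range(n):
        hs.append(step(hs[-1]))
    g = 1.0
    for i in reversed(range(n)):
        g = step_vjp(hs[i], g)
    return g, len(hs)  # number of stored values

def grad_checkpointed(x, n, k):
    checkpoints = {}
    h = x
    for i in range(n):
        if i % k == 0:
            checkpoints[i] = h  # keep only every k-th step input
        h = step(h)
    g = 1.0
    peak = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + k, n)
        # Recompute the inputs of this segment from its checkpoint.
        seg = [checkpoints[start]]
        for _ in range(start + 1, end):
            seg.append(step(seg[-1]))
        peak = max(peak, len(checkpoints) + len(seg))
        for h_i in reversed(seg):
            g = step_vjp(h_i, g)
    return g, peak

print(grad_store_all(0.7, 12))
print(grad_checkpointed(0.7, 12, 4))
```

Both functions return the same gradient. With `n = 12` and `k = 4`, the first holds 13 values and the second holds at most 7, and pays for that by running the forward steps inside each segment a second time.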

Custom Derivatives

All practical AD systems need custom derivative rules.

They are needed when:

Situation	Example
primitive is opaque	external C or Fortran function
default derivative is unstable	log-sum-exp, softmax, normalization
default derivative is inefficient	linear solvers, eigendecompositions
operation is approximate	iterative solver, projection, quantization
desired gradient differs from mathematical derivative	straight-through estimator

The mechanism differs by system.

System	Custom rule mechanism
TensorFlow	`tf.custom_gradient`
PyTorch	`torch.autograd.Function`
JAX	`custom_jvp`, `custom_vjp`
Zygote	ChainRules
Enzyme	rules and annotations for external functions
Tapenade	user-supplied derivative routines
ADIFOR	derivative specifications and transformed subroutines
Tinygrad	operation-level backward definitions

Custom gradients are powerful but dangerous. They create a trusted boundary. Once a user supplies a rule, the AD system usually assumes it is correct.
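The "unstable default derivative" case can be sketched with softplus, log(1 + exp(x)). Differentiating the formula step by step overflows for large inputs, while a hand-written rule does not. Both function names are hypothetical, and the code uses no framework API: it only mimics what a custom rule would supply.

```python
import math

def softplus_naive_with_grad(x):
    """Value and derivative obtained by differentiating the formula as written."""
    e = math.exp(x)          # raises OverflowError for large x
    y = math.log(1.0 + e)
    dy_dx = e / (1.0 + e)
    return y, dy_dx

def softplus_custom_with_grad(x):
    """Hand-written rule: same mathematics, rearranged to avoid overflow."""
    y = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    if x >= 0:
        dy_dx = 1.0 / (1.0 + math.exp(-x))   # sigmoid, stable for x >= 0
    else:
        e = math.exp(x)
        dy_dx = e / (1.0 + e)                # sigmoid, stable for x < 0
    return y, dy_dx

print(softplus_custom_with_grad(1000.0))  # (1000.0, 1.0)
try:
    softplus_naive_with_grad(1000.0)
except OverflowError as err:
    print("naive rule failed:", err)
```

The custom rule is trusted in exactly the sense described above: nothing verifies that it matches the forward computation. Comparing it against the naive rule or a finite difference at moderate inputs is one inexpensive way to audit that boundary.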

Performance Model

Performance depends on representation, mode, compiler access, and memory behavior.

System family	Performance strength	Performance risk
source transformation	compiler-optimized generated code	code size and build complexity
tensor tape	simple reverse execution	memory and Python overhead
staged tensor compiler	fusion and accelerator optimization	tracing and recompilation cost
compiler IR AD	low-level optimization and multi-language support	aliasing and IR complexity
minimal dynamic engine	clarity and low conceptual overhead	limited kernel and distributed optimization

There is no universally best architecture. A small neural network experiment, a production training and inference pipeline, a Fortran climate model, and a differentiable C++ simulator impose different constraints.

Architectural Lessons

The comparison suggests several durable design lessons.

First, AD systems are compiler systems even when they present themselves as libraries. They must analyze dependencies, transform programs, manage memory, and generate backward computations.

Second, reverse mode is a storage problem as much as a calculus problem. The derivative formulas are local and simple. The hard part is preserving exactly the values needed in the backward pass.

Third, purity helps. Functional programs are easier to differentiate, batch, compile, and parallelize. Mutation can be supported, but it raises the cost of analysis.

Fourth, restricting the primitive set improves reliability. Tensor frameworks succeed partly because they differentiate a controlled operation vocabulary rather than the whole host language.

Fifth, interoperability matters. Enzyme’s IR-level approach and Tapenade’s source-level approach address the same practical need: differentiating code that already exists outside machine learning frameworks.

Choosing an AD System

A practical selection can be framed by the program being differentiated.

Program type	Suitable systems
legacy Fortran simulation	ADIFOR, Tapenade, Enzyme
C or C++ numerical kernel	Tapenade, Enzyme
Python deep learning model	PyTorch, TensorFlow, JAX
functional array program	JAX
Julia scientific code	Zygote, Enzyme, other Julia AD systems
educational autograd engine	Tinygrad
accelerator-heavy tensor workload	TensorFlow, JAX, PyTorch
compiler research	Enzyme, Zygote, JAX internals

The choice is architectural. It depends less on the derivative rule for multiplication and more on program representation, state model, compiler access, and deployment environment.

Summary

The major AD systems form a spectrum.

At one end, Tinygrad and PyTorch expose dynamic graph reverse mode in a direct user-facing style. TensorFlow and JAX stage tensor programs for optimization and accelerator execution. Zygote moves AD into a high-level language IR. Tapenade and ADIFOR represent the classical source-transformation tradition for scientific codes. Enzyme lowers AD into the compiler backend.

All of them implement the chain rule. Their differences show where each system chooses to locate the chain rule: in source code, in a runtime tape, in a tensor graph, in a functional IR, or in compiler IR.