Complexity of Higher Orders

Higher-order automatic differentiation faces a fundamental problem: derivative structure grows combinatorially with order.

Complexity of Higher Orders

Higher-order automatic differentiation faces a fundamental problem: derivative structure grows combinatorially with order.

For a function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the first derivative is a vector, the second derivative is a matrix, and higher derivatives become tensors.

The size growth is severe:

Order Object Dense size
1 gradient $n$
2 Hessian $n^2$
3 third-order tensor $n^3$
4 fourth-order tensor $n^4$

This growth is not an implementation artifact. It is an intrinsic property of higher-order multilinear structure.

Efficient higher-order AD therefore depends on avoiding explicit tensor materialization whenever possible.

Derivative Tensors

The $k$-th derivative of a scalar function is a symmetric multilinear map:

$$ D^k f(x) : (\mathbb{R}^n)^k \to \mathbb{R}. $$

In coordinates:

$$ (D^k f(x)){i_1 \cdots i_k} = \frac{ \partial^k f }{ \partial x{i_1}\cdots\partial x_{i_k} }. $$

Without symmetry, this tensor has:

$$ n^k $$

entries.

For even moderate values:

$n$ $k$ entries
100 2 $10^4$
100 3 $10^6$
100 4 $10^8$
1000 3 $10^9$

Explicit storage rapidly becomes infeasible.

Symmetry Reduces but Does Not Eliminate Growth

For smooth scalar functions, higher-order derivative tensors are symmetric under permutation of indices.

The number of unique entries becomes:

$$ \binom{n+k-1}{k}. $$

Examples:

Order Symmetric entries
2 $\frac{n(n+1)}{2}$
3 $\binom{n+2}{3}$
4 $\binom{n+3}{4}$

For fixed $k$, this scales approximately as:

$$ O(n^k/k!). $$

Symmetry helps substantially, but the growth remains polynomial of degree $k$.

Forward Mode Complexity

Forward mode propagates tangent information alongside primal values.

For first-order derivatives:

Quantity Cost
one directional derivative $O(C)$
full Jacobian $O(nC)$

where $C$ is the cost of evaluating the primal program.

For higher orders, nested forward mode introduces higher-dimensional tangent structure.

A naïve $k$-th order forward representation may require:

$$ O(n^k) $$

derivative components per intermediate value.

This creates both computation and storage growth.

Taylor Mode Complexity

Taylor mode computes truncated series coefficients.

Suppose Taylor order is:

$$ p. $$

Each intermediate variable stores:

$$ a_0, a_1, \ldots, a_p. $$

For multiplication, coefficients satisfy convolution recurrences:

$$ c_k = \sum_{i=0}^k a_i b_{k-i}. $$

A single multiplication therefore costs:

$$ O(p^2). $$

If the primal program costs $C$, Taylor mode typically costs approximately:

$$ O(p^2 C). $$

Memory cost is:

$$ O(pN), $$

where $N$ is the number of intermediates.

This is much better than explicit tensor construction for directional higher derivatives.

Reverse Mode Complexity

Reverse mode computes gradients efficiently for scalar-output functions.

For first order:

Quantity Cost
gradient $O(C)$
memory proportional to saved intermediates

Higher-order reverse mode is harder.

Differentiating the backward pass introduces:

Source Complexity growth
nested tapes additional storage
adjoint differentiation larger programs
saved residuals increased memory
recomputation increased runtime
mutation tracking bookkeeping overhead

In the worst case, higher-order reverse mode may grow exponentially in derivative order if implemented naïvely.

Practical systems rely heavily on checkpointing, operator formulations, and mixed-mode transforms.

Hessian Complexity

The Hessian is the first practically important higher-order object.

Dense Hessian complexity:

Quantity Complexity
storage $O(n^2)$
explicit construction $O(n^2)$ output
Hessian-vector product $O(C)$

The gap between explicit Hessians and Hessian-vector products is critical.

Suppose:

$$ n = 10^6. $$

Then:

Object Approximate size
gradient $10^6$ entries
Hessian $10^{12}$ entries

The Hessian is impossible to materialize densely. But Hessian-vector products remain practical.

This motivates matrix-free methods.

Operator Complexity

Most scalable higher-order methods expose derivative operators rather than tensors.

Examples:

Operator Meaning
$Jv$ Jacobian-vector product
$v^\top J$ vector-Jacobian product
$Hv$ Hessian-vector product
$D^k f[v,\ldots,v]$ directional kth derivative

The complexity depends on the number of directions rather than the full tensor size.

This often reduces exponential tensor growth to linear or polynomial directional growth.

Complexity of Directional Derivatives

A directional derivative compresses tensor structure into one direction.

For example:

$$ D^k f(x)[v,\ldots,v]. $$

This scalar quantity avoids storing the full tensor.

Taylor mode computes repeated directional derivatives efficiently:

Quantity Approximate cost
order $p$ directional expansion $O(p^2 C)$

This is far smaller than:

$$ O(n^p) $$

tensor construction.

Sparse Higher-Order Structure

Many programs have sparse derivative structure.

Suppose variables interact only locally in the computational graph.

Then many higher-order tensor entries vanish.

Sparse methods reduce complexity using:

Method Benefit
dependency analysis avoid zero entries
graph coloring reduce directional passes
compressed recovery reconstruct sparse tensors
block decomposition localized interactions
symbolic sparsity propagation structural inference

For structured scientific problems, sparsity often changes the practical complexity class completely.

Complexity of Mixed Modes

Mixed-mode AD combines forward and reverse transforms strategically.

Example:

Operation Efficient mode
gradient reverse
Jacobian-vector product forward
Hessian-vector product forward-over-reverse
vector-Hessian-vector nested directional
directional kth derivative Taylor mode

The complexity depends strongly on the input-output geometry.

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

Then:

Mode Cost scaling
forward proportional to input directions
reverse proportional to output directions

Efficient higher-order systems choose the nesting that minimizes active derivative dimensionality.

Memory Complexity

Higher-order AD often becomes memory-bound before compute-bound.

A reverse-mode system must save intermediate values for the backward pass.

Higher-order reverse mode may need:

Memory object Purpose
primal values backward rules
tangents nested forward levels
cotangents nested reverse levels
tapes replay structure
residuals saved intermediates
checkpoint metadata recomputation planning

Memory can grow rapidly with nesting depth.

Checkpointing reduces storage at the cost of recomputation.

Checkpointing Complexity

Suppose a program has length:

$$ N. $$

Naïve reverse mode stores all intermediates:

$$ O(N) $$

memory.

Checkpointing trades storage for recomputation.

For higher-order AD, checkpointing becomes recursive.

The optimal tradeoff depends on:

Parameter Influence
nesting depth derivative levels
recomputation cost replay overhead
tape size memory pressure
residual size saved state
hardware bandwidth memory throughput

Advanced checkpointing schedules can reduce memory asymptotically.

Curse of Dimensionality

Higher-order derivatives suffer from the curse of dimensionality.

For dimension $n$ and derivative order $k$:

$$ \text{tensor size} \sim n^k. $$

This is unavoidable for arbitrary dense tensors.

Efficient methods therefore rely on structure:

Structure Compression
symmetry remove permutations
sparsity remove zeros
low rank factorize tensors
directional evaluation avoid explicit tensors
locality block decomposition
approximate geometry compressed representations

Without structure, high-order derivatives become computationally impossible at scale.

Neural Networks

Large neural networks illustrate these limits clearly.

Suppose a model has:

$$ 10^9 $$

parameters.

Then:

Quantity Size
gradient feasible
Hessian diagonal difficult
dense Hessian impossible
third-order tensor meaningless computationally

Practical deep learning therefore uses:

Technique Reason
gradients scalable
Hessian-vector products scalable curvature
low-rank approximations compressed geometry
Fisher approximations probabilistic structure
diagonal approximations memory reduction

Full higher-order tensors are almost never constructed.

Complexity Lower Bounds

Some costs are unavoidable.

A dense Hessian has:

$$ O(n^2) $$

output size.

No algorithm can output all entries in subquadratic time.

Similarly, a dense $k$-th derivative tensor requires:

$$ O(n^k) $$

output size.

Efficient AD methods do not defeat these lower bounds. They avoid them by changing the computational goal.

Instead of asking for the tensor itself, they ask for:

Reduced goal Complexity
tensor contraction lower
directional derivative lower
operator application lower
sparse recovery structure-dependent
local approximation compressed

This shift is fundamental.

Practical Complexity Rules

A useful engineering summary is:

Task Scalable? Preferred representation
gradient yes vector
Jacobian sometimes sparse/operator
Hessian rarely dense operator
third-order tensor usually no contractions
repeated directional derivatives yes Taylor mode
implicit higher-order derivatives often linear solves

Efficient higher-order AD is primarily about selecting the correct representation.

Conceptual Perspective

Higher-order derivatives encode local multilinear geometry.

The geometry itself grows combinatorially with dimension and order. Automatic differentiation cannot eliminate that mathematical reality.

What AD can do is preserve only the derivative information actually needed by downstream algorithms.

Scalable higher-order AD therefore depends less on raw differentiation mechanics and more on representation theory:

$$ \text{Which parts of the multilinear structure must be made explicit?} $$

That question determines the true complexity of higher-order automatic differentiation.