Efficient Higher-Order Methods

Higher-order derivatives contain rich geometric information, but naïve computation quickly becomes impractical.

Efficient Higher-Order Methods

Higher-order derivatives contain rich geometric information, but naïve computation quickly becomes impractical.

For a function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the size of derivative objects grows rapidly:

Derivative Size
gradient $n$
Hessian $n^2$
third-order tensor $n^3$
fourth-order tensor $n^4$

Even when computation is mathematically possible, explicit tensor construction is often infeasible.

Efficient higher-order methods therefore focus on three goals:

Goal Strategy
avoid explicit tensors operator formulations
exploit structure sparsity, symmetry, low rank
reduce recomputation compositional derivative transforms

The central principle is simple:

Higher-order information should usually be treated as an operator, not as a dense tensor.

Derivative Operators Instead of Tensors

The full $k$-th derivative is a multilinear map:

$$ D^k f(x) : (\mathbb{R}^n)^k \to \mathbb{R}. $$

Instead of materializing all coefficients, efficient methods apply the derivative to directions.

Examples:

Object Efficient form
Hessian $Hv$
quadratic curvature $v^\top Hv$
third derivative tensor $D^3f(x)[u,v,w]$
repeated directional derivative $D^kf(x)[v,\ldots,v]$

These directional forms avoid combinatorial explosion.

Hessian-Vector Products

The most important efficient second-order primitive is:

$$ Hv = \nabla^2 f(x)v. $$

As discussed earlier, this can be computed using forward-over-reverse AD:

Hv = jvp(grad(f), x, v)

The key point is that complexity becomes linear in problem size rather than quadratic in tensor size.

This pattern generalizes.

Tensor-Vector Contractions

Suppose:

$$ T = D^3 f(x). $$

The full tensor has $n^3$ entries.

But many applications only need contractions such as:

$$ T[u,v] = D^3f(x)[u,v,\cdot]. $$

or:

$$ D^3f(x)[v,v,v]. $$

These directional contractions are far cheaper than building the tensor explicitly.

The general strategy is:

  1. linearize,
  2. apply derivative transforms,
  3. contract immediately,
  4. discard intermediate tensor structure.

Repeated Directional Derivatives

A repeated directional derivative is:

$$ D^k f(x)[v,\ldots,v]. $$

Taylor mode computes these efficiently.

Define:

$$ g(t)=f(x+tv). $$

Then:

$$ g^{(k)}(0) = D^k f(x)[v,\ldots,v]. $$

So the multidimensional problem reduces to a one-dimensional Taylor expansion.

This is often vastly cheaper than constructing the full derivative tensor.

Hyper-Dual Numbers

Hyper-dual numbers extend dual numbers to second-order derivatives.

Ordinary dual numbers use:

$$ a + b\epsilon, \quad \epsilon^2 = 0. $$

Hyper-dual numbers introduce two infinitesimals:

$$ a + b\epsilon_1 + c\epsilon_2 + d\epsilon_1\epsilon_2. $$

with:

$$ \epsilon_1^2 = 0, \quad \epsilon_2^2 = 0. $$

The mixed coefficient

$$ d $$

captures second-order mixed derivatives exactly.

This gives numerically stable Hessian computation without finite differences.

Hyper-duals are especially attractive for:

Use case Reason
small scientific models simple implementation
exact second derivatives no subtraction cancellation
verification mathematically clean
educational systems explicit algebraic structure

They become expensive at high order because infinitesimal combinations grow combinatorially.

Jet Spaces

A jet stores derivatives up to finite order.

A $k$-jet contains:

$$ (f, Df, D^2f, \ldots, D^k f). $$

Taylor mode can be interpreted as propagating jets through programs.

Efficient jet methods compress derivative structure:

Compression Idea
directional jets store derivatives only along directions
sparse jets exploit missing interactions
symmetric jets reuse tensor symmetry
truncated jets ignore high-order tails

Jets provide a principled algebraic framework for higher-order AD.

Exploiting Symmetry

Higher-order derivative tensors are symmetric for smooth scalar functions.

For example:

$$ \frac{\partial^3 f} {\partial x_i \partial x_j \partial x_k} $$

is invariant under index permutation.

Without symmetry, a third-order tensor has:

$$ n^3 $$

entries.

With symmetry, the number of unique entries becomes:

$$ \binom{n+2}{3}. $$

Similarly:

Order Dense entries Symmetric unique entries
2 $n^2$ $\frac{n(n+1)}{2}$
3 $n^3$ $\binom{n+2}{3}$
4 $n^4$ $\binom{n+3}{4}$

Efficient higher-order systems exploit this aggressively.

Sparsity

Many higher-order tensors are sparse.

A derivative entry vanishes when variables do not interact through the computation graph.

Example:

$$ f(x)=\sum_i x_i^4. $$

Mixed derivatives between unrelated variables are zero.

Sparse methods use:

Technique Purpose
graph analysis infer interactions
compressed seeding combine derivative directions
coloring reduce passes
block decomposition structured recovery
symbolic dependency tracking detect zero structure

Sparse higher-order methods are essential in scientific optimization and PDE systems.

Checkpointing

Higher-order reverse mode can require enormous memory.

Checkpointing trades memory for recomputation.

Instead of storing every intermediate:

  1. save selected checkpoints,
  2. recompute missing regions when needed,
  3. reconstruct derivative information lazily.

Higher-order checkpointing is harder because nested differentiation introduces multiple derivative levels.

A checkpointing system must track:

Object Requirement
primal values recomputable
derivative values level-aware
tapes scoped correctly
residuals preserved consistently
nested transforms replay-safe

Efficient higher-order AD depends heavily on sophisticated checkpoint scheduling.

Lazy Linearization

A useful abstraction is lazy linearization.

Instead of constructing full derivative objects immediately:

linearized = linearize(f, x)

the system returns a derivative operator.

Applying the operator computes derivatives only when requested:

linearized.apply(v)

This delays work and avoids unnecessary tensor construction.

Many modern AD systems internally treat derivatives this way even when exposing eager APIs.

Source Transformation vs Operator Overloading

Higher-order methods benefit strongly from source transformation.

Operator overloading systems dynamically create nested derivative objects:

Dual< Dual< float > >

or:

Adjoint< Adjoint< T > >

This is flexible but can create large runtime overhead.

Source transformation systems instead rewrite the program structurally.

Advantages include:

Advantage Reason
fusion eliminate intermediate derivative objects
static analysis optimize nesting
memory planning explicit residual control
sparsity analysis visible dependency graph
staging separate derivative levels

Compiler-based systems therefore scale better for advanced higher-order methods.

Mixed-Mode Strategies

Efficient systems often combine AD modes strategically.

Task Preferred strategy
gradient reverse mode
Jacobian-vector product forward mode
vector-Jacobian product reverse mode
Hessian-vector product forward-over-reverse
repeated directional derivatives Taylor mode
sparse Hessian recovery compressed forward mode
local curvature estimation directional contractions

No single AD mode dominates every workload.

Efficient higher-order AD is largely about choosing the correct composition.

Matrix-Free Methods

A recurring principle is matrix-free computation.

Instead of:

$$ H, \quad D^3f, \quad D^4f, $$

efficient systems expose operators:

matvec(v)
tensor_contract(u, v)
directional_derivative(v)

This supports:

Algorithm Uses
Krylov solvers Hessian-vector products
Newton-CG implicit Hessian
Lanczos curvature spectrum
trust-region methods local quadratic models
implicit differentiation linearized solves

The tensor exists mathematically but never materializes computationally.

Low-Rank Structure

Higher-order structure is often approximately low rank.

For example, a Hessian may admit:

$$ H \approx U \Lambda U^\top. $$

Efficient systems exploit this through:

Method Idea
truncated eigendecomposition dominant curvature directions
randomized projections approximate subspaces
Kronecker factorization structured approximations
block-diagonal models independent parameter groups
Fisher approximations probabilistic geometry

These approximations are heavily used in large neural networks.

Implicit Higher-Order Differentiation

Sometimes explicit higher-order derivatives are avoided entirely.

Suppose:

$$ g(x,y)=0. $$

Differentiating implicitly yields:

$$ \frac{dy}{dx} = - \left( \frac{\partial g}{\partial y} \right)^{-1} \frac{\partial g}{\partial x}. $$

Higher derivatives can also be derived implicitly.

This avoids differentiating through every iteration of a solver.

Efficient methods therefore combine:

Technique Benefit
implicit differentiation avoid unrolled computation
operator solves avoid explicit inverses
matrix-free methods reduce storage
checkpointing control memory

Parallelism

Higher-order methods can expose significant parallel structure.

Examples:

Parallelism Source
independent directional derivatives separate seeds
tensor contractions batched kernels
coefficient convolutions Taylor mode
sparse recovery graph partitions
block Hessians domain decomposition

GPU and TPU systems often exploit batched derivative propagation for throughput.

Practical Design Rules

Efficient higher-order systems usually follow these rules:

Rule Reason
avoid explicit tensors storage explosion
expose operators scalable interfaces
exploit symmetry reduce computation
exploit sparsity reduce passes
checkpoint aggressively control memory
combine AD modes match workload geometry
separate derivative levels correctness
prefer contractions avoid materialization

These principles matter more than any individual derivative algorithm.

Conceptual Perspective

First-order AD computes local linear structure.

Higher-order AD computes local multilinear geometry.

Efficient higher-order methods succeed by preserving only the parts of that geometry actually needed by downstream algorithms.

The central question is therefore not:

$$ \text{How do we build the derivative tensor?} $$

but instead:

$$ \text{What action of the derivative tensor is computationally necessary?} $$

That shift from explicit representation to operator semantics is the foundation of scalable higher-order automatic differentiation.