Hessian-Vector Products

A Hessian-vector product computes

Hessian-Vector Products

A Hessian-vector product computes

$$ H_f(x)v = \nabla^2 f(x)v, $$

where

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

and

$$ v \in \mathbb{R}^n. $$

This operation applies the Hessian as a linear operator to a vector without explicitly constructing the full Hessian matrix.

For large systems, this distinction is fundamental. A dense Hessian requires $O(n^2)$ storage. A Hessian-vector product requires only $O(n)$ storage for the input and output vectors.

Most scalable second-order optimization methods are built around Hessian-vector products rather than explicit Hessians.

Directional Interpretation

The Hessian-vector product measures how the gradient changes in direction $v$.

Define

$$ g(x) = \nabla f(x). $$

Then

$$ H_f(x)v = Dg(x)[v]. $$

Equivalently,

$$ H_f(x)v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}. $$

So a Hessian-vector product is simply the directional derivative of the gradient.

This identity is the basis of most AD implementations.

Geometric Meaning

The gradient defines the local slope field of a function. The Hessian describes how that slope field bends.

Given a direction $v$:

Quantity Meaning
$\nabla f(x)$ local slope
$H_f(x)v$ change of slope along $v$
$v^\top H_f(x)v$ curvature along $v$

The vector

$$ H_f(x)v $$

points in the direction the gradient moves when the input moves infinitesimally along $v$.

Example

Let

$$ f(x, y) = x^2 y + \sin y. $$

The gradient is

$$ \nabla f(x, y) = \begin{bmatrix} 2xy \ x^2 + \cos y \end{bmatrix}. $$

The Hessian is

$$ H_f(x, y) = \begin{bmatrix} 2y & 2x \ 2x & -\sin y \end{bmatrix}. $$

For

$$ v = \begin{bmatrix} v_1 \ v_2 \end{bmatrix}, $$

the Hessian-vector product is

$$ H_f(x, y)v = \begin{bmatrix} 2yv_1 + 2xv_2 \ 2xv_1 - \sin(y)v_2 \end{bmatrix}. $$

Notice that we never need to materialize the full Hessian matrix to compute this product. We only need the action of the matrix on the vector.

Why Hessian-Vector Products Matter

Suppose

$$ n = 10^6. $$

A dense Hessian contains

$$ 10^{12} $$

entries.

Even with 8-byte floating point values, storage would require multiple terabytes.

But a Hessian-vector product maps one vector to another:

$$ \mathbb{R}^n \to \mathbb{R}^n. $$

The storage cost is linear in $n$.

Many optimization algorithms only need this operator action.

Algorithm Needs full Hessian? Needs HVP?
Newton-CG no yes
Conjugate gradient no yes
Lanczos no yes
Trust-region methods sometimes yes
Curvature diagnostics no yes
Krylov methods no yes

This makes Hessian-vector products the central second-order primitive in large-scale optimization.

Forward-over-Reverse AD

The standard AD implementation uses forward-over-reverse differentiation.

First compute the gradient using reverse mode:

$$ g(x) = \nabla f(x). $$

Then differentiate the gradient computation using forward mode:

$$ Dg(x)[v] = H_f(x)v. $$

Algorithmically:

g = grad(f)
hvp = jvp(g, x, v)

This is efficient because reverse mode computes gradients well for scalar-output functions, and forward mode efficiently propagates a directional derivative through that gradient computation.

The total cost is usually comparable to a small multiple of gradient evaluation.

Reverse-over-Forward AD

An alternative uses reverse-over-forward.

First compute the directional derivative:

$$ \phi_v(x) = Df(x)[v] = \nabla f(x)^\top v. $$

Then differentiate this scalar function with reverse mode:

$$ \nabla \phi_v(x) = H_f(x)v. $$

Algorithmically:

phi(x) = jvp(f, x, v)
hvp = grad(phi)(x)

This is mathematically equivalent to forward-over-reverse.

The preferred method depends on:

Factor Influence
compiler structure nesting overhead
tape implementation reverse-mode complexity
memory reuse checkpointing strategy
primitive support availability of JVP/VJP rules
hardware backend GPU fusion behavior

Modern systems often support both.

Pearlmutter's Method

A classical result due to Pearlmutter gives an efficient way to compute Hessian-vector products directly.

The core identity is:

$$ H_f(x)v = R_v{\nabla f(x)}, $$

where

$$ R_v $$

is a forward-mode directional derivative operator.

This means:

  1. Run reverse mode to define the gradient computation.
  2. Push a tangent vector through that reverse computation.

The resulting complexity is approximately:

Quantity Complexity
function evaluation $C$
gradient evaluation $\approx 2C$
Hessian-vector product $O(C)$

up to constant factors and implementation details.

This result is important because it avoids quadratic scaling.

Hessian-Free Methods

Many large-scale optimization methods are called Hessian-free because they never explicitly construct the Hessian.

Instead, they repeatedly query:

$$ v \mapsto H_f(x)v. $$

The Hessian becomes an implicit linear operator.

Newton-CG is a standard example.

Newton's equation is

$$ H_f(x)p = -\nabla f(x). $$

Rather than forming $H_f(x)$, conjugate gradient solves the linear system using only repeated Hessian-vector products.

Algorithmically:

while not converged:
    w = hvp(v)
    update Krylov basis
    update search direction

This scales to extremely large systems because the algorithm never materializes the dense matrix.

Curvature Along Directions

A Hessian-vector product also gives directional curvature information.

The scalar

$$ v^\top H_f(x)v $$

measures curvature along direction $v$.

This quantity appears in:

Area Usage
optimization trust-region methods
deep learning curvature diagnostics
physics local energy curvature
numerical analysis stability estimation
statistics Fisher-like approximations

If

$$ v^\top H_f(x)v > 0, $$

the function bends upward along $v$.

If

$$ v^\top H_f(x)v < 0, $$

the function bends downward.

If the value is near zero, the direction is locally flat.

Finite Difference Approximation

A Hessian-vector product can be approximated using finite differences of gradients:

$$ H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}. $$

A centered approximation is more accurate:

$$ H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)} {2\epsilon}. $$

This requires only gradient evaluations.

However, finite differences introduce truncation error and floating point cancellation. AD-based Hessian-vector products are usually more accurate and often faster.

Linear Operator View

A Hessian-vector product defines a linear map:

$$ v \mapsto H_f(x)v. $$

The Hessian therefore behaves as a linear operator.

This suggests an important design principle.

Instead of exposing Hessians as dense arrays:

H = hessian(f)(x)

large systems should expose operator interfaces:

H = hessian_operator(f, x)

y = H.matvec(v)

This enables:

Capability Benefit
iterative solvers no dense matrix
lazy evaluation compute only needed products
sparse backends exploit structure
distributed execution shard operator application
GPU kernels fuse operator evaluation

This operator-centric design appears throughout modern scientific computing.

HVPs in Neural Networks

Deep neural networks rarely use explicit Hessians because parameter counts are enormous.

Still, Hessian-vector products are heavily used in:

Application Role
second-order optimization curvature-aware updates
meta-learning implicit gradients
influence functions parameter sensitivity
pruning curvature estimation
uncertainty estimation local geometry
neural tangent analysis linearization studies

In practice, most large ML systems treat Hessians as implicit operators accessed through HVP queries.

Memory Considerations

A Hessian-vector product can often be computed with memory usage close to gradient computation.

However, nesting reverse mode inside reverse mode can dramatically increase memory consumption.

Forward-over-reverse is usually preferred because:

Property Forward-over-reverse
memory usage moderate
implementation complexity manageable
nesting stability relatively good
operator efficiency high

Reverse-over-reverse may require differentiating tape management and adjoint accumulation logic, which is harder to optimize.

Structured Hessian Operators

Some systems expose specialized Hessian operators.

Examples include:

Structure Optimization
diagonal Hessian store only diagonal
block Hessian blockwise matvec
sparse Hessian sparse kernels
low-rank Hessian factored representation
Kronecker approximations efficient inversion
Fisher approximations probabilistic geometry

The operator abstraction makes these interchangeable from the optimizer's perspective.

Practical Principle

The full Hessian is usually a debugging or analysis object.

The Hessian-vector product is the scalable computational primitive.

A modern AD system should therefore optimize:

$$ v \mapsto H_f(x)v $$

as a first-class operation rather than treating it as a derived convenience built from dense Hessian construction.