Hessian-Vector Products

A Hessian-vector product computes

$$ H_f(x)v = \nabla^2 f(x)v, $$

where

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

and

$$ v \in \mathbb{R}^n. $$

This operation applies the Hessian as a linear operator to a vector without explicitly constructing the full Hessian matrix.

For large systems, this distinction is fundamental. A dense Hessian requires $O(n^2)$ storage. A Hessian-vector product requires only $O(n)$ storage for the input and output vectors.

Most scalable second-order optimization methods are built around Hessian-vector products rather than explicit Hessians.

Directional Interpretation

The Hessian-vector product measures how the gradient changes in direction $v$.

Define

$$ g(x) = \nabla f(x). $$

Then

$$ H_f(x)v = Dg(x)[v]. $$

Equivalently,

$$ H_f(x)v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}. $$

So a Hessian-vector product is simply the directional derivative of the gradient.

This identity is the basis of most AD implementations.

Geometric Meaning

The gradient defines the local slope field of a function. The Hessian describes how that slope field bends.

Given a direction $v$:

Quantity	Meaning
$\nabla f(x)$	local slope
$H_f(x)v$	change of slope along $v$
$v^\top H_f(x)v$	curvature along $v$

The vector

$$ H_f(x)v $$

points in the direction the gradient moves when the input moves infinitesimally along $v$.

Example

Let

$$ f(x, y) = x^2 y + \sin y. $$

The gradient is

$$ \nabla f(x, y) = \begin{bmatrix} 2xy \ x^2 + \cos y \end{bmatrix}. $$

The Hessian is

$$ H_f(x, y) = \begin{bmatrix} 2y & 2x \ 2x & -\sin y \end{bmatrix}. $$

For

$$ v = \begin{bmatrix} v_1 \ v_2 \end{bmatrix}, $$

the Hessian-vector product is

$$ H_f(x, y)v = \begin{bmatrix} 2yv_1 + 2xv_2 \ 2xv_1 - \sin(y)v_2 \end{bmatrix}. $$

Notice that we never need to materialize the full Hessian matrix to compute this product. We only need the action of the matrix on the vector.

Why Hessian-Vector Products Matter

Suppose

$$ n = 10^6. $$

A dense Hessian contains

$$ 10^{12} $$

entries.

Even with 8-byte floating point values, storage would require multiple terabytes.

But a Hessian-vector product maps one vector to another:

$$ \mathbb{R}^n \to \mathbb{R}^n. $$

The storage cost is linear in $n$.

Many optimization algorithms only need this operator action.

Algorithm	Needs full Hessian?	Needs HVP?
Newton-CG	no	yes
Conjugate gradient	no	yes
Lanczos	no	yes
Trust-region methods	sometimes	yes
Curvature diagnostics	no	yes
Krylov methods	no	yes

This makes Hessian-vector products the central second-order primitive in large-scale optimization.

Forward-over-Reverse AD

The standard AD implementation uses forward-over-reverse differentiation.

First compute the gradient using reverse mode:

$$ g(x) = \nabla f(x). $$

Then differentiate the gradient computation using forward mode:

$$ Dg(x)[v] = H_f(x)v. $$

Algorithmically:

g = grad(f)
hvp = jvp(g, x, v)

This is efficient because reverse mode computes gradients well for scalar-output functions, and forward mode efficiently propagates a directional derivative through that gradient computation.

The total cost is usually comparable to a small multiple of gradient evaluation.

Reverse-over-Forward AD

An alternative uses reverse-over-forward.

First compute the directional derivative:

$$ \phi_v(x) = Df(x)[v] = \nabla f(x)^\top v. $$

Then differentiate this scalar function with reverse mode:

$$ \nabla \phi_v(x) = H_f(x)v. $$

Algorithmically:

phi(x) = jvp(f, x, v)
hvp = grad(phi)(x)

This is mathematically equivalent to forward-over-reverse.

The preferred method depends on:

Factor	Influence
compiler structure	nesting overhead
tape implementation	reverse-mode complexity
memory reuse	checkpointing strategy
primitive support	availability of JVP/VJP rules
hardware backend	GPU fusion behavior

Modern systems often support both.

Pearlmutter's Method

A classical result due to Pearlmutter gives an efficient way to compute Hessian-vector products directly.

The core identity is:

$$ H_f(x)v = R_v{\nabla f(x)}, $$

where

$$ R_v $$

is a forward-mode directional derivative operator.

This means:

Run reverse mode to define the gradient computation.
Push a tangent vector through that reverse computation.

The resulting complexity is approximately:

Quantity	Complexity
function evaluation	$C$
gradient evaluation	$\approx 2C$
Hessian-vector product	$O(C)$

up to constant factors and implementation details.

This result is important because it avoids quadratic scaling.

Hessian-Free Methods

Many large-scale optimization methods are called Hessian-free because they never explicitly construct the Hessian.

Instead, they repeatedly query:

$$ v \mapsto H_f(x)v. $$

The Hessian becomes an implicit linear operator.

Newton-CG is a standard example.

Newton's equation is

$$ H_f(x)p = -\nabla f(x). $$

Rather than forming $H_f(x)$, conjugate gradient solves the linear system using only repeated Hessian-vector products.

Algorithmically:

while not converged:
    w = hvp(v)
    update Krylov basis
    update search direction

This scales to extremely large systems because the algorithm never materializes the dense matrix.

Curvature Along Directions

A Hessian-vector product also gives directional curvature information.

The scalar

$$ v^\top H_f(x)v $$

measures curvature along direction $v$.

This quantity appears in:

Area	Usage
optimization	trust-region methods
deep learning	curvature diagnostics
physics	local energy curvature
numerical analysis	stability estimation
statistics	Fisher-like approximations

$$ v^\top H_f(x)v > 0, $$

the function bends upward along $v$.

$$ v^\top H_f(x)v < 0, $$

the function bends downward.

If the value is near zero, the direction is locally flat.

Finite Difference Approximation

A Hessian-vector product can be approximated using finite differences of gradients:

$$ H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}. $$

A centered approximation is more accurate:

$$ H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)} {2\epsilon}. $$

This requires only gradient evaluations.

However, finite differences introduce truncation error and floating point cancellation. AD-based Hessian-vector products are usually more accurate and often faster.

Linear Operator View

A Hessian-vector product defines a linear map:

$$ v \mapsto H_f(x)v. $$

The Hessian therefore behaves as a linear operator.

This suggests an important design principle.

Instead of exposing Hessians as dense arrays:

H = hessian(f)(x)

large systems should expose operator interfaces:

H = hessian_operator(f, x)

y = H.matvec(v)

This enables:

Capability	Benefit
iterative solvers	no dense matrix
lazy evaluation	compute only needed products
sparse backends	exploit structure
distributed execution	shard operator application
GPU kernels	fuse operator evaluation

This operator-centric design appears throughout modern scientific computing.

HVPs in Neural Networks

Deep neural networks rarely use explicit Hessians because parameter counts are enormous.

Still, Hessian-vector products are heavily used in:

Application	Role
second-order optimization	curvature-aware updates
meta-learning	implicit gradients
influence functions	parameter sensitivity
pruning	curvature estimation
uncertainty estimation	local geometry
neural tangent analysis	linearization studies

In practice, most large ML systems treat Hessians as implicit operators accessed through HVP queries.

Memory Considerations

A Hessian-vector product can often be computed with memory usage close to gradient computation.

However, nesting reverse mode inside reverse mode can dramatically increase memory consumption.

Forward-over-reverse is usually preferred because:

Property	Forward-over-reverse
memory usage	moderate
implementation complexity	manageable
nesting stability	relatively good
operator efficiency	high

Reverse-over-reverse may require differentiating tape management and adjoint accumulation logic, which is harder to optimize.

Structured Hessian Operators

Some systems expose specialized Hessian operators.

Examples include:

Structure	Optimization
diagonal Hessian	store only diagonal
block Hessian	blockwise matvec
sparse Hessian	sparse kernels
low-rank Hessian	factored representation
Kronecker approximations	efficient inversion
Fisher approximations	probabilistic geometry

The operator abstraction makes these interchangeable from the optimizer's perspective.

Practical Principle

The full Hessian is usually a debugging or analysis object.

The Hessian-vector product is the scalable computational primitive.

A modern AD system should therefore optimize:

$$ v \mapsto H_f(x)v $$

as a first-class operation rather than treating it as a derived convenience built from dense Hessian construction.