Hessian-Vector Products
A Hessian-vector product computes
Hessian-Vector Products
A Hessian-vector product computes
$$ H_f(x)v = \nabla^2 f(x)v, $$
where
$$ f : \mathbb{R}^n \to \mathbb{R}, $$
and
$$ v \in \mathbb{R}^n. $$
This operation applies the Hessian as a linear operator to a vector without explicitly constructing the full Hessian matrix.
For large systems, this distinction is fundamental. A dense Hessian requires $O(n^2)$ storage. A Hessian-vector product requires only $O(n)$ storage for the input and output vectors.
Most scalable second-order optimization methods are built around Hessian-vector products rather than explicit Hessians.
Directional Interpretation
The Hessian-vector product measures how the gradient changes in direction $v$.
Define
$$ g(x) = \nabla f(x). $$
Then
$$ H_f(x)v = Dg(x)[v]. $$
Equivalently,
$$ H_f(x)v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}. $$
So a Hessian-vector product is simply the directional derivative of the gradient.
This identity is the basis of most AD implementations.
Geometric Meaning
The gradient defines the local slope field of a function. The Hessian describes how that slope field bends.
Given a direction $v$:
| Quantity | Meaning |
|---|---|
| $\nabla f(x)$ | local slope |
| $H_f(x)v$ | change of slope along $v$ |
| $v^\top H_f(x)v$ | curvature along $v$ |
The vector
$$ H_f(x)v $$
points in the direction the gradient moves when the input moves infinitesimally along $v$.
Example
Let
$$ f(x, y) = x^2 y + \sin y. $$
The gradient is
$$ \nabla f(x, y) = \begin{bmatrix} 2xy \ x^2 + \cos y \end{bmatrix}. $$
The Hessian is
$$ H_f(x, y) = \begin{bmatrix} 2y & 2x \ 2x & -\sin y \end{bmatrix}. $$
For
$$ v = \begin{bmatrix} v_1 \ v_2 \end{bmatrix}, $$
the Hessian-vector product is
$$ H_f(x, y)v = \begin{bmatrix} 2yv_1 + 2xv_2 \ 2xv_1 - \sin(y)v_2 \end{bmatrix}. $$
Notice that we never need to materialize the full Hessian matrix to compute this product. We only need the action of the matrix on the vector.
Why Hessian-Vector Products Matter
Suppose
$$ n = 10^6. $$
A dense Hessian contains
$$ 10^{12} $$
entries.
Even with 8-byte floating point values, storage would require multiple terabytes.
But a Hessian-vector product maps one vector to another:
$$ \mathbb{R}^n \to \mathbb{R}^n. $$
The storage cost is linear in $n$.
Many optimization algorithms only need this operator action.
| Algorithm | Needs full Hessian? | Needs HVP? |
|---|---|---|
| Newton-CG | no | yes |
| Conjugate gradient | no | yes |
| Lanczos | no | yes |
| Trust-region methods | sometimes | yes |
| Curvature diagnostics | no | yes |
| Krylov methods | no | yes |
This makes Hessian-vector products the central second-order primitive in large-scale optimization.
Forward-over-Reverse AD
The standard AD implementation uses forward-over-reverse differentiation.
First compute the gradient using reverse mode:
$$ g(x) = \nabla f(x). $$
Then differentiate the gradient computation using forward mode:
$$ Dg(x)[v] = H_f(x)v. $$
Algorithmically:
g = grad(f)
hvp = jvp(g, x, v)
This is efficient because reverse mode computes gradients well for scalar-output functions, and forward mode efficiently propagates a directional derivative through that gradient computation.
The total cost is usually comparable to a small multiple of gradient evaluation.
Reverse-over-Forward AD
An alternative uses reverse-over-forward.
First compute the directional derivative:
$$ \phi_v(x) = Df(x)[v] = \nabla f(x)^\top v. $$
Then differentiate this scalar function with reverse mode:
$$ \nabla \phi_v(x) = H_f(x)v. $$
Algorithmically:
phi(x) = jvp(f, x, v)
hvp = grad(phi)(x)
This is mathematically equivalent to forward-over-reverse.
The preferred method depends on:
| Factor | Influence |
|---|---|
| compiler structure | nesting overhead |
| tape implementation | reverse-mode complexity |
| memory reuse | checkpointing strategy |
| primitive support | availability of JVP/VJP rules |
| hardware backend | GPU fusion behavior |
Modern systems often support both.
Pearlmutter's Method
A classical result due to Pearlmutter gives an efficient way to compute Hessian-vector products directly.
The core identity is:
$$ H_f(x)v = R_v{\nabla f(x)}, $$
where
$$ R_v $$
is a forward-mode directional derivative operator.
This means:
- Run reverse mode to define the gradient computation.
- Push a tangent vector through that reverse computation.
The resulting complexity is approximately:
| Quantity | Complexity |
|---|---|
| function evaluation | $C$ |
| gradient evaluation | $\approx 2C$ |
| Hessian-vector product | $O(C)$ |
up to constant factors and implementation details.
This result is important because it avoids quadratic scaling.
Hessian-Free Methods
Many large-scale optimization methods are called Hessian-free because they never explicitly construct the Hessian.
Instead, they repeatedly query:
$$ v \mapsto H_f(x)v. $$
The Hessian becomes an implicit linear operator.
Newton-CG is a standard example.
Newton's equation is
$$ H_f(x)p = -\nabla f(x). $$
Rather than forming $H_f(x)$, conjugate gradient solves the linear system using only repeated Hessian-vector products.
Algorithmically:
while not converged:
w = hvp(v)
update Krylov basis
update search direction
This scales to extremely large systems because the algorithm never materializes the dense matrix.
Curvature Along Directions
A Hessian-vector product also gives directional curvature information.
The scalar
$$ v^\top H_f(x)v $$
measures curvature along direction $v$.
This quantity appears in:
| Area | Usage |
|---|---|
| optimization | trust-region methods |
| deep learning | curvature diagnostics |
| physics | local energy curvature |
| numerical analysis | stability estimation |
| statistics | Fisher-like approximations |
If
$$ v^\top H_f(x)v > 0, $$
the function bends upward along $v$.
If
$$ v^\top H_f(x)v < 0, $$
the function bends downward.
If the value is near zero, the direction is locally flat.
Finite Difference Approximation
A Hessian-vector product can be approximated using finite differences of gradients:
$$ H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}. $$
A centered approximation is more accurate:
$$ H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)} {2\epsilon}. $$
This requires only gradient evaluations.
However, finite differences introduce truncation error and floating point cancellation. AD-based Hessian-vector products are usually more accurate and often faster.
Linear Operator View
A Hessian-vector product defines a linear map:
$$ v \mapsto H_f(x)v. $$
The Hessian therefore behaves as a linear operator.
This suggests an important design principle.
Instead of exposing Hessians as dense arrays:
H = hessian(f)(x)
large systems should expose operator interfaces:
H = hessian_operator(f, x)
y = H.matvec(v)
This enables:
| Capability | Benefit |
|---|---|
| iterative solvers | no dense matrix |
| lazy evaluation | compute only needed products |
| sparse backends | exploit structure |
| distributed execution | shard operator application |
| GPU kernels | fuse operator evaluation |
This operator-centric design appears throughout modern scientific computing.
HVPs in Neural Networks
Deep neural networks rarely use explicit Hessians because parameter counts are enormous.
Still, Hessian-vector products are heavily used in:
| Application | Role |
|---|---|
| second-order optimization | curvature-aware updates |
| meta-learning | implicit gradients |
| influence functions | parameter sensitivity |
| pruning | curvature estimation |
| uncertainty estimation | local geometry |
| neural tangent analysis | linearization studies |
In practice, most large ML systems treat Hessians as implicit operators accessed through HVP queries.
Memory Considerations
A Hessian-vector product can often be computed with memory usage close to gradient computation.
However, nesting reverse mode inside reverse mode can dramatically increase memory consumption.
Forward-over-reverse is usually preferred because:
| Property | Forward-over-reverse |
|---|---|
| memory usage | moderate |
| implementation complexity | manageable |
| nesting stability | relatively good |
| operator efficiency | high |
Reverse-over-reverse may require differentiating tape management and adjoint accumulation logic, which is harder to optimize.
Structured Hessian Operators
Some systems expose specialized Hessian operators.
Examples include:
| Structure | Optimization |
|---|---|
| diagonal Hessian | store only diagonal |
| block Hessian | blockwise matvec |
| sparse Hessian | sparse kernels |
| low-rank Hessian | factored representation |
| Kronecker approximations | efficient inversion |
| Fisher approximations | probabilistic geometry |
The operator abstraction makes these interchangeable from the optimizer's perspective.
Practical Principle
The full Hessian is usually a debugging or analysis object.
The Hessian-vector product is the scalable computational primitive.
A modern AD system should therefore optimize:
$$ v \mapsto H_f(x)v $$
as a first-class operation rather than treating it as a derived convenience built from dense Hessian construction.