Efficient Seeding Strategies

Forward mode automatic differentiation computes Jacobian-vector products:

Efficient Seeding Strategies

Forward mode automatic differentiation computes Jacobian-vector products:

$$ J_f(x)v. $$

The seed vector

$$ v $$

determines which derivative information propagates through the program. Choosing good seeds is therefore central to efficiency.

A naive approach computes one Jacobian column at a time using basis vectors:

$$ e_1, e_2, \ldots, e_n. $$

This always works, but it may waste computation when the Jacobian has structure. Efficient seeding strategies reduce the number of forward passes, reduce tangent storage, or improve hardware utilization.

Basis seeding

The simplest strategy uses standard basis vectors.

For

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

seed:

$$ v = e_i. $$

Then forward mode computes:

$$ J_f(x)e_i, $$

which equals the $i$-th Jacobian column.

For example, if

$$ n = 3, $$

then:

$$ e_1 = \begin{bmatrix} 1 \ 0 \ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \ 1 \ 0 \end{bmatrix}, \qquad e_3 = \begin{bmatrix} 0 \ 0 \ 1 \end{bmatrix}. $$

The full Jacobian is assembled column-by-column:

$$ J_f(x) = \begin{bmatrix} J_f(x)e_1 & J_f(x)e_2 & J_f(x)e_3 \end{bmatrix}. $$

This method is simple and numerically stable. Its cost is proportional to the input dimension:

$$ O(nC_f). $$

Dense identity seeding

Instead of one basis vector per pass, vector forward mode can seed all basis directions simultaneously.

Use:

$$ V = I_n. $$

Then:

$$ J_f(x)V = J_f(x). $$

This computes the entire Jacobian in one execution.

However, each variable now carries an $n$-dimensional tangent vector. The runtime and memory cost become:

$$ O(nC_f), \qquad O(nM_f). $$

This strategy is practical only for moderate $n$.

Arbitrary directional seeding

Sometimes only directional sensitivity is needed.

Suppose we want:

$$ J_f(x)v. $$

Then we directly seed:

$$ \dot{x} = v. $$

This avoids constructing the full Jacobian.

Applications include:

Application Direction meaning
sensitivity analysis perturbation direction
optimization search direction
PDE solvers linearized update
continuation methods parameter path
implicit differentiation perturbation constraint

Directional seeding is often the most efficient use of forward mode.

Random seeding

Random seeds are useful for probabilistic estimation and debugging.

Choose:

$$ v_i \sim \mathcal{N}(0,1). $$

Then compute:

$$ J_f(x)v. $$

Applications include:

Use case Goal
Jacobian norm estimation approximate operator norm
randomized linear algebra low-rank structure
derivative testing detect implementation bugs
trace estimation Hutchinson-type methods
sensitivity sampling stochastic analysis

Random directional derivatives often reveal structural problems without building the full Jacobian.

Structured seeding

Many Jacobians are sparse. Basis seeding wastes work because most derivative entries are zero.

Suppose:

$$ J_f(x) = \begin{bmatrix}

  • & 0 & * & 0 \ 0 & * & 0 & * \
  • & 0 & 0 & 0 \end{bmatrix}. $$

Columns 1 and 2 never contribute to the same output row. Their seed directions can therefore share one tangent dimension.

Instead of:

$$ e_1, \quad e_2, \quad e_3, \quad e_4, $$

we may use compressed seeds:

$$ v_1 = \begin{bmatrix} 1 \ 1 \ 0 \ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0 \ 0 \ 1 \ 1 \end{bmatrix}. $$

Now two forward passes recover all columns.

This is the basis of compressed Jacobian computation.

Graph coloring

Efficient sparse seeding is usually formulated as a graph coloring problem.

Define the column interference graph:

  • each Jacobian column is a node,
  • two columns share an edge if they can affect the same output row.

Columns with no conflict may share a tangent direction.

Coloring assigns independent columns the same seed channel.

If the coloring uses $k$ colors, the Jacobian can be recovered in $k$ forward passes instead of $n$.

For sparse systems:

$$ k \ll n. $$

This can reduce derivative cost dramatically.

Applications:

Domain Sparsity source
PDE discretization local coupling
finite element methods mesh locality
circuit simulation component connectivity
robotics chain topology
optimization block constraints

Example of compressed seeding

Suppose:

$$ J = \begin{bmatrix} a & 0 & c \ 0 & b & 0 \end{bmatrix}. $$

Columns 1 and 2 do not overlap in rows.

Use compressed seed:

$$ v = \begin{bmatrix} 1 \ 1 \ 0 \end{bmatrix}. $$

Then:

$$ Jv = \begin{bmatrix} a \ b \end{bmatrix}. $$

One forward pass recovers both columns simultaneously.

Next seed:

$$ v = \begin{bmatrix} 0 \ 0 \ 1 \end{bmatrix}. $$

This gives:

$$ \begin{bmatrix} c \ 0 \end{bmatrix}. $$

Only two passes are needed instead of three.

Block seeding

Variables often appear naturally in groups.

Suppose:

$$ x = (x^{(1)}, x^{(2)}, \ldots, x^{(k)}). $$

Each block may correspond to:

  • one physical subsystem,
  • one tensor,
  • one parameter group,
  • one mesh region,
  • one neural layer.

Block seeding assigns tangent dimensions per block rather than per scalar variable.

Advantages:

Benefit Explanation
better cache locality contiguous tangent access
vectorized kernels tensor-friendly layout
simpler memory management fewer sparse structures
reduced overhead coarser granularity

This is common in tensor AD systems.

Tensor seeding

Machine learning systems usually work with tensors rather than scalar variables.

Suppose:

$$ W \in \mathbb{R}^{m \times n}. $$

Instead of seeding each scalar independently, a tangent tensor is attached:

$$ \dot{W} \in \mathbb{R}^{m \times n}. $$

Tensor-level seeding allows:

  • batched JVPs,
  • hardware tensor acceleration,
  • fused tangent kernels,
  • layerwise differentiation.

Modern frameworks frequently optimize around tensor tangent propagation rather than scalar dual arithmetic.

Seeding for Hessian-vector products

Forward mode combines naturally with reverse mode.

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

Its gradient is:

$$ \nabla f(x). $$

Differentiate the gradient in direction $v$:

$$ H_f(x)v, $$

where $H_f(x)$ is the Hessian.

A common strategy:

  1. compute gradient with reverse mode,
  2. apply forward mode to the reverse computation.

The forward seed defines the Hessian-vector direction.

This avoids explicit Hessian construction.

Seed matrices

General vector forward mode uses a seed matrix:

$$ V \in \mathbb{R}^{n \times k}. $$

Each column is one tangent direction.

The output is:

$$ J_f(x)V. $$

Choice of $V$ determines:

Seed type Result
identity full Jacobian
basis subset selected columns
random stochastic directional probes
sparse compressed sparse Jacobian recovery
low-rank projected sensitivities
block structured subsystem derivatives

Thus seeding defines the derivative query algebraically.

Adaptive seeding

Some systems choose seeds dynamically.

Example strategies:

Strategy Goal
runtime sparsity discovery reduce tangent dimensions
adaptive coloring exploit observed dependencies
low-rank updates reduce directional count
dynamic batching improve GPU occupancy
selective propagation avoid inactive paths

Adaptive methods are important when derivative structure changes across executions.

Seed reuse

In iterative algorithms, the same derivative directions often appear repeatedly.

Examples:

Algorithm Reused direction
Newton-Krylov methods Krylov basis vectors
continuation methods path tangent
optimization search direction
sensitivity analysis parameter perturbations

Caching or reusing tangent allocations can reduce overhead.

Some systems preallocate tangent buffers to avoid repeated memory allocation during many JVP evaluations.

Forward-over-forward seeding

Nested forward mode introduces hierarchical tangent structures.

Suppose:

$$ x + \epsilon_1 a + \epsilon_2 b + \epsilon_1\epsilon_2 c. $$

Now seeds exist at multiple derivative levels.

This enables:

Nesting Result
forward-over-forward second derivatives
vector-forward-over-forward Hessians
nested sparse tangents structured higher-order tensors

Seed management becomes increasingly important as derivative order grows.

Numerical conditioning

The seed direction also affects numerical behavior.

If:

$$ J_f(x)v \approx 0, $$

small floating point errors may dominate.

Large or badly scaled seed vectors can amplify instability.

Good practice includes:

  • scaling tangent directions,
  • normalizing perturbation vectors,
  • avoiding extremely sparse tiny magnitudes,
  • using structured seeds aligned with problem geometry.

This becomes important in ill-conditioned scientific problems.

Seeding and hardware layout

Efficient implementations often organize tangent dimensions to match hardware.

CPU implementations may use:

  • SIMD lane packing,
  • structure-of-arrays layouts,
  • cache-aligned tangent blocks.

GPU implementations may use:

  • batched tensor kernels,
  • warp-aligned tangent dimensions,
  • fused primal-tangent kernels.

The algebraic seed structure therefore interacts directly with low-level systems design.

Tradeoffs

Different seeding strategies optimize different goals.

Strategy Best for Weakness
basis seeding simplicity many passes
dense identity small full Jacobians memory explosion
directional sensitivity queries incomplete information
sparse compressed sparse systems coloring overhead
block seeding tensor programs coarse granularity
random stochastic estimation approximate results

No single strategy dominates all workloads.

Summary

The seed vector or seed matrix determines which derivative information forward mode computes. Efficient seeding strategies exploit structure in the derivative problem to reduce runtime, memory usage, and tangent dimensionality.

The simplest strategy uses basis vectors to recover Jacobian columns individually. More advanced methods use compressed sparse seeds, graph coloring, block tangents, tensor seeds, or randomized projections. In large-scale systems, the efficiency of forward mode depends as much on seeding strategy as on the differentiation rules themselves.