Sparse Tensor Derivatives

Most real computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or local interactions. Sparse representations reduce...

Sparse Tensor Derivatives

Most real computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or local interactions. Sparse representations reduce memory usage and computational cost by storing only the nonzero structure.

Automatic differentiation must preserve sparsity whenever possible. If differentiation destroys sparsity structure, the resulting gradients can become prohibitively expensive.

Sparse differentiation therefore has two goals:

  1. Compute correct derivatives.
  2. Preserve structural sparsity through the derivative pipeline.

Sparsity

A tensor is sparse if most entries are zero.

For a matrix

$$ A \in \mathbb{R}^{m \times n}, $$

define:

$$ \operatorname{nnz}(A) = \text{number of nonzero entries}. $$

If

$$ \operatorname{nnz}(A) \ll mn, $$

then sparse representations are beneficial.

Examples:

Domain Sparse Structure
Graphs Sparse adjacency matrices
PDE discretization Local stencil matrices
Recommender systems Sparse user-item interactions
NLP Sparse bag-of-words vectors
Scientific simulation Sparse Jacobians
Optimization Sparse constraints

AD systems that ignore sparsity often become memory-bound long before arithmetic becomes expensive.

Sparse Storage Formats

Sparse tensors are stored structurally.

Common formats:

Format Structure
COO Coordinate list
CSR Compressed sparse row
CSC Compressed sparse column
BSR Block sparse row
ELLPACK Fixed-width row storage
Hybrid sparse Mixed sparse/dense layout

For COO:

row_indices
column_indices
values

represent the tensor.

Example:

$$ A = \begin{bmatrix} 1 & 0 & 0 \ 0 & 0 & 2 \ 3 & 0 & 4 \end{bmatrix} $$

COO storage:

rows   = [0,1,2,2]
cols   = [0,2,0,2]
values = [1,2,3,4]

The derivative rules depend on whether sparsity structure is fixed or variable.

Sparse Structure vs Sparse Values

This distinction is critical.

A sparse tensor has:

  1. Structural sparsity pattern.
  2. Numerical values.

Example:

indices = fixed
values  = differentiable

Most AD systems differentiate the values but not the indices.

The indices are discrete objects. Ordinary differentiation does not apply to them.

Thus sparse tensor differentiation usually assumes:

$$ \text{sparsity pattern is fixed locally}. $$

This is analogous to pivot choices in matrix factorizations.

Sparse Matrix-Vector Product

Consider:

$$ y = Ax, $$

where $A$ is sparse.

The forward differential is:

$$ dy = dA,x + A,dx. $$

The reverse rules remain:

$$ \bar{A} \mathrel{+}= \bar{y}x^T, $$

$$ \bar{x} \mathrel{+}= A^T\bar{y}. $$

But sparse structure changes implementation significantly.

If $A$ has fixed sparsity pattern:

  • Only stored entries receive gradients.
  • Zero entries outside the pattern are not materialized.

Thus:

$$ \bar{A}_{ij} = 0 $$

for structurally absent entries.

The gradient inherits the sparsity mask.

Sparse Gradient Projection

Suppose:

$$ A_{ij} = 0 $$

structurally.

Even if the dense gradient formula would produce a nonzero value there, the sparse parameterization prevents that entry from existing.

Thus sparse differentiation often applies a projection:

$$ \bar{A} \leftarrow P_{\mathcal{S}}(\bar{A}), $$

where:

$$ \mathcal{S} = \text{sparsity pattern}. $$

This means:

$$ (P_{\mathcal{S}}(G)){ij} = \begin{cases} G{ij}, & (i,j)\in\mathcal{S}, \ 0, & \text{otherwise}. \end{cases} $$

The parameter space itself is sparse.

Sparse Jacobians

Many functions have sparse Jacobians.

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

The Jacobian:

$$ J_{ij} = \frac{\partial f_i}{\partial x_j} $$

is sparse if most outputs depend only on nearby inputs.

Examples:

System Jacobian Structure
PDE stencil Local neighborhood
ODE integrator Banded structure
Graph computation Edge-local dependence
Physical simulation Sparse interactions

Sparse Jacobians are central to scientific computing.

Jacobian Sparsity Propagation

Sparsity propagates structurally through composition.

Suppose:

$$ f(x) = h(g(x)). $$

Then:

$$ J_f = J_h J_g. $$

Even if both factors are sparse, the product may become denser.

This is called fill-in.

Fill-in is one of the main computational problems in sparse AD.

Example:

Matrix Type Multiplication Result
Diagonal × diagonal Diagonal
Banded × banded Wider band
Sparse graph adjacency powers Increasing density

AD systems must balance:

sparsity preservation
reordering
factorization cost
memory growth

Forward Mode and Sparse Jacobians

Forward mode computes:

$$ Jv. $$

If the Jacobian is sparse, one can compute multiple directional derivatives simultaneously using graph coloring.

Suppose columns of the Jacobian do not overlap structurally. Then they can share a perturbation seed.

This reduces the number of forward passes.

Example:

seed matrix S

produces:

$$ JS. $$

Careful seed design reconstructs the sparse Jacobian from fewer evaluations.

This technique is widely used in scientific computing AD systems.

Graph Coloring

Graph coloring reduces directional derivative evaluations.

Define a column-interference graph:

  • Each Jacobian column is a node.
  • Two columns share an edge if they may contribute to the same output row.

Columns with different colors can be seeded together.

If the graph uses $k$ colors, only $k$ forward passes are needed instead of $n$.

Example:

Problem Variables Colors Needed
Dense Jacobian $n$ $n$
Tridiagonal Jacobian $n$ 3
Grid stencil Large $n$ Small constant

Sparse coloring can dramatically reduce AD cost.

Reverse Mode and Sparse Accumulation

Reverse mode computes:

$$ J^T u. $$

Sparse reverse mode faces different challenges:

Problem Cause
Sparse accumulation Many paths contribute
Atomic updates Parallel accumulation
Fill-in Adjoint propagation
Memory locality Irregular structure

Sparse reverse kernels often require scatter-add operations:

for edge (i,j):
    grad_x[j] += local_grad * grad_y[i]

Parallel sparse accumulation is difficult because many threads may update the same gradient location.

Sparse Hessians

Second derivatives are often sparse even when first derivatives are not.

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

The Hessian:

$$ H_{ij} = \frac{\partial^2 f} {\partial x_i \partial x_j} $$

is sparse when variables interact locally.

Examples:

System Hessian Structure
Mechanical systems Local coupling
PDE discretization Sparse stencil
Graph optimization Edge-local structure

Sparse Hessians are important because dense Hessians scale as:

$$ O(n^2). $$

Sparse methods can reduce complexity dramatically.

Hessian-Vector Products

Most systems avoid forming Hessians explicitly.

Instead they compute:

$$ Hv. $$

This preserves sparsity implicitly.

Methods include:

Method Technique
Forward-over-reverse JVP through reverse pass
Reverse-over-forward VJP through forward pass
Pearlmutter trick Efficient Hessian-vector product

Sparse Hessian-vector products are fundamental in large-scale optimization.

Sparse Solves in Reverse Mode

Sparse linear solves appear frequently in implicit differentiation.

Suppose:

$$ Ax = b, $$

with sparse $A$.

The reverse rule requires:

$$ A^T g = \bar{x}. $$

Thus the backward pass itself requires a sparse solve.

The cost of reverse mode may therefore approach the cost of another numerical solve.

Efficient sparse AD often depends on:

symbolic factorization reuse
fill-reducing reorderings
cached sparsity structure
iterative solves

Sparse Tensor Contraction

Sparse tensor contraction generalizes sparse matrix multiplication.

Suppose:

$$ C_{ij} = \sum_k A_{ik}B_{kj}. $$

Only nonzero entries contribute.

Efficient sparse contraction requires:

Concern Importance
Index ordering Memory locality
Duplicate accumulation Correctness
Intermediate fill-in Complexity
Parallel scheduling Throughput

The reverse rules remain algebraically identical to dense contraction. The challenge is structural efficiency.

Dynamic Sparsity

Some systems change sparsity patterns dynamically.

Examples:

System Dynamic Structure
Attention masks Input-dependent edges
Pruned neural networks Learned sparsity
Adaptive meshes Refinement/coarsening
Sparse routing Conditional execution

Dynamic sparsity creates discontinuities.

Changing whether an entry exists is a discrete operation.

AD usually treats the current sparsity pattern as fixed during a backward pass.

Sparse Attention

Transformer sparsity patterns are increasingly important.

Examples:

Pattern Structure
Local attention Banded
Block sparse attention Block structure
Routing attention Dynamic graph
Long-context attention Hierarchical sparsity

Sparse attention derivatives require efficient scatter/gather kernels and structured reductions.

Memory bandwidth often dominates arithmetic cost.

Sparse Graph Operations

Graph neural networks naturally produce sparse operators.

A graph adjacency matrix:

$$ A_{ij} = 1 $$

if an edge exists.

Message passing often has form:

$$ H' = AH W. $$

Reverse mode propagates gradients through sparse neighborhood aggregation.

Sparse graph AD must handle:

edge accumulation
neighbor reduction
scatter-add
batch graph packing
irregular memory access

Graph differentiation is fundamentally sparse differentiation.

Structured Sparsity

Not all sparsity is random.

Structured patterns include:

Structure Example
Diagonal Scaling operators
Banded PDE discretization
Block sparse Attention
Toeplitz Signal processing
Kronecker Tensor decompositions

Structured sparsity enables specialized derivative kernels.

Example:

A banded matrix with bandwidth $k$ allows:

$$ O(kn) $$

operations instead of:

$$ O(n^2). $$

AD systems that preserve structure can inherit these savings.

Sparse Layout Metadata

A sparse tensor primitive must track:

indices
storage format
shape
batch axes
duplicate semantics
sortedness
coalescing state

Two sparse tensors may represent the same values but differ operationally because their storage layouts differ.

Gradient kernels must preserve consistency.

Sparse Accumulation Semantics

Sparse systems often permit duplicate coordinates:

(i,j,val1)
(i,j,val2)

Semantics may define:

$$ A_{ij} = val1 + val2. $$

Backward accumulation must then preserve this additive interpretation.

Sparse differentiation therefore depends on precise storage semantics.

Implementation Challenges

Sparse AD introduces difficult systems problems:

Challenge Explanation
Irregular memory access Poor cache locality
Atomic gradient accumulation Parallel conflicts
Fill-in Growing intermediate density
Layout conversion COO ↔ CSR ↔ CSC
Dynamic sparsity Discrete structure changes
GPU utilization Irregular workloads

Sparse kernels are often memory-bound rather than compute-bound.

Practical Guidance

Prefer:

Good Practice Avoid
Fixed sparsity structure Constant layout changes
Structured sparsity Arbitrary irregular sparsity
Hessian-vector products Dense Hessian materialization
Sparse solves Dense inversion
Graph coloring Full Jacobian assembly
Projector gradients Materializing dense masks

For AD implementers:

Focus Area Importance
Accumulation correctness Critical
Sparse metadata tracking Critical
Efficient scatter-add Essential
Fill-in control Major performance issue
Layout-aware kernels Essential

Summary

Sparse tensor differentiation extends ordinary AD with structural constraints. The mathematical derivative rules remain mostly unchanged, but the implementation must preserve sparsity patterns, avoid unnecessary fill-in, and manage irregular memory behavior efficiently.

The key distinction is between sparse values and sparse structure. Values are differentiable. Structural sparsity patterns are usually treated as fixed local topology. Efficient sparse AD depends on preserving that topology throughout forward and reverse computation.