Singular Value Decomposition

The singular value decomposition SVD is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares,...

Singular Value Decomposition

The singular value decomposition (SVD) is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares, low-rank approximation, spectral regularization, recommendation systems, scientific computing, and modern machine learning.

For automatic differentiation, SVD is both powerful and difficult. Singular values are comparatively stable. Singular vectors are not. The derivative formulas expose this instability directly.

Definition

For a matrix

$$ A \in \mathbb{R}^{m \times n}, $$

the singular value decomposition is

$$ A = U\Sigma V^T, $$

where:

$$ U \in \mathbb{R}^{m \times m}, \qquad U^TU = I, $$

$$ V \in \mathbb{R}^{n \times n}, \qquad V^TV = I, $$

and

$$ \Sigma = \operatorname{diag}(\sigma_1,\ldots,\sigma_r), $$

with

$$ \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0. $$

The singular values are the square roots of the eigenvalues of

$$ A^TA $$

or

$$ AA^T. $$

The columns of $U$ are left singular vectors. The columns of $V$ are right singular vectors.

Most numerical systems use reduced SVD:

$$ U \in \mathbb{R}^{m \times r}, \qquad \Sigma \in \mathbb{R}^{r \times r}, \qquad V \in \mathbb{R}^{n \times r}, $$

where

$$ r = \operatorname{rank}(A). $$

Why SVD Matters

SVD provides optimal low-rank structure.

The best rank-$k$ approximation of $A$ in Frobenius norm is

$$ A_k = U_k \Sigma_k V_k^T. $$

Applications include:

Application Role
PCA Principal components
Compression Low-rank approximation
Recommender systems Matrix factorization
Scientific computing Ill-conditioned systems
Optimization Spectral regularization
Vision Latent representations
NLP Embedding reduction
Control theory Balanced truncation

Because SVD appears inside many differentiable systems, robust AD rules are essential.

Differential of Singular Values

The singular values are comparatively well behaved.

Suppose singular values are distinct. Let

$$ u_i $$

and

$$ v_i $$

be corresponding singular vectors:

$$ Av_i = \sigma_i u_i, $$

$$ A^T u_i = \sigma_i v_i. $$

Differentiate:

$$ dA,v_i + A,dv_i = d\sigma_i,u_i + \sigma_i,du_i. $$

Left-multiply by $u_i^T$:

$$ u_i^T dA,v_i + u_i^T A,dv_i = d\sigma_i + \sigma_i u_i^T du_i. $$

Using

$$ u_i^T A = \sigma_i v_i^T, $$

we obtain

$$ u_i^T A,dv_i = \sigma_i v_i^T dv_i. $$

Because singular vectors remain normalized,

$$ u_i^T du_i = 0, \qquad v_i^T dv_i = 0. $$

Therefore:

$$ d\sigma_i = u_i^T dA,v_i. $$

This gives the gradient:

$$ \nabla_A \sigma_i = u_i v_i^T. $$

This rule is stable even when singular vectors themselves are unstable.

Reverse Rule for Singular Values

Suppose a scalar loss depends only on singular values:

$$ L = f(\sigma_1,\ldots,\sigma_r). $$

Let

$$ \bar{\sigma}_i = \frac{\partial L}{\partial \sigma_i}. $$

Then:

$$ dL = \sum_i \bar{\sigma}_i d\sigma_i. $$

Substitute:

$$ dL = \sum_i \bar{\sigma}_i u_i^T dA,v_i. $$

Rewriting:

$$ dL = \operatorname{tr} \left( U,\operatorname{diag}(\bar{\sigma}),V^T dA^T \right). $$

Therefore:

$$ \bar{A} = U,\operatorname{diag}(\bar{\sigma}),V^T. $$

This is one of the most important spectral gradient formulas.

Singular Vector Derivatives

Singular vectors are much more delicate.

Differentiating them produces terms involving:

$$ \frac{1}{\sigma_i^2 - \sigma_j^2}. $$

If singular values are close, the derivatives become large. If singular values are equal, the derivatives become undefined.

This is analogous to eigenvector differentiation.

The instability arises because singular subspaces are intrinsic, but individual basis vectors are not.

If

$$ \sigma_i = \sigma_j, $$

then any orthonormal basis of the corresponding singular subspace is valid.

The decomposition is therefore non-unique.

Sign Ambiguity

If

$$ (u_i, v_i) $$

is a singular vector pair, then

$$ (-u_i, -v_i) $$

represents the same singular mode.

Thus singular vectors have unavoidable sign ambiguity.

Small numerical perturbations may flip signs unpredictably:

$$ u_i \to -u_i, \qquad v_i \to -v_i. $$

The forward computation remains correct, but gradients involving singular vectors may change discontinuously.

Losses depending directly on singular vector coordinates are therefore fragile.

Full Reverse Rule

Suppose the loss depends on both singular values and singular vectors:

$$ L(U,\Sigma,V). $$

Let:

$$ \bar{U} = \frac{\partial L}{\partial U}, \qquad \bar{\Sigma} = \frac{\partial L}{\partial \Sigma}, \qquad \bar{V} = \frac{\partial L}{\partial V}. $$

Define matrix $F$:

$$ F_{ij} = \begin{cases} \frac{1}{\sigma_i^2 - \sigma_j^2}, & i \ne j, \ 0, & i=j. \end{cases} $$

The backward rule contains terms:

$$ F \odot (U^T\bar{U} - \bar{U}^T U), $$

and similarly for $V$.

The exact expression is complicated, but the important structure is simple:

  1. Singular-value gradients are stable.
  2. Singular-vector gradients contain inverse spectral gaps.
  3. Repeated singular values create undefined derivatives.

Geometric Interpretation

SVD identifies orthogonal subspaces.

The stable object is the subspace projector:

$$ UU^T, \qquad VV^T. $$

The basis vectors themselves are arbitrary coordinates inside the subspace.

This distinction is critical.

Good objectives depend on:

Stable Quantity Unstable Quantity
Singular values Singular vector signs
Subspace projectors Individual basis vectors
Low-rank reconstructions Raw singular coordinates
Spectral norms Ordered vector identities

When possible, design losses around invariant geometry.

Spectral Norm

The spectral norm is the largest singular value:

$$ |A|_2 = \sigma_1. $$

If the largest singular value is simple:

$$ \nabla_A |A|_2 = u_1 v_1^T. $$

This appears in:

Use Case Purpose
Spectral normalization Stabilize neural networks
Lipschitz constraints Robust optimization
Control theory Gain bounds
Numerical analysis Condition estimation

If the top singular value is repeated, the gradient becomes set-valued.

Nuclear Norm

The nuclear norm is

$$ |A|_* = \sum_i \sigma_i. $$

Its gradient for full-rank matrices with distinct singular values is

$$ \nabla_A |A|_* = UV^T. $$

The nuclear norm is widely used for low-rank regularization.

At rank-deficient points, the derivative becomes non-unique and requires subgradient theory.

Low-Rank Approximation

Suppose

$$ A_k = U_k \Sigma_k V_k^T $$

keeps only the top $k$ singular values.

This operation is discontinuous when:

$$ \sigma_k = \sigma_{k+1}. $$

Near rank transitions, gradients become unstable.

Hard truncation creates nondifferentiable boundaries.

Soft spectral weighting is often preferable:

$$ \tilde{\Sigma}_{ii} = f(\sigma_i), $$

for a smooth function $f$.

Examples:

Function Effect
Soft threshold Shrink singular values
Exponential decay Smooth filtering
Logistic gating Continuous truncation

SVD and PCA

Principal component analysis computes the leading singular vectors of centered data.

Suppose:

$$ X \in \mathbb{R}^{N \times d}. $$

PCA often uses:

$$ X = U\Sigma V^T. $$

Principal directions are columns of $V$.

Differentiating through PCA inherits all singular-vector instability issues.

If principal components swap order under perturbation, gradients become discontinuous.

Using covariance projectors:

$$ VV^T $$

is more stable than depending on individual components.

SVD and Matrix Completion

Recommendation systems often optimize low-rank factors:

$$ A \approx UV^T. $$

Differentiating through an explicit SVD is usually avoided. Instead, systems optimize factors directly.

This avoids:

Problem Cause
Singular-vector instability Spectral degeneracy
Expensive backward pass Full decomposition
Rank-transition discontinuities Hard truncation

Direct factor parameterization is often more stable.

Differentiating Truncated SVD

Many applications compute only the top $k$ singular values and vectors.

This creates additional challenges:

Issue Explanation
Ordering discontinuity Singular values may swap
Rank instability Small modes appear/disappear
Iterative solver truncation Backward graph incomplete
Approximation error Forward and backward mismatch

For iterative methods like Lanczos or power iteration, implicit differentiation may be preferable.

Iterative SVD Methods

Large systems rarely compute full dense SVD.

Instead they use:

Method Use
Power iteration Largest singular value
Lanczos Partial spectrum
Randomized SVD Large sparse matrices
Block Krylov Fast low-rank approximation

AD can:

  1. Differentiate through all iterations.
  2. Use implicit differentiation.

Differentiating through iterations creates long computation graphs and memory costs. Implicit methods are often more scalable.

Complex-Valued SVD

For complex matrices:

$$ A = U\Sigma V^H. $$

The adjoint uses conjugate transpose:

$$ V^H. $$

Complex differentiation requires Wirtinger calculus or equivalent conventions.

Orthogonality becomes unitarity:

$$ U^H U = I. $$

Complex spectral differentiation is even more subtle because phase ambiguity replaces sign ambiguity.

Numerical Stability

SVD backward passes are numerically sensitive.

Common stabilization techniques:

Technique Purpose
Spectral gap regularization Avoid division blowup
Truncated spectra Remove unstable modes
Symmetrization Reduce numerical drift
Soft spectral functions Avoid discontinuous truncation
Projector losses Avoid basis instability

No implementation can fully eliminate instability near repeated singular values because the mathematical derivative itself becomes singular.

Batch SVD

Modern tensor systems support batched SVD:

$$ A : (B,m,n). $$

Each batch element is factorized independently:

$$ A_b = U_b \Sigma_b V_b^T. $$

The backward pass must preserve:

batch axes
matrix axes
spectral ordering
orthogonality constraints

Shape errors in batched spectral code are common and difficult to debug.

Implementation Metadata

An SVD primitive should record:

full vs reduced decomposition
singular value ordering
sign conventions
batch dimensions
real vs complex arithmetic
rank truncation
iterative solver details

The backward rule depends on all of these choices.

Practical Guidance

Prefer objectives based on:

Stable Avoid
Singular values Raw singular vectors
Spectral norms Vector coordinates
Nuclear norms Arbitrary basis orientation
Subspace projectors Sign-sensitive losses
Smooth spectral filters Hard truncation

When singular vectors are unavoidable:

Strategy Benefit
Add spectral gap regularization Improves conditioning
Use projector formulations Removes basis ambiguity
Avoid repeated spectra Reduces instability
Use implicit methods Better scalability

Summary

SVD is one of the most expressive and difficult primitives in automatic differentiation. Singular values have stable first-order perturbations. Singular vectors do not. Their derivatives contain inverse spectral gaps and become undefined at repeated singular values.

The stable geometric object is the singular subspace, not the arbitrary basis chosen to represent it. Robust differentiable systems should therefore formulate objectives in terms of invariant spectral quantities whenever possible.