Singular Value Decomposition

The singular value decomposition (SVD) is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares, low-rank approximation, spectral regularization, recommendation systems, scientific computing, and modern machine learning.

For automatic differentiation, SVD is both powerful and difficult. Singular values are comparatively stable. Singular vectors are not. The derivative formulas expose this instability directly.

Definition

For a matrix

$$ A \in \mathbb{R}^{m \times n}, $$

the singular value decomposition is

$$ A = U\Sigma V^T, $$

where:

$$ U \in \mathbb{R}^{m \times m}, \qquad U^TU = I, $$

$$ V \in \mathbb{R}^{n \times n}, \qquad V^TV = I, $$

and

$$ \Sigma = \operatorname{diag}(\sigma_1,\ldots,\sigma_r), $$

with

$$ \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0. $$

The singular values are the square roots of the eigenvalues of

$$ A^TA $$

$$ AA^T. $$

The columns of $U$ are left singular vectors. The columns of $V$ are right singular vectors.

Most numerical systems use reduced SVD:

$$ U \in \mathbb{R}^{m \times r}, \qquad \Sigma \in \mathbb{R}^{r \times r}, \qquad V \in \mathbb{R}^{n \times r}, $$

where

$$ r = \operatorname{rank}(A). $$

Why SVD Matters

SVD provides optimal low-rank structure.

The best rank-$k$ approximation of $A$ in Frobenius norm is

$$ A_k = U_k \Sigma_k V_k^T. $$

Applications include:

Application	Role
PCA	Principal components
Compression	Low-rank approximation
Recommender systems	Matrix factorization
Scientific computing	Ill-conditioned systems
Optimization	Spectral regularization
Vision	Latent representations
NLP	Embedding reduction
Control theory	Balanced truncation

Because SVD appears inside many differentiable systems, robust AD rules are essential.

Differential of Singular Values

The singular values are comparatively well behaved.

Suppose singular values are distinct. Let

$$ u_i $$

and

$$ v_i $$

be corresponding singular vectors:

$$ Av_i = \sigma_i u_i, $$

$$ A^T u_i = \sigma_i v_i. $$

Differentiate:

$$ dA,v_i + A,dv_i = d\sigma_i,u_i + \sigma_i,du_i. $$

Left-multiply by $u_i^T$:

$$ u_i^T dA,v_i + u_i^T A,dv_i = d\sigma_i + \sigma_i u_i^T du_i. $$

Using

$$ u_i^T A = \sigma_i v_i^T, $$

we obtain

$$ u_i^T A,dv_i = \sigma_i v_i^T dv_i. $$

Because singular vectors remain normalized,

$$ u_i^T du_i = 0, \qquad v_i^T dv_i = 0. $$

Therefore:

$$ d\sigma_i = u_i^T dA,v_i. $$

This gives the gradient:

$$ \nabla_A \sigma_i = u_i v_i^T. $$

This rule is stable even when singular vectors themselves are unstable.

Reverse Rule for Singular Values

Suppose a scalar loss depends only on singular values:

$$ L = f(\sigma_1,\ldots,\sigma_r). $$

Let

$$ \bar{\sigma}_i = \frac{\partial L}{\partial \sigma_i}. $$

Then:

$$ dL = \sum_i \bar{\sigma}_i d\sigma_i. $$

Substitute:

$$ dL = \sum_i \bar{\sigma}_i u_i^T dA,v_i. $$

Rewriting:

$$ dL = \operatorname{tr} \left( U,\operatorname{diag}(\bar{\sigma}),V^T dA^T \right). $$

Therefore:

$$ \bar{A} = U,\operatorname{diag}(\bar{\sigma}),V^T. $$

This is one of the most important spectral gradient formulas.

Singular Vector Derivatives

Singular vectors are much more delicate.

Differentiating them produces terms involving:

$$ \frac{1}{\sigma_i^2 - \sigma_j^2}. $$

If singular values are close, the derivatives become large. If singular values are equal, the derivatives become undefined.

This is analogous to eigenvector differentiation.

The instability arises because singular subspaces are intrinsic, but individual basis vectors are not.

$$ \sigma_i = \sigma_j, $$

then any orthonormal basis of the corresponding singular subspace is valid.

The decomposition is therefore non-unique.

Sign Ambiguity

$$ (u_i, v_i) $$

is a singular vector pair, then

$$ (-u_i, -v_i) $$

represents the same singular mode.

Thus singular vectors have unavoidable sign ambiguity.

Small numerical perturbations may flip signs unpredictably:

$$ u_i \to -u_i, \qquad v_i \to -v_i. $$

The forward computation remains correct, but gradients involving singular vectors may change discontinuously.

Losses depending directly on singular vector coordinates are therefore fragile.

Full Reverse Rule

Suppose the loss depends on both singular values and singular vectors:

$$ L(U,\Sigma,V). $$

Let:

$$ \bar{U} = \frac{\partial L}{\partial U}, \qquad \bar{\Sigma} = \frac{\partial L}{\partial \Sigma}, \qquad \bar{V} = \frac{\partial L}{\partial V}. $$

Define matrix $F$:

$$ F_{ij} = \begin{cases} \frac{1}{\sigma_i^2 - \sigma_j^2}, & i \ne j, \ 0, & i=j. \end{cases} $$

The backward rule contains terms:

$$ F \odot (U^T\bar{U} - \bar{U}^T U), $$

and similarly for $V$.

The exact expression is complicated, but the important structure is simple:

Singular-value gradients are stable.
Singular-vector gradients contain inverse spectral gaps.
Repeated singular values create undefined derivatives.

Geometric Interpretation

SVD identifies orthogonal subspaces.

The stable object is the subspace projector:

$$ UU^T, \qquad VV^T. $$

The basis vectors themselves are arbitrary coordinates inside the subspace.

This distinction is critical.

Good objectives depend on:

Stable Quantity	Unstable Quantity
Singular values	Singular vector signs
Subspace projectors	Individual basis vectors
Low-rank reconstructions	Raw singular coordinates
Spectral norms	Ordered vector identities

When possible, design losses around invariant geometry.

Spectral Norm

The spectral norm is the largest singular value:

$$ |A|_2 = \sigma_1. $$

If the largest singular value is simple:

$$ \nabla_A |A|_2 = u_1 v_1^T. $$

This appears in:

Use Case	Purpose
Spectral normalization	Stabilize neural networks
Lipschitz constraints	Robust optimization
Control theory	Gain bounds
Numerical analysis	Condition estimation

If the top singular value is repeated, the gradient becomes set-valued.

Nuclear Norm

The nuclear norm is

$$ |A|_* = \sum_i \sigma_i. $$

Its gradient for full-rank matrices with distinct singular values is

$$ \nabla_A |A|_* = UV^T. $$

The nuclear norm is widely used for low-rank regularization.

At rank-deficient points, the derivative becomes non-unique and requires subgradient theory.

Low-Rank Approximation

Suppose

$$ A_k = U_k \Sigma_k V_k^T $$

keeps only the top $k$ singular values.

This operation is discontinuous when:

$$ \sigma_k = \sigma_{k+1}. $$

Near rank transitions, gradients become unstable.

Hard truncation creates nondifferentiable boundaries.

Soft spectral weighting is often preferable:

$$ \tilde{\Sigma}_{ii} = f(\sigma_i), $$

for a smooth function $f$.

Examples:

Function	Effect
Soft threshold	Shrink singular values
Exponential decay	Smooth filtering
Logistic gating	Continuous truncation

SVD and PCA

Principal component analysis computes the leading singular vectors of centered data.

Suppose:

$$ X \in \mathbb{R}^{N \times d}. $$

PCA often uses:

$$ X = U\Sigma V^T. $$

Principal directions are columns of $V$.

Differentiating through PCA inherits all singular-vector instability issues.

If principal components swap order under perturbation, gradients become discontinuous.

Using covariance projectors:

$$ VV^T $$

is more stable than depending on individual components.

SVD and Matrix Completion

Recommendation systems often optimize low-rank factors:

$$ A \approx UV^T. $$

Differentiating through an explicit SVD is usually avoided. Instead, systems optimize factors directly.

This avoids:

Problem	Cause
Singular-vector instability	Spectral degeneracy
Expensive backward pass	Full decomposition
Rank-transition discontinuities	Hard truncation

Direct factor parameterization is often more stable.

Differentiating Truncated SVD

Many applications compute only the top $k$ singular values and vectors.

This creates additional challenges:

Issue	Explanation
Ordering discontinuity	Singular values may swap
Rank instability	Small modes appear/disappear
Iterative solver truncation	Backward graph incomplete
Approximation error	Forward and backward mismatch

For iterative methods like Lanczos or power iteration, implicit differentiation may be preferable.

Iterative SVD Methods

Large systems rarely compute full dense SVD.

Instead they use:

Method	Use
Power iteration	Largest singular value
Lanczos	Partial spectrum
Randomized SVD	Large sparse matrices
Block Krylov	Fast low-rank approximation

AD can:

Differentiate through all iterations.
Use implicit differentiation.

Differentiating through iterations creates long computation graphs and memory costs. Implicit methods are often more scalable.

Complex-Valued SVD

For complex matrices:

$$ A = U\Sigma V^H. $$

The adjoint uses conjugate transpose:

$$ V^H. $$

Complex differentiation requires Wirtinger calculus or equivalent conventions.

Orthogonality becomes unitarity:

$$ U^H U = I. $$

Complex spectral differentiation is even more subtle because phase ambiguity replaces sign ambiguity.

Numerical Stability

SVD backward passes are numerically sensitive.

Common stabilization techniques:

Technique	Purpose
Spectral gap regularization	Avoid division blowup
Truncated spectra	Remove unstable modes
Symmetrization	Reduce numerical drift
Soft spectral functions	Avoid discontinuous truncation
Projector losses	Avoid basis instability

No implementation can fully eliminate instability near repeated singular values because the mathematical derivative itself becomes singular.

Batch SVD

Modern tensor systems support batched SVD:

$$ A : (B,m,n). $$

Each batch element is factorized independently:

$$ A_b = U_b \Sigma_b V_b^T. $$

The backward pass must preserve:

batch axes
matrix axes
spectral ordering
orthogonality constraints

Shape errors in batched spectral code are common and difficult to debug.

Implementation Metadata

An SVD primitive should record:

full vs reduced decomposition
singular value ordering
sign conventions
batch dimensions
real vs complex arithmetic
rank truncation
iterative solver details

The backward rule depends on all of these choices.

Practical Guidance

Prefer objectives based on:

Stable	Avoid
Singular values	Raw singular vectors
Spectral norms	Vector coordinates
Nuclear norms	Arbitrary basis orientation
Subspace projectors	Sign-sensitive losses
Smooth spectral filters	Hard truncation

When singular vectors are unavoidable:

Strategy	Benefit
Add spectral gap regularization	Improves conditioning
Use projector formulations	Removes basis ambiguity
Avoid repeated spectra	Reduces instability
Use implicit methods	Better scalability

Summary

SVD is one of the most expressive and difficult primitives in automatic differentiation. Singular values have stable first-order perturbations. Singular vectors do not. Their derivatives contain inverse spectral gaps and become undefined at repeated singular values.

The stable geometric object is the singular subspace, not the arbitrary basis chosen to represent it. Robust differentiable systems should therefore formulate objectives in terms of invariant spectral quantities whenever possible.