Chapter 88. Iterative Methods for Linear Systems

Iterative methods solve a linear system by producing a sequence of approximate solutions. Instead of factoring the matrix once and solving the system directly, an iterative method starts from an initial guess and repeatedly improves it.

The target problem is

$$ Ax=b. $$

A direct method, such as Gaussian elimination, tries to compute the solution in a finite sequence of algebraic steps. An iterative method constructs vectors

$$ x^{(0)}, x^{(1)}, x^{(2)}, \ldots $$

with the goal that

$$ x^{(k)} \to x $$

as (k) increases.

Iterative methods are especially important for large sparse systems, where direct factorization may require too much memory or computation. Large sparse systems arise in discretized differential equations, graph problems, optimization, inverse problems, simulation, and machine learning. Standard classifications distinguish stationary methods such as Jacobi and Gauss-Seidel from nonstationary Krylov methods such as conjugate gradient and GMRES.

88.1 Direct Methods and Iterative Methods

A direct method attempts to solve

$$ Ax=b $$

by transforming the system into an equivalent system that can be solved exactly in exact arithmetic.

Examples include:

Method	Main idea
Gaussian elimination	eliminate variables
LU factorization	factor (A) into triangular matrices
Cholesky factorization	factor symmetric positive definite matrices
QR factorization	use orthogonal transformations

An iterative method instead computes a sequence of approximations. It may stop before the exact solution is reached, once the residual or error is small enough.

Method type	Strength	Weakness
Direct	predictable, robust for moderate dense systems	expensive for large sparse systems
Iterative	memory efficient, often fast for sparse systems	convergence depends on matrix structure

For large sparse matrices, iterative methods are often the only practical choice.

88.2 The Residual

Given an approximate solution (x^{(k)}), the residual is

$$ r^{(k)} = b - Ax^{(k)}. $$

The residual measures how much the current approximation fails to satisfy the equation.

$$ r^{(k)} = 0, $$

then

$$ Ax^{(k)} = b, $$

so (x^{(k)}) is an exact solution.

Most iterative methods use the residual to decide how to correct the current approximation.

88.3 The Error

Let (x) be the exact solution, and let

$$ e^{(k)} = x - x^{(k)} $$

be the error at iteration (k).

Because

$$ Ax=b, $$

we have

$$ A e^{(k)} = A(x-x^{(k)}) = b-Ax^{(k)} = r^{(k)}. $$

Thus

$$ A e^{(k)} = r^{(k)}. $$

The residual is the image of the error under (A). A small residual implies a small error only when the matrix is well-conditioned.

88.4 General Fixed-Point Iteration

Many iterative methods can be written as fixed-point iterations.

Starting from

$$ Ax=b, $$

we rewrite the equation in the form

$$ x = Gx + c. $$

Then define

$$ x^{(k+1)} = Gx^{(k)} + c. $$

The matrix (G) is called the iteration matrix.

If this iteration converges, its limit (x) satisfies

$$ x = Gx+c, $$

which is equivalent to the original system.

88.5 Convergence of Fixed-Point Iteration

Let (x) be the exact solution. Subtracting

$$ x = Gx+c $$

from

$$ x^{(k+1)} = Gx^{(k)}+c $$

gives

$$ e^{(k+1)} = G e^{(k)}. $$

Therefore,

$$ e^{(k)} = G^k e^{(0)}. $$

The iteration converges for every starting vector if and only if

$$ \rho(G)<1, $$

where (\rho(G)) is the spectral radius of (G), the largest absolute value of its eigenvalues.

This criterion is central for stationary iterative methods.

88.6 Matrix Splitting

A common construction uses a splitting of the matrix

$$ A = M - N, $$

where (M) is easy to invert.

Then

$$ Ax=b $$

becomes

$$ (M-N)x=b, $$

$$ Mx = Nx+b. $$

This gives the iteration

$$ Mx^{(k+1)} = Nx^{(k)} + b. $$

Equivalently,

$$ x^{(k+1)} = M^{-1}N x^{(k)} + M^{-1}b. $$

Thus the iteration matrix is

$$ G = M^{-1}N. $$

The quality of the method depends on choosing (M) so that solving systems with (M) is cheap and (M^{-1}N) has favorable convergence behavior.

88.7 Diagonal, Lower, and Upper Parts

Write

$$ A = D + L + U, $$

where:

Symbol	Meaning
(D)	diagonal part of (A)
(L)	strictly lower triangular part
(U)	strictly upper triangular part

Different choices of (M) produce different classical methods.

Method	Choice of (M)
Jacobi	(D)
Gauss-Seidel	(D+L)
SOR	(\frac{1}{\omega}D+L)

These methods are called stationary because the same iteration matrix is used at every step.

88.8 Jacobi Method

The Jacobi method solves each equation for its diagonal variable using values from the previous iteration.

For the system

$$ Ax=b, $$

with entries (a_{ij}), the update is

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i-\sum_{j\ne i}a_{ij}x_j^{(k)} \right). $$

The Jacobi method corresponds to the splitting

$$ A = D - (-(L+U)). $$

Its iteration matrix is

$$ G_J = -D^{-1}(L+U). $$

Jacobi is simple and parallelizable because each component of (x^{(k+1)}) uses only values from the previous iteration.

Its convergence may be slow.

88.9 Gauss-Seidel Method

The Gauss-Seidel method updates components sequentially and immediately uses the newest available values.

The update is

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j<i}a_{ij}x_j^{(k+1)} - \sum_{j>i}a_{ij}x_j^{(k)} \right). $$

The method corresponds to the splitting

$$ M=D+L. $$

Its iteration matrix is

$$ G_{GS}=-(D+L)^{-1}U. $$

Gauss-Seidel often converges faster than Jacobi because it uses updated values as soon as they are available, though it remains relatively slow compared with modern Krylov methods.

88.10 Successive Over-Relaxation

Successive over-relaxation, or SOR, modifies Gauss-Seidel by introducing a relaxation parameter (\omega).

The update has the form

$$ x_i^{(k+1)} = (1-\omega)x_i^{(k)} + \frac{\omega}{a_{ii}} \left( b_i - \sum_{j<i}a_{ij}x_j^{(k+1)} - \sum_{j>i}a_{ij}x_j^{(k)} \right). $$

The parameter (\omega) controls the step size.

Range	Interpretation
(0<\omega<1)	under-relaxation
(\omega=1)	Gauss-Seidel
(1<\omega<2)	over-relaxation

When (\omega) is chosen well, SOR may converge much faster than Gauss-Seidel. A poor choice can slow convergence or cause divergence.

88.11 Convergence Conditions

Convergence depends on the matrix and the method.

Some useful sufficient conditions are:

Matrix property	Consequence
Strict diagonal dominance	Jacobi and Gauss-Seidel often converge
Symmetric positive definiteness	Gauss-Seidel converges
Spectral radius less than one	fixed-point iteration converges
Poor conditioning	convergence may be slow

Strict diagonal dominance means

$$ |a_{ii}|

\sum_{j\ne i}|a_{ij}| $$

for each row.

This condition says that each diagonal entry dominates the off-diagonal entries in its row.

88.12 Stopping Criteria

An iterative method must decide when to stop.

Common stopping criteria include:

Criterion	Form
Absolute residual	(\|r^{(k)}\| \le \tau)
Relative residual	(\frac{\|r^{(k)}\|}{\|b\|} \le \tau)
Step size	(\|x^{(k+1)}-x^{(k)}\| \le \tau)
Maximum iterations	(k \le k_{\max})

The relative residual is commonly used because it accounts for the scale of (b).

However, residual-based stopping must be interpreted carefully. If (A) is ill-conditioned, a small residual may still correspond to a large solution error.

88.13 Stationary and Nonstationary Methods

Stationary methods use the same update rule at every iteration.

Examples:

Stationary method	Feature
Jacobi	parallel, simple
Gauss-Seidel	sequential, often faster
SOR	relaxation parameter
SSOR	symmetric form useful as preconditioner

Nonstationary methods adapt the iteration using information from previous steps.

Examples:

Nonstationary method	Typical use
Conjugate gradient	symmetric positive definite systems
MINRES	symmetric indefinite systems
GMRES	nonsymmetric systems
BiCGSTAB	nonsymmetric systems
Chebyshev iteration	spectrum approximately known

Modern large-scale solvers usually rely on nonstationary methods, especially Krylov subspace methods.

88.14 Krylov Subspaces

Krylov methods search for approximations in spaces generated by repeated multiplication by (A).

Given an initial residual

$$ r^{(0)} = b-Ax^{(0)}, $$

the (k)-th Krylov subspace is

$$ \mathcal{K}_k(A,r^{(0)}) = \operatorname{span} {r^{(0)}, Ar^{(0)}, A^2r^{(0)}, \ldots, A^{k-1}r^{(0)}}. $$

A Krylov method chooses (x^{(k)}) from

$$ x^{(0)}+\mathcal{K}_k(A,r^{(0)}). $$

This approach is powerful because it uses only matrix-vector products.

For sparse matrices, matrix-vector products are cheap compared with factorization.

88.15 Conjugate Gradient Method

The conjugate gradient method solves

$$ Ax=b $$

when (A) is symmetric positive definite.

It constructs approximations in Krylov subspaces and minimizes the error in the energy norm.

The method uses three main sequences:

Sequence	Meaning
(x^{(k)})	approximate solution
(r^{(k)})	residual
(p^{(k)})	search direction

The search directions are (A)-conjugate:

$$ (p^{(i)})^T A p^{(j)} = 0 $$

for (i\ne j).

In exact arithmetic, conjugate gradient reaches the exact solution in at most (n) iterations for an (n \times n) system. In floating point arithmetic, it is used as an iterative method and often stops much earlier when the residual is sufficiently small.

88.16 Basic Conjugate Gradient Algorithm

Given (x^{(0)}), define

$$ r^{(0)} = b-Ax^{(0)}, \qquad p^{(0)} = r^{(0)}. $$

For (k=0,1,2,\ldots):

$$ \alpha_k = \frac{(r^{(k)})^Tr^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

$$ x^{(k+1)} = x^{(k)}+\alpha_k p^{(k)}. $$

$$ r^{(k+1)} = r^{(k)}-\alpha_k A p^{(k)}. $$

$$ \beta_k = \frac{(r^{(k+1)})^Tr^{(k+1)}}{(r^{(k)})^Tr^{(k)}}. $$

$$ p^{(k+1)} = r^{(k+1)}+\beta_k p^{(k)}. $$

The method requires one matrix-vector product per iteration.

This makes it attractive for large sparse symmetric positive definite systems.

88.17 Convergence of Conjugate Gradient

The convergence of conjugate gradient depends on the eigenvalues of (A).

For symmetric positive definite (A), a standard estimate is

$$ \frac{|x-x^{(k)}|_A}{|x-x^{(0)}|_A} \le 2 \left( \frac{\sqrt{\kappa_2(A)}-1} {\sqrt{\kappa_2(A)}+1} \right)^k. $$

Here

$$ |v|_A = \sqrt{v^TAv}. $$

The estimate shows that convergence slows as the condition number grows.

Eigenvalue clustering can make convergence much faster than the worst-case bound suggests.

88.18 GMRES

GMRES stands for generalized minimal residual method.

It is designed for nonsymmetric systems.

At step (k), GMRES chooses an approximation in the Krylov subspace that minimizes the residual norm:

$$ |b-Ax^{(k)}|_2. $$

GMRES is robust, but it stores an increasing number of basis vectors as (k) grows. To control memory, practical implementations often use restarted GMRES.

Restarting reduces memory cost, but it may slow convergence.

88.19 Preconditioning

Preconditioning transforms the system into an equivalent system that is easier for an iterative method to solve.

Instead of solving

$$ Ax=b, $$

we choose a matrix (P) that approximates (A) but is easier to invert.

A left-preconditioned system is

$$ P^{-1}Ax=P^{-1}b. $$

A good preconditioner makes

$$ P^{-1}A $$

better conditioned or more favorable spectrally.

Preconditioning is often the difference between a method that converges quickly and one that does not converge in useful time.

88.20 Examples of Preconditioners

Common preconditioners include:

Preconditioner	Main idea
Jacobi	use diagonal of (A)
SSOR	use symmetric relaxation
Incomplete LU	approximate LU factorization
Incomplete Cholesky	approximate Cholesky factorization
Multigrid	reduce errors across scales
Domain decomposition	split problem into subdomains

The best preconditioner depends on matrix structure.

There is no universally optimal choice.

88.21 Sparse Matrices

Iterative methods are closely tied to sparse matrices.

A sparse matrix has mostly zero entries. It is stored by recording only the nonzero entries and their locations.

The cost of one matrix-vector product is proportional to the number of nonzero entries:

$$ O(\operatorname{nnz}(A)). $$

This is much cheaper than dense multiplication when

$$ \operatorname{nnz}(A) \ll n^2. $$

For this reason, iterative methods can solve systems far larger than direct dense methods.

88.22 Matrix-Free Methods

Some applications do not form (A) explicitly.

Instead, they provide a function that computes

$$ v \mapsto Av. $$

Such methods are called matrix-free.

Krylov methods are well suited to matrix-free computation because they mainly require matrix-vector products.

This is important in:

Application	Reason
PDE solvers	matrix may be too large to assemble
optimization	Hessian-vector products are available
inverse problems	operator defined by simulation
machine learning	Jacobian or Hessian products computed implicitly

Matrix-free methods reduce storage cost and allow very large-scale computation.

88.23 Scaling and Normalization

Scaling can improve iterative convergence.

Poorly scaled systems may have entries with vastly different magnitudes. This can slow convergence and worsen numerical behavior.

Common scaling techniques include:

Technique	Goal
Row scaling	balance equations
Column scaling	balance variables
Diagonal equilibration	reduce magnitude disparities
Symmetric scaling	preserve symmetry

Scaling is often applied before preconditioning.

88.24 Parallelism

Iterative methods are often suitable for parallel computation.

Matrix-vector products can be parallelized across rows.

Vector updates can be parallelized componentwise.

Dot products require global reductions, which may become bottlenecks in distributed systems.

Operation	Parallel behavior
Sparse matrix-vector product	good locality if partitioned well
Vector addition	highly parallel
Dot product	requires reduction
Preconditioner application	depends on preconditioner

Designing scalable iterative solvers requires attention to communication cost, not only arithmetic cost.

88.25 Choosing an Iterative Method

A practical choice depends on matrix properties.

Matrix type	Common method
Symmetric positive definite	Conjugate gradient
Symmetric indefinite	MINRES
Nonsymmetric	GMRES, BiCGSTAB
Very structured PDE matrix	multigrid
Ill-conditioned sparse matrix	Krylov method with preconditioning
Small dense matrix	direct factorization

Before choosing a method, one should ask:

Question	Why it matters
Is (A) symmetric?	determines eligible methods
Is (A) positive definite?	enables conjugate gradient
Is (A) sparse?	favors iterative methods
Is a good preconditioner available?	controls convergence
How accurate must the result be?	determines stopping tolerance

88.26 Failure Modes

Iterative methods may fail or perform poorly for several reasons.

Failure mode	Cause
Divergence	iteration matrix spectral radius too large
Stagnation	residual stops decreasing
Slow convergence	poor conditioning or unfavorable spectrum
Loss of orthogonality	floating point error in Krylov basis
Bad preconditioner	transformed system remains difficult
Poor stopping rule	iteration stops too early or too late

Failure should be diagnosed through residual history, conditioning estimates, and knowledge of matrix structure.

88.27 Residual History

A residual history records values such as

$$ |r^{(k)}| $$

over iterations.

It helps identify the behavior of the method.

Pattern	Interpretation
Steady decrease	normal convergence
Rapid early decrease, then stagnation	limited precision or poor preconditioner
Oscillation	nonsymmetric or indefinite behavior
Growth	divergence or instability
Flat residual	no useful progress

Residual history is one of the main diagnostic tools for iterative solvers.

88.28 Summary

Iterative methods solve linear systems by improving an approximate solution step by step.

They are essential for large sparse systems because they avoid dense factorization and often require only matrix-vector products.

The central objects are:

Concept	Meaning
Residual	(r^{(k)}=b-Ax^{(k)})
Error	(e^{(k)}=x-x^{(k)})
Iteration matrix	controls stationary convergence
Spectral radius	determines fixed-point convergence
Krylov subspace	search space for modern methods
Preconditioner	transforms system for faster convergence

Classical stationary methods include Jacobi, Gauss-Seidel, and SOR. Modern large-scale methods include conjugate gradient, GMRES, MINRES, and BiCGSTAB. In practice, preconditioning is often the decisive ingredient.