Chapter 88. Iterative Methods for Linear Systems

Chapter 88. Iterative Methods for Linear Systems

Iterative methods solve a linear system by producing a sequence of approximate solutions. Instead of factoring the matrix once and solving the system directly, an iterative method starts from an initial guess and repeatedly improves it.

The target problem is

$$ Ax=b. $$

A direct method, such as Gaussian elimination, tries to compute the solution in a finite sequence of algebraic steps. An iterative method constructs vectors

$$ x^{(0)}, x^{(1)}, x^{(2)}, \ldots $$

with the goal that

$$ x^{(k)} \to x $$

as (k) increases.

Iterative methods are especially important for large sparse systems, where direct factorization may require too much memory or computation. Large sparse systems arise in discretized differential equations, graph problems, optimization, inverse problems, simulation, and machine learning. Standard classifications distinguish stationary methods such as Jacobi and Gauss-Seidel from nonstationary Krylov methods such as conjugate gradient and GMRES.

88.1 Direct Methods and Iterative Methods

A direct method attempts to solve

$$ Ax=b $$

by transforming the system into an equivalent system that can be solved exactly in exact arithmetic.

Examples include:

Method Main idea
Gaussian elimination eliminate variables
LU factorization factor (A) into triangular matrices
Cholesky factorization factor symmetric positive definite matrices
QR factorization use orthogonal transformations

An iterative method instead computes a sequence of approximations. It may stop before the exact solution is reached, once the residual or error is small enough.

Method type Strength Weakness
Direct predictable, robust for moderate dense systems expensive for large sparse systems
Iterative memory efficient, often fast for sparse systems convergence depends on matrix structure

For large sparse matrices, iterative methods are often the only practical choice.

88.2 The Residual

Given an approximate solution (x^{(k)}), the residual is

$$ r^{(k)} = b - Ax^{(k)}. $$

The residual measures how much the current approximation fails to satisfy the equation.

If

$$ r^{(k)} = 0, $$

then

$$ Ax^{(k)} = b, $$

so (x^{(k)}) is an exact solution.

Most iterative methods use the residual to decide how to correct the current approximation.

88.3 The Error

Let (x) be the exact solution, and let

$$ e^{(k)} = x - x^{(k)} $$

be the error at iteration (k).

Because

$$ Ax=b, $$

we have

$$ A e^{(k)} = A(x-x^{(k)}) = b-Ax^{(k)} = r^{(k)}. $$

Thus

$$ A e^{(k)} = r^{(k)}. $$

The residual is the image of the error under (A). A small residual implies a small error only when the matrix is well-conditioned.

88.4 General Fixed-Point Iteration

Many iterative methods can be written as fixed-point iterations.

Starting from

$$ Ax=b, $$

we rewrite the equation in the form

$$ x = Gx + c. $$

Then define

$$ x^{(k+1)} = Gx^{(k)} + c. $$

The matrix (G) is called the iteration matrix.

If this iteration converges, its limit (x) satisfies

$$ x = Gx+c, $$

which is equivalent to the original system.

88.5 Convergence of Fixed-Point Iteration

Let (x) be the exact solution. Subtracting

$$ x = Gx+c $$

from

$$ x^{(k+1)} = Gx^{(k)}+c $$

gives

$$ e^{(k+1)} = G e^{(k)}. $$

Therefore,

$$ e^{(k)} = G^k e^{(0)}. $$

The iteration converges for every starting vector if and only if

$$ \rho(G)<1, $$

where (\rho(G)) is the spectral radius of (G), the largest absolute value of its eigenvalues.

This criterion is central for stationary iterative methods.

88.6 Matrix Splitting

A common construction uses a splitting of the matrix

$$ A = M - N, $$

where (M) is easy to invert.

Then

$$ Ax=b $$

becomes

$$ (M-N)x=b, $$

so

$$ Mx = Nx+b. $$

This gives the iteration

$$ Mx^{(k+1)} = Nx^{(k)} + b. $$

Equivalently,

$$ x^{(k+1)} = M^{-1}N x^{(k)} + M^{-1}b. $$

Thus the iteration matrix is

$$ G = M^{-1}N. $$

The quality of the method depends on choosing (M) so that solving systems with (M) is cheap and (M^{-1}N) has favorable convergence behavior.

88.7 Diagonal, Lower, and Upper Parts

Write

$$ A = D + L + U, $$

where:

Symbol Meaning
(D) diagonal part of (A)
(L) strictly lower triangular part
(U) strictly upper triangular part

Different choices of (M) produce different classical methods.

Method Choice of (M)
Jacobi (D)
Gauss-Seidel (D+L)
SOR (\frac{1}{\omega}D+L)

These methods are called stationary because the same iteration matrix is used at every step.

88.8 Jacobi Method

The Jacobi method solves each equation for its diagonal variable using values from the previous iteration.

For the system

$$ Ax=b, $$

with entries (a_{ij}), the update is

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i-\sum_{j\ne i}a_{ij}x_j^{(k)} \right). $$

The Jacobi method corresponds to the splitting

$$ A = D - (-(L+U)). $$

Its iteration matrix is

$$ G_J = -D^{-1}(L+U). $$

Jacobi is simple and parallelizable because each component of (x^{(k+1)}) uses only values from the previous iteration.

Its convergence may be slow.

88.9 Gauss-Seidel Method

The Gauss-Seidel method updates components sequentially and immediately uses the newest available values.

The update is

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j<i}a_{ij}x_j^{(k+1)} - \sum_{j>i}a_{ij}x_j^{(k)} \right). $$

The method corresponds to the splitting

$$ M=D+L. $$

Its iteration matrix is

$$ G_{GS}=-(D+L)^{-1}U. $$

Gauss-Seidel often converges faster than Jacobi because it uses updated values as soon as they are available, though it remains relatively slow compared with modern Krylov methods.

88.10 Successive Over-Relaxation

Successive over-relaxation, or SOR, modifies Gauss-Seidel by introducing a relaxation parameter (\omega).

The update has the form

$$ x_i^{(k+1)} = (1-\omega)x_i^{(k)} + \frac{\omega}{a_{ii}} \left( b_i - \sum_{j<i}a_{ij}x_j^{(k+1)} - \sum_{j>i}a_{ij}x_j^{(k)} \right). $$

The parameter (\omega) controls the step size.

Range Interpretation
(0<\omega<1) under-relaxation
(\omega=1) Gauss-Seidel
(1<\omega<2) over-relaxation

When (\omega) is chosen well, SOR may converge much faster than Gauss-Seidel. A poor choice can slow convergence or cause divergence.

88.11 Convergence Conditions

Convergence depends on the matrix and the method.

Some useful sufficient conditions are:

Matrix property Consequence
Strict diagonal dominance Jacobi and Gauss-Seidel often converge
Symmetric positive definiteness Gauss-Seidel converges
Spectral radius less than one fixed-point iteration converges
Poor conditioning convergence may be slow

Strict diagonal dominance means

$$ |a_{ii}|

\sum_{j\ne i}|a_{ij}| $$

for each row.

This condition says that each diagonal entry dominates the off-diagonal entries in its row.

88.12 Stopping Criteria

An iterative method must decide when to stop.

Common stopping criteria include:

Criterion Form
Absolute residual (|r^{(k)}| \le \tau)
Relative residual (\frac{|r^{(k)}|}{|b|} \le \tau)
Step size (|x^{(k+1)}-x^{(k)}| \le \tau)
Maximum iterations (k \le k_{\max})

The relative residual is commonly used because it accounts for the scale of (b).

However, residual-based stopping must be interpreted carefully. If (A) is ill-conditioned, a small residual may still correspond to a large solution error.

88.13 Stationary and Nonstationary Methods

Stationary methods use the same update rule at every iteration.

Examples:

Stationary method Feature
Jacobi parallel, simple
Gauss-Seidel sequential, often faster
SOR relaxation parameter
SSOR symmetric form useful as preconditioner

Nonstationary methods adapt the iteration using information from previous steps.

Examples:

Nonstationary method Typical use
Conjugate gradient symmetric positive definite systems
MINRES symmetric indefinite systems
GMRES nonsymmetric systems
BiCGSTAB nonsymmetric systems
Chebyshev iteration spectrum approximately known

Modern large-scale solvers usually rely on nonstationary methods, especially Krylov subspace methods.

88.14 Krylov Subspaces

Krylov methods search for approximations in spaces generated by repeated multiplication by (A).

Given an initial residual

$$ r^{(0)} = b-Ax^{(0)}, $$

the (k)-th Krylov subspace is

$$ \mathcal{K}_k(A,r^{(0)}) = \operatorname{span} {r^{(0)}, Ar^{(0)}, A^2r^{(0)}, \ldots, A^{k-1}r^{(0)}}. $$

A Krylov method chooses (x^{(k)}) from

$$ x^{(0)}+\mathcal{K}_k(A,r^{(0)}). $$

This approach is powerful because it uses only matrix-vector products.

For sparse matrices, matrix-vector products are cheap compared with factorization.

88.15 Conjugate Gradient Method

The conjugate gradient method solves

$$ Ax=b $$

when (A) is symmetric positive definite.

It constructs approximations in Krylov subspaces and minimizes the error in the energy norm.

The method uses three main sequences:

Sequence Meaning
(x^{(k)}) approximate solution
(r^{(k)}) residual
(p^{(k)}) search direction

The search directions are (A)-conjugate:

$$ (p^{(i)})^T A p^{(j)} = 0 $$

for (i\ne j).

In exact arithmetic, conjugate gradient reaches the exact solution in at most (n) iterations for an (n \times n) system. In floating point arithmetic, it is used as an iterative method and often stops much earlier when the residual is sufficiently small.

88.16 Basic Conjugate Gradient Algorithm

Given (x^{(0)}), define

$$ r^{(0)} = b-Ax^{(0)}, \qquad p^{(0)} = r^{(0)}. $$

For (k=0,1,2,\ldots):

$$ \alpha_k = \frac{(r^{(k)})^Tr^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

$$ x^{(k+1)} = x^{(k)}+\alpha_k p^{(k)}. $$

$$ r^{(k+1)} = r^{(k)}-\alpha_k A p^{(k)}. $$

$$ \beta_k = \frac{(r^{(k+1)})^Tr^{(k+1)}}{(r^{(k)})^Tr^{(k)}}. $$

$$ p^{(k+1)} = r^{(k+1)}+\beta_k p^{(k)}. $$

The method requires one matrix-vector product per iteration.

This makes it attractive for large sparse symmetric positive definite systems.

88.17 Convergence of Conjugate Gradient

The convergence of conjugate gradient depends on the eigenvalues of (A).

For symmetric positive definite (A), a standard estimate is

$$ \frac{|x-x^{(k)}|_A}{|x-x^{(0)}|_A} \le 2 \left( \frac{\sqrt{\kappa_2(A)}-1} {\sqrt{\kappa_2(A)}+1} \right)^k. $$

Here

$$ |v|_A = \sqrt{v^TAv}. $$

The estimate shows that convergence slows as the condition number grows.

Eigenvalue clustering can make convergence much faster than the worst-case bound suggests.

88.18 GMRES

GMRES stands for generalized minimal residual method.

It is designed for nonsymmetric systems.

At step (k), GMRES chooses an approximation in the Krylov subspace that minimizes the residual norm:

$$ |b-Ax^{(k)}|_2. $$

GMRES is robust, but it stores an increasing number of basis vectors as (k) grows. To control memory, practical implementations often use restarted GMRES.

Restarting reduces memory cost, but it may slow convergence.

88.19 Preconditioning

Preconditioning transforms the system into an equivalent system that is easier for an iterative method to solve.

Instead of solving

$$ Ax=b, $$

we choose a matrix (P) that approximates (A) but is easier to invert.

A left-preconditioned system is

$$ P^{-1}Ax=P^{-1}b. $$

A good preconditioner makes

$$ P^{-1}A $$

better conditioned or more favorable spectrally.

Preconditioning is often the difference between a method that converges quickly and one that does not converge in useful time.

88.20 Examples of Preconditioners

Common preconditioners include:

Preconditioner Main idea
Jacobi use diagonal of (A)
SSOR use symmetric relaxation
Incomplete LU approximate LU factorization
Incomplete Cholesky approximate Cholesky factorization
Multigrid reduce errors across scales
Domain decomposition split problem into subdomains

The best preconditioner depends on matrix structure.

There is no universally optimal choice.

88.21 Sparse Matrices

Iterative methods are closely tied to sparse matrices.

A sparse matrix has mostly zero entries. It is stored by recording only the nonzero entries and their locations.

The cost of one matrix-vector product is proportional to the number of nonzero entries:

$$ O(\operatorname{nnz}(A)). $$

This is much cheaper than dense multiplication when

$$ \operatorname{nnz}(A) \ll n^2. $$

For this reason, iterative methods can solve systems far larger than direct dense methods.

88.22 Matrix-Free Methods

Some applications do not form (A) explicitly.

Instead, they provide a function that computes

$$ v \mapsto Av. $$

Such methods are called matrix-free.

Krylov methods are well suited to matrix-free computation because they mainly require matrix-vector products.

This is important in:

Application Reason
PDE solvers matrix may be too large to assemble
optimization Hessian-vector products are available
inverse problems operator defined by simulation
machine learning Jacobian or Hessian products computed implicitly

Matrix-free methods reduce storage cost and allow very large-scale computation.

88.23 Scaling and Normalization

Scaling can improve iterative convergence.

Poorly scaled systems may have entries with vastly different magnitudes. This can slow convergence and worsen numerical behavior.

Common scaling techniques include:

Technique Goal
Row scaling balance equations
Column scaling balance variables
Diagonal equilibration reduce magnitude disparities
Symmetric scaling preserve symmetry

Scaling is often applied before preconditioning.

88.24 Parallelism

Iterative methods are often suitable for parallel computation.

Matrix-vector products can be parallelized across rows.

Vector updates can be parallelized componentwise.

Dot products require global reductions, which may become bottlenecks in distributed systems.

Operation Parallel behavior
Sparse matrix-vector product good locality if partitioned well
Vector addition highly parallel
Dot product requires reduction
Preconditioner application depends on preconditioner

Designing scalable iterative solvers requires attention to communication cost, not only arithmetic cost.

88.25 Choosing an Iterative Method

A practical choice depends on matrix properties.

Matrix type Common method
Symmetric positive definite Conjugate gradient
Symmetric indefinite MINRES
Nonsymmetric GMRES, BiCGSTAB
Very structured PDE matrix multigrid
Ill-conditioned sparse matrix Krylov method with preconditioning
Small dense matrix direct factorization

Before choosing a method, one should ask:

Question Why it matters
Is (A) symmetric? determines eligible methods
Is (A) positive definite? enables conjugate gradient
Is (A) sparse? favors iterative methods
Is a good preconditioner available? controls convergence
How accurate must the result be? determines stopping tolerance

88.26 Failure Modes

Iterative methods may fail or perform poorly for several reasons.

Failure mode Cause
Divergence iteration matrix spectral radius too large
Stagnation residual stops decreasing
Slow convergence poor conditioning or unfavorable spectrum
Loss of orthogonality floating point error in Krylov basis
Bad preconditioner transformed system remains difficult
Poor stopping rule iteration stops too early or too late

Failure should be diagnosed through residual history, conditioning estimates, and knowledge of matrix structure.

88.27 Residual History

A residual history records values such as

$$ |r^{(k)}| $$

over iterations.

It helps identify the behavior of the method.

Pattern Interpretation
Steady decrease normal convergence
Rapid early decrease, then stagnation limited precision or poor preconditioner
Oscillation nonsymmetric or indefinite behavior
Growth divergence or instability
Flat residual no useful progress

Residual history is one of the main diagnostic tools for iterative solvers.

88.28 Summary

Iterative methods solve linear systems by improving an approximate solution step by step.

They are essential for large sparse systems because they avoid dense factorization and often require only matrix-vector products.

The central objects are:

Concept Meaning
Residual (r^{(k)}=b-Ax^{(k)})
Error (e^{(k)}=x-x^{(k)})
Iteration matrix controls stationary convergence
Spectral radius determines fixed-point convergence
Krylov subspace search space for modern methods
Preconditioner transforms system for faster convergence

Classical stationary methods include Jacobi, Gauss-Seidel, and SOR. Modern large-scale methods include conjugate gradient, GMRES, MINRES, and BiCGSTAB. In practice, preconditioning is often the decisive ingredient.