Chapter 91. Conjugate Gradient Method

The conjugate gradient method is an iterative method for solving linear systems

$$ Ax=b $$

where (A) is symmetric positive definite. It is one of the most important Krylov subspace methods in numerical linear algebra.

The method combines three ideas:

Idea	Role
Quadratic minimization	interprets (Ax=b) as an optimization problem
Conjugate directions	avoids undoing previous progress
Krylov subspaces	builds approximations using matrix-vector products

For large sparse symmetric positive definite systems, conjugate gradient is often the standard first method to try. It avoids matrix factorization and requires mainly matrix-vector products, vector updates, and inner products. In exact arithmetic, it reaches the exact solution in at most (n) steps for an (n \times n) system, but in floating point arithmetic it is used as an iterative method and is usually stopped when the residual is small enough.

91.1 Symmetric Positive Definite Systems

The conjugate gradient method applies to systems

$$ Ax=b $$

where (A) is symmetric positive definite.

Symmetric means

$$ A^T=A. $$

Positive definite means

$$ x^TAx>0 $$

for every nonzero vector (x).

These assumptions are essential. They imply that (A) has positive eigenvalues, defines an inner product, and determines a strictly convex quadratic function.

Many applied problems produce symmetric positive definite matrices, including finite difference methods, finite element methods, least squares normal equations, graph Laplacian reductions, and quadratic optimization problems.

91.2 Linear Systems as Quadratic Minimization

If (A) is symmetric positive definite, solving

$$ Ax=b $$

is equivalent to minimizing the quadratic function

$$ \phi(x) = \frac{1}{2}x^TAx-b^Tx. $$

The gradient is

$$ \nabla \phi(x)=Ax-b. $$

Thus the stationary condition

$$ \nabla \phi(x)=0 $$

is exactly

$$ Ax=b. $$

Since (A) is positive definite, the quadratic function is strictly convex. Therefore it has a unique minimizer, and that minimizer is the unique solution of the linear system.

91.3 Residual and Gradient

At an approximate solution (x^{(k)}), define the residual

$$ r^{(k)}=b-Ax^{(k)}. $$

The gradient of (\phi) is

$$ \nabla\phi(x^{(k)})=Ax^{(k)}-b. $$

Therefore,

$$ r^{(k)}=-\nabla\phi(x^{(k)}). $$

The residual is the negative gradient of the quadratic objective. It points in the steepest descent direction.

Steepest descent uses the residual direction at each step. Conjugate gradient improves on steepest descent by choosing directions that are mutually conjugate with respect to (A).

91.4 The Energy Inner Product

For symmetric positive definite (A), define the (A)-inner product by

$$ \langle u,v\rangle_A=u^TAv. $$

The associated norm is

$$ |u|_A=\sqrt{u^TAu}. $$

This norm is called the energy norm.

Two vectors (p) and (q) are (A)-conjugate if

$$ p^TAq=0. $$

This is orthogonality measured in the geometry induced by (A), rather than ordinary Euclidean geometry.

91.5 Conjugate Directions

Suppose we have nonzero directions

$$ p^{(0)},p^{(1)},\ldots,p^{(m-1)} $$

that are mutually (A)-conjugate:

$$ (p^{(i)})^TAp^{(j)}=0 \qquad i\ne j. $$

If we minimize (\phi) successively along these directions, each step does not interfere with the progress made in previous directions.

This is the key advantage over steepest descent.

Steepest descent may zigzag across elongated level sets. Conjugate directions account for the geometry of the quadratic surface and avoid repeated correction in the same effective direction.

91.6 One-Dimensional Minimization

Given a current approximation (x^{(k)}) and a search direction (p^{(k)}), choose

$$ x^{(k+1)}=x^{(k)}+\alpha_k p^{(k)}. $$

We choose (\alpha_k) to minimize

$$ \phi(x^{(k)}+\alpha p^{(k)}). $$

Differentiate with respect to (\alpha):

$$ \frac{d}{d\alpha} \phi(x^{(k)}+\alpha p^{(k)}) = (p^{(k)})^T(Ax^{(k)}-b) + \alpha (p^{(k)})^TAp^{(k)}. $$

Since

$$ r^{(k)}=b-Ax^{(k)}, $$

setting the derivative to zero gives

$$ \alpha_k = \frac{(p^{(k)})^Tr^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

In the standard conjugate gradient recurrence, this becomes

$$ \alpha_k = \frac{(r^{(k)})^Tr^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

91.7 Residual Update

After choosing (\alpha_k), the new approximation is

$$ x^{(k+1)} = x^{(k)}+\alpha_kp^{(k)}. $$

The residual updates as

$$ r^{(k+1)} = b-Ax^{(k+1)}. $$

Substitute the update for (x^{(k+1)}):

$$ r^{(k+1)} = b-A(x^{(k)}+\alpha_kp^{(k)}) = b-Ax^{(k)}-\alpha_kAp^{(k)}. $$

Therefore,

$$ r^{(k+1)} = r^{(k)}-\alpha_kAp^{(k)}. $$

This update avoids recomputing (Ax^{(k+1)}) from scratch.

91.8 Constructing the Next Direction

The next search direction is formed from the new residual and the previous direction:

$$ p^{(k+1)} = r^{(k+1)}+\beta_kp^{(k)}. $$

The coefficient (\beta_k) is chosen so that

$$ (p^{(k+1)})^TAp^{(k)}=0. $$

For the standard conjugate gradient method,

$$ \beta_k = \frac{(r^{(k+1)})^Tr^{(k+1)}}{(r^{(k)})^Tr^{(k)}}. $$

Thus the method stores only the current approximation, residual, search direction, and one matrix-vector product.

91.9 The Basic Algorithm

Choose an initial vector (x^{(0)}).

Set

$$ r^{(0)}=b-Ax^{(0)}, \qquad p^{(0)}=r^{(0)}. $$

For

$$ k=0,1,2,\ldots $$

compute

$$ \alpha_k = \frac{(r^{(k)})^Tr^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

Then update

$$ x^{(k+1)} = x^{(k)}+\alpha_kp^{(k)}. $$

Update the residual:

$$ r^{(k+1)} = r^{(k)}-\alpha_kAp^{(k)}. $$

If the residual is small enough, stop.

Otherwise compute

$$ \beta_k = \frac{(r^{(k+1)})^Tr^{(k+1)}}{(r^{(k)})^Tr^{(k)}}. $$

Then update

$$ p^{(k+1)} = r^{(k+1)}+\beta_kp^{(k)}. $$

This is the standard three-term recurrence form of conjugate gradient.

91.10 Pseudocode

conjugate_gradient(A, b, x, tol, max_iter):
    r = b - A * x
    p = r
    rsold = dot(r, r)

    for k = 0 to max_iter - 1:
        Ap = A * p
        alpha = rsold / dot(p, Ap)

        x = x + alpha * p
        r = r - alpha * Ap

        rsnew = dot(r, r)

        if sqrt(rsnew) / norm(b) <= tol:
            return x

        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew

    return x

Each iteration requires:

Operation	Count
Matrix-vector product	1
Dot products	2
Vector updates	several
Stored vectors	few

The dominant cost for sparse matrices is usually the matrix-vector product.

91.11 Orthogonality of Residuals

In exact arithmetic, the residuals produced by conjugate gradient are mutually orthogonal:

$$ (r^{(i)})^Tr^{(j)}=0 \qquad i\ne j. $$

This means each iteration removes an error component in a new independent direction.

Since there can be at most (n) mutually orthogonal nonzero vectors in (\mathbb{R}^n), the method reaches zero residual in at most (n) steps in exact arithmetic. This is the finite termination property.

91.12 Conjugacy of Search Directions

The search directions are mutually (A)-conjugate:

$$ (p^{(i)})^TAp^{(j)}=0 \qquad i\ne j. $$

This is stronger than ordinary independence.

It means the directions are orthogonal under the energy inner product.

Because the objective function has Hessian (A), (A)-conjugacy is exactly the orthogonality needed to minimize the quadratic efficiently.

91.13 Krylov Subspace Interpretation

The (k)-th Krylov subspace is

$$ \mathcal{K}_k(A,r^{(0)}) = \operatorname{span}{r^{(0)},Ar^{(0)},A^2r^{(0)},\ldots,A^{k-1}r^{(0)}}. $$

The conjugate gradient iterate satisfies

$$ x^{(k)} \in x^{(0)}+\mathcal{K}_k(A,r^{(0)}). $$

Among all vectors in this affine Krylov space, (x^{(k)}) minimizes the error in the energy norm:

$$ |x-x^{(k)}|_A. $$

This Krylov interpretation explains why the method only needs matrix-vector products.

91.14 Polynomial Approximation View

Since

$$ e^{(k)}=x-x^{(k)}, $$

the error after (k) steps can be written as

$$ e^{(k)}=q_k(A)e^{(0)} $$

for a polynomial (q_k) of degree (k) satisfying

$$ q_k(0)=1. $$

The method chooses a polynomial that makes the error small over the eigenvalues of (A).

Thus convergence depends not only on the largest and smallest eigenvalues, but also on the distribution of all eigenvalues.

Clustered eigenvalues often lead to rapid convergence.

91.15 Convergence Bound

For symmetric positive definite (A), a standard bound is

$$ \frac{|x-x^{(k)}|_A}{|x-x^{(0)}|_A} \le 2 \left( \frac{\sqrt{\kappa_2(A)}-1} {\sqrt{\kappa_2(A)}+1} \right)^k. $$

Here

$$ \kappa_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} $$

is the spectral condition number.

This bound shows that convergence becomes slower as the condition number grows. It also explains why conjugate gradient is often much faster than Jacobi or Gauss-Seidel for well-preconditioned symmetric positive definite systems.

91.16 Effect of Eigenvalue Distribution

The condition number gives a useful worst-case estimate, but it does not tell the full story.

Conjugate gradient may converge rapidly when eigenvalues are clustered, even if the condition number is large.

Intuitively, each iteration builds a polynomial that is small at the eigenvalues of (A). If many eigenvalues lie in clusters, a low-degree polynomial can be small on most of them.

Thus two matrices with the same condition number may produce different convergence histories.

91.17 Preconditioned Conjugate Gradient

Preconditioning replaces the original system by an equivalent one that has better spectral properties.

Choose a symmetric positive definite matrix (M) that approximates (A) and is easy to solve with.

The preconditioned method uses

$$ Mz^{(k)}=r^{(k)}. $$

The vector (z^{(k)}) is the preconditioned residual.

The goal is to make the effective system better conditioned. In practice, preconditioning is often necessary for fast convergence of conjugate gradient.

91.18 Preconditioned Algorithm

Choose (x^{(0)}).

Set

$$ r^{(0)}=b-Ax^{(0)}. $$

Solve

$$ Mz^{(0)}=r^{(0)}. $$

Set

$$ p^{(0)}=z^{(0)}. $$

For (k=0,1,2,\ldots):

$$ \alpha_k = \frac{(r^{(k)})^Tz^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

$$ x^{(k+1)} = x^{(k)}+\alpha_kp^{(k)}. $$

$$ r^{(k+1)} = r^{(k)}-\alpha_kAp^{(k)}. $$

Solve

$$ Mz^{(k+1)}=r^{(k+1)}. $$

$$ \beta_k = \frac{(r^{(k+1)})^Tz^{(k+1)}}{(r^{(k)})^Tz^{(k)}}. $$

$$ p^{(k+1)} = z^{(k+1)}+\beta_kp^{(k)}. $$

91.19 Common Preconditioners

Common preconditioners include:

Preconditioner	Main idea
Jacobi	use the diagonal of (A)
Incomplete Cholesky	approximate Cholesky factorization
SSOR	symmetric relaxation-based preconditioner
Multigrid	solve error components across scales
Domain decomposition	split problem into subdomains
Algebraic multigrid	build hierarchy from matrix structure

A good preconditioner must balance two costs:

Requirement	Meaning
Strong approximation	reduces iteration count
Cheap application	keeps each iteration inexpensive

A preconditioner that is too expensive may erase the benefit of faster convergence.

91.20 Floating Point Behavior

In exact arithmetic, residuals remain mutually orthogonal and directions remain (A)-conjugate.

In floating point arithmetic, these properties gradually degrade.

Consequences include:

Exact arithmetic property	Floating point behavior
finite termination in at most (n) steps	rarely exact termination
mutually orthogonal residuals	orthogonality is gradually lost
mutually (A)-conjugate directions	conjugacy is gradually lost
exact residual recurrence	recursive residual may drift

For this reason, implementations sometimes recompute the true residual

$$ b-Ax^{(k)} $$

occasionally, rather than relying only on the recurrence.

91.21 Stopping Criteria

Common stopping rules use the relative residual:

$$ \frac{|r^{(k)}|_2}{|b|_2}\le \tau. $$

Other choices include:

Criterion	Form
relative residual	(\|r^{(k)}\|/\|b\|)
preconditioned residual	((r^{(k)})^Tz^{(k)})
estimated error	bound using condition estimate
maximum iterations	fixed safety limit
stagnation	residual no longer decreases

A small residual must still be interpreted with conditioning. For ill-conditioned systems, a small residual may correspond to a larger solution error.

91.22 Computational Cost

For a sparse matrix with (\operatorname{nnz}(A)) nonzero entries, one conjugate gradient iteration costs approximately

$$ O(\operatorname{nnz}(A)). $$

The method stores only a small number of vectors:

Vector	Meaning
(x)	approximate solution
(r)	residual
(p)	search direction
(Ap)	matrix-vector product
(z)	preconditioned residual, if used

This low storage cost makes the method suitable for very large sparse systems.

91.23 Comparison with Steepest Descent

Steepest descent uses

$$ p^{(k)}=r^{(k)} $$

at every step.

Conjugate gradient uses

$$ p^{(k+1)}=r^{(k+1)}+\beta_kp^{(k)}. $$

Feature	Steepest descent	Conjugate gradient
Direction	residual only	residual plus previous direction
Geometry	Euclidean steepest direction	(A)-conjugate directions
Typical convergence	slow on elongated quadratics	much faster
Memory	very low	very low
Matrix requirement	SPD for standard form	SPD

Conjugate gradient can be viewed as steepest descent with memory adjusted to the quadratic geometry.

91.24 Comparison with Direct Methods

Direct methods factor the matrix.

Conjugate gradient does not.

Feature	Direct factorization	Conjugate gradient
Matrix type	many types	SPD only
Work	predictable	depends on convergence
Memory	may be large due to fill-in	low
Sparse systems	fill-in may dominate	often efficient
Accuracy	high if stable	depends on tolerance and conditioning
Reuse for many right-hand sides	strong	weaker unless preconditioner reused

For a single large sparse SPD system, conjugate gradient is often preferable. For many right-hand sides with the same matrix, a direct factorization may become competitive.

91.25 Failure Modes

Conjugate gradient may fail or perform poorly when its assumptions are violated or when convergence is too slow.

Failure mode	Cause
denominator nonpositive	(A) not positive definite
slow convergence	large condition number
stagnation	poor preconditioner or roundoff
loss of conjugacy	floating point error
residual drift	recursive residual differs from true residual
breakdown in PCG	preconditioner not SPD

If (A) is symmetric indefinite, MINRES is often more appropriate. If (A) is nonsymmetric, GMRES or BiCGSTAB is usually used.

91.26 Practical Checklist

Before using conjugate gradient, check:

Question	Reason
Is (A) symmetric?	required by standard CG
Is (A) positive definite?	required for positive denominators and convexity
Is (A) sparse or matrix-free?	favors CG
Is a good preconditioner available?	controls iteration count
Is the stopping tolerance meaningful?	residual does not equal error
Is the matrix scaled reasonably?	improves numerical behavior

For many applications, the most important implementation decision is the preconditioner.

91.27 Summary

The conjugate gradient method solves symmetric positive definite systems

$$ Ax=b $$

by minimizing the quadratic

$$ \phi(x)=\frac{1}{2}x^TAx-b^Tx $$

over expanding Krylov subspaces.

Its basic recurrence is:

$$ r^{(0)}=b-Ax^{(0)}, \qquad p^{(0)}=r^{(0)}. $$

$$ \alpha_k = \frac{(r^{(k)})^Tr^{(k)}}{(p^{(k)})^TAp^{(k)}}. $$

$$ x^{(k+1)}=x^{(k)}+\alpha_kp^{(k)}. $$

$$ r^{(k+1)}=r^{(k)}-\alpha_kAp^{(k)}. $$

$$ \beta_k = \frac{(r^{(k+1)})^Tr^{(k+1)}}{(r^{(k)})^Tr^{(k)}}. $$

$$ p^{(k+1)}=r^{(k+1)}+\beta_kp^{(k)}. $$

In exact arithmetic, it terminates in at most (n) steps. In practice, it is used as a fast iterative method. Its effectiveness depends strongly on conditioning, eigenvalue distribution, and preconditioning.