Chapter 134. Numerical Optimization

134.1 Introduction

Numerical optimization studies algorithms for finding minima or maxima of functions using finite computation.

Linear algebra is one of the foundations of numerical optimization. Vectors represent variables, matrices represent constraints and derivatives, and linear systems appear inside nearly every optimization algorithm.

The basic optimization problem is:

$$ \text{minimize } f(x) $$

subject to

$$ x\in C, $$

where:

Symbol	Meaning
(x)	Unknown vector
(f(x))	Objective function
(C)	Feasible set

Optimization problems appear throughout science and engineering:

Area	Optimization task
Machine learning	Loss minimization
Statistics	Maximum likelihood
Control theory	Optimal control
Signal processing	Error minimization
Finance	Portfolio optimization
Physics	Energy minimization
Numerical PDEs	Variational methods

The central challenge is computational: how to efficiently approximate optimal solutions in finite precision arithmetic.

134.2 Optimization Problems

An optimization problem consists of:

Variables,
An objective function,
Constraints.

The objective function assigns a real number to each feasible point:

$$ f:\mathbb{R}^n\to\mathbb{R}. $$

The feasible set contains all allowed vectors.

For example:

$$ \text{minimize } x_1^2+x_2^2 $$

subject to

$$ x_1+x_2=1. $$

The objective measures squared distance from the origin. The constraint restricts the solution to a line.

Optimization seeks the feasible point with smallest objective value.

134.3 Linear Optimization

A linear optimization problem has:

Component	Form
Objective	Linear
Constraints	Linear

The standard form of linear programming is:

$$ \text{minimize } c^Tx $$

subject to

$$ Ax=b, $$

and

$$ x\ge0. $$

Here:

Symbol	Meaning
(x)	Decision vector
(c)	Cost vector
(A)	Constraint matrix
(b)	Constraint values

The feasible region is a convex polyhedron.

Linear programming is important because:

Many practical problems are naturally linear,
Nonlinear problems are often approximated linearly,
Efficient algorithms exist.

The geometry of linear programming is fundamentally linear-algebraic.

134.4 Least Squares Problems

One of the simplest optimization problems is least squares.

Given

$$ A\in\mathbb{R}^{m\times n}, \qquad b\in\mathbb{R}^m, $$

the least squares problem minimizes

$$ |Ax-b|^2. $$

This problem appears when the system

$$ Ax=b $$

has no exact solution.

The least squares solution satisfies the normal equations:

$$ A^TAx=A^Tb. $$

A^TAx=A^Tb

The matrix

$$ A^TA $$

is symmetric and positive semidefinite.

Least squares is central in regression, data fitting, signal reconstruction, and inverse problems.

134.5 Convex Optimization

A convex optimization problem minimizes a convex function over a convex set.

The problem is:

$$ \text{minimize } f(x) $$

subject to

$$ x\in C, $$

where:

Property	Meaning
(C) convex	Feasible region has no holes or bends inward
(f) convex	Local minima are global

Convex optimization is especially important because it avoids many pathological behaviors of general nonlinear optimization.

If (f) is differentiable and convex, then:

$$ \nabla f(x)=0 $$

implies (x) is globally optimal.

Convexity transforms optimization from a potentially difficult global problem into a tractable local one.

134.6 Gradient Descent

Gradient descent is one of the most fundamental optimization algorithms.

Given a differentiable function

$$ f:\mathbb{R}^n\to\mathbb{R}, $$

the gradient vector is

$$ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\ \vdots\ \frac{\partial f}{\partial x_n} \end{bmatrix}. $$

The gradient points in the direction of steepest increase.

Therefore:

$$ -\nabla f(x) $$

points in the direction of steepest decrease.

Gradient descent updates:

$$ x_{k+1} = x_k-\alpha_k\nabla f(x_k), $$

where:

Symbol	Meaning
(x_k)	Current iterate
(\alpha_k)	Step size
(\nabla f(x_k))	Gradient

x_{k+1}=x_k-\alpha_k\nabla f(x_k)

The algorithm repeatedly moves downhill.

134.7 Quadratic Optimization

A quadratic optimization problem minimizes a quadratic function:

$$ f(x) = \frac12 x^TQx+c^Tx+d, $$

where:

Symbol	Meaning
(Q)	Symmetric matrix
(c)	Linear term
(d)	Constant

The gradient is

$$ \nabla f(x)=Qx+c. $$

Setting the gradient equal to zero gives:

$$ Qx=-c. $$

Thus quadratic optimization reduces to solving a linear system.

If (Q) is positive definite, then the problem has a unique minimizer.

Quadratic problems are central because many nonlinear functions are locally approximated by quadratics.

134.8 Hessian Matrices

The Hessian matrix collects second derivatives.

For a twice-differentiable function

$$ f:\mathbb{R}^n\to\mathbb{R}, $$

the Hessian is

$$ H_f(x) = \left( \frac{\partial^2 f}{\partial x_i\partial x_j} \right). $$

The Hessian measures local curvature.

Important cases:

Hessian	Interpretation
Positive definite	Local minimum
Negative definite	Local maximum
Indefinite	Saddle point

Second-order optimization methods use Hessians to accelerate convergence.

134.9 Newton's Method

Newton's method uses second-order information.

The update rule is:

$$ x_{k+1} = x_k - H_f(x_k)^{-1}\nabla f(x_k). $$

x_{k+1}=x_k-H_f(x_k)^{-1}\nabla f(x_k)

The method approximates the objective locally by a quadratic function and minimizes that quadratic approximation.

Advantages:

Advantage	Reason
Fast local convergence	Uses curvature
Scale adaptation	Hessian rescales directions

Disadvantages:

Disadvantage	Reason
Expensive	Requires solving linear systems
Hessian storage	Large matrices
Possible instability	Hessian may be singular

Large-scale optimization often approximates Newton steps rather than computing them exactly.

134.10 Constrained Optimization

Many optimization problems include constraints.

Equality constraints have the form:

$$ g_i(x)=0. $$

Inequality constraints have the form:

$$ h_j(x)\le0. $$

The feasible set becomes:

$$ C = {x:g_i(x)=0,\ h_j(x)\le0}. $$

Constraints change the geometry of optimization.

The optimum may occur on a boundary rather than in the interior.

134.11 Lagrange Multipliers

Lagrange multipliers solve equality-constrained optimization problems.

Suppose we minimize

$$ f(x) $$

subject to

$$ g(x)=0. $$

Define the Lagrangian:

$$ L(x,\lambda) = f(x)+\lambda g(x). $$

Optimality conditions become:

$$ \nabla_x L(x,\lambda)=0, $$

and

$$ g(x)=0. $$

This yields:

$$ \nabla f(x) = -\lambda \nabla g(x). $$

\nabla f(x)=-\lambda \nabla g(x)

Geometrically, the objective gradient becomes parallel to the constraint gradient.

134.12 Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker conditions generalize Lagrange multipliers to inequality constraints.

For problems with constraints

$$ h_i(x)\le0, $$

the KKT conditions include:

Condition	Meaning
Stationarity	Gradient balance
Primal feasibility	Constraints satisfied
Dual feasibility	Multipliers nonnegative
Complementary slackness	Active constraints identified

The KKT system forms the basis of many modern optimization algorithms.

134.13 Interior-Point Methods

Interior-point methods solve constrained optimization problems by moving through the interior of the feasible region.

Rather than enforcing constraints exactly at each step, they use barrier functions such as:

$$ -\mu\sum_i \log(-h_i(x)). $$

The logarithm approaches infinity near the boundary, preventing the iterates from leaving the feasible region.

Interior-point methods are especially important in:

Problem class	Application
Linear programming	Large sparse systems
Quadratic programming	Control and finance
Semidefinite programming	Convex matrix optimization

These methods rely heavily on sparse linear algebra.

134.14 Conjugate Gradient Method

The conjugate gradient method solves large symmetric positive definite systems:

$$ Ax=b. $$

It may also be viewed as minimizing the quadratic function:

$$ f(x) = \frac12x^TAx-b^Tx. $$

f(x)=\frac12x^TAx-b^Tx

Conjugate gradient constructs search directions that are mutually (A)-orthogonal.

Advantages include:

Advantage	Reason
No matrix inversion	Iterative method
Exploits sparsity	Efficient for large systems
Low memory	Matrix-vector products only

Conjugate gradient is one of the central algorithms in numerical optimization.

134.15 Stochastic Gradient Descent

Stochastic gradient descent, or SGD, approximates gradients using random subsets of data.

Instead of computing:

$$ \nabla f(x), $$

one computes an estimate:

$$ g_k \approx \nabla f(x_k). $$

The update becomes:

$$ x_{k+1} = x_k-\alpha_k g_k. $$

SGD is fundamental in machine learning because exact gradients may be too expensive for massive datasets.

Noise in the gradient introduces randomness but greatly reduces computational cost.

134.16 Regularization

Optimization problems may be ill-posed or unstable.

Regularization adds penalty terms to stabilize solutions.

Examples include:

Method	Objective
Ridge regression	(\|Ax-b\|^2+\lambda\|x\|^2)
Lasso	(\|Ax-b\|^2+\lambda\|x\|_1)

Ridge regression penalizes large Euclidean norm.

Lasso promotes sparsity.

Regularization changes both the geometry and numerical conditioning of optimization problems.

134.17 Duality

Many optimization problems possess dual formulations.

The primal problem:

$$ \text{minimize } f(x) $$

may correspond to a dual problem involving multipliers or supporting hyperplanes.

Duality provides:

Benefit	Interpretation
Lower bounds	Optimality certification
Sensitivity analysis	Constraint interpretation
Alternative algorithms	Easier computation

Strong duality occurs when primal and dual optimal values coincide.

Convexity is closely tied to duality theory.

134.18 Numerical Stability

Optimization algorithms operate in finite precision arithmetic.

Important issues include:

Issue	Effect
Roundoff error	Loss of accuracy
Ill-conditioning	Amplified perturbations
Cancellation	Numerical instability
Overflow/underflow	Computational failure

Stable algorithms minimize error growth.

Condition numbers are especially important because they measure sensitivity of solutions to perturbations.

134.19 Sparse Optimization

Many optimization problems involve sparse matrices.

A sparse matrix has mostly zero entries.

Sparse linear algebra allows:

Advantage	Reason
Lower memory usage	Store nonzeros only
Faster computation	Skip zero operations
Scalable algorithms	Large systems feasible

Sparse optimization is fundamental in:

Area	Typical problem
PDE discretization	Sparse linear systems
Network optimization	Graph sparsity
Machine learning	Sparse features
Signal processing	Compressed sensing

134.20 Optimization and Linear Algebra

Nearly every optimization method relies on linear algebra.

Optimization concept	Linear algebra role
Gradient	Vector
Hessian	Matrix
Constraints	Linear systems
Newton step	Solve matrix equation
Least squares	Orthogonal projection
Convexity	Positive semidefinite Hessian
Duality	Linear functionals
Iterative methods	Krylov subspaces

Modern optimization is fundamentally computational linear algebra.

134.21 Summary

Numerical optimization studies computational methods for minimizing or maximizing functions.

The central concepts are:

Concept	Meaning
Objective function	Quantity being optimized
Feasible set	Allowed region
Convexity	Guarantees global optimality
Gradient	First-order change
Hessian	Second-order curvature
Newton method	Quadratic local approximation
Least squares	Error minimization
Conjugate gradient	Iterative quadratic solver
Interior-point method	Barrier-based constrained optimization
Stochastic gradient	Randomized optimization
Duality	Complementary optimization formulation
Regularization	Stabilization through penalties

Numerical optimization combines geometry, calculus, probability, and linear algebra into algorithms for solving large computational problems.