Chapter 134. Numerical Optimization

Chapter 134. Numerical Optimization

134.1 Introduction

Numerical optimization studies algorithms for finding minima or maxima of functions using finite computation.

Linear algebra is one of the foundations of numerical optimization. Vectors represent variables, matrices represent constraints and derivatives, and linear systems appear inside nearly every optimization algorithm.

The basic optimization problem is:

$$ \text{minimize } f(x) $$

subject to

$$ x\in C, $$

where:

Symbol Meaning
(x) Unknown vector
(f(x)) Objective function
(C) Feasible set

Optimization problems appear throughout science and engineering:

Area Optimization task
Machine learning Loss minimization
Statistics Maximum likelihood
Control theory Optimal control
Signal processing Error minimization
Finance Portfolio optimization
Physics Energy minimization
Numerical PDEs Variational methods

The central challenge is computational: how to efficiently approximate optimal solutions in finite precision arithmetic.

134.2 Optimization Problems

An optimization problem consists of:

  1. Variables,
  2. An objective function,
  3. Constraints.

The objective function assigns a real number to each feasible point:

$$ f:\mathbb{R}^n\to\mathbb{R}. $$

The feasible set contains all allowed vectors.

For example:

$$ \text{minimize } x_1^2+x_2^2 $$

subject to

$$ x_1+x_2=1. $$

The objective measures squared distance from the origin. The constraint restricts the solution to a line.

Optimization seeks the feasible point with smallest objective value.

134.3 Linear Optimization

A linear optimization problem has:

Component Form
Objective Linear
Constraints Linear

The standard form of linear programming is:

$$ \text{minimize } c^Tx $$

subject to

$$ Ax=b, $$

and

$$ x\ge0. $$

Here:

Symbol Meaning
(x) Decision vector
(c) Cost vector
(A) Constraint matrix
(b) Constraint values

The feasible region is a convex polyhedron.

Linear programming is important because:

  1. Many practical problems are naturally linear,
  2. Nonlinear problems are often approximated linearly,
  3. Efficient algorithms exist.

The geometry of linear programming is fundamentally linear-algebraic.

134.4 Least Squares Problems

One of the simplest optimization problems is least squares.

Given

$$ A\in\mathbb{R}^{m\times n}, \qquad b\in\mathbb{R}^m, $$

the least squares problem minimizes

$$ |Ax-b|^2. $$

This problem appears when the system

$$ Ax=b $$

has no exact solution.

The least squares solution satisfies the normal equations:

$$ A^TAx=A^Tb. $$

A^TAx=A^Tb

The matrix

$$ A^TA $$

is symmetric and positive semidefinite.

Least squares is central in regression, data fitting, signal reconstruction, and inverse problems.

134.5 Convex Optimization

A convex optimization problem minimizes a convex function over a convex set.

The problem is:

$$ \text{minimize } f(x) $$

subject to

$$ x\in C, $$

where:

Property Meaning
(C) convex Feasible region has no holes or bends inward
(f) convex Local minima are global

Convex optimization is especially important because it avoids many pathological behaviors of general nonlinear optimization.

If (f) is differentiable and convex, then:

$$ \nabla f(x)=0 $$

implies (x) is globally optimal.

Convexity transforms optimization from a potentially difficult global problem into a tractable local one.

134.6 Gradient Descent

Gradient descent is one of the most fundamental optimization algorithms.

Given a differentiable function

$$ f:\mathbb{R}^n\to\mathbb{R}, $$

the gradient vector is

$$ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\ \vdots\ \frac{\partial f}{\partial x_n} \end{bmatrix}. $$

The gradient points in the direction of steepest increase.

Therefore:

$$ -\nabla f(x) $$

points in the direction of steepest decrease.

Gradient descent updates:

$$ x_{k+1} = x_k-\alpha_k\nabla f(x_k), $$

where:

Symbol Meaning
(x_k) Current iterate
(\alpha_k) Step size
(\nabla f(x_k)) Gradient

x_{k+1}=x_k-\alpha_k\nabla f(x_k)

The algorithm repeatedly moves downhill.

134.7 Quadratic Optimization

A quadratic optimization problem minimizes a quadratic function:

$$ f(x) = \frac12 x^TQx+c^Tx+d, $$

where:

Symbol Meaning
(Q) Symmetric matrix
(c) Linear term
(d) Constant

The gradient is

$$ \nabla f(x)=Qx+c. $$

Setting the gradient equal to zero gives:

$$ Qx=-c. $$

Thus quadratic optimization reduces to solving a linear system.

If (Q) is positive definite, then the problem has a unique minimizer.

Quadratic problems are central because many nonlinear functions are locally approximated by quadratics.

134.8 Hessian Matrices

The Hessian matrix collects second derivatives.

For a twice-differentiable function

$$ f:\mathbb{R}^n\to\mathbb{R}, $$

the Hessian is

$$ H_f(x) = \left( \frac{\partial^2 f}{\partial x_i\partial x_j} \right). $$

The Hessian measures local curvature.

Important cases:

Hessian Interpretation
Positive definite Local minimum
Negative definite Local maximum
Indefinite Saddle point

Second-order optimization methods use Hessians to accelerate convergence.

134.9 Newton's Method

Newton's method uses second-order information.

The update rule is:

$$ x_{k+1} = x_k - H_f(x_k)^{-1}\nabla f(x_k). $$

x_{k+1}=x_k-H_f(x_k)^{-1}\nabla f(x_k)

The method approximates the objective locally by a quadratic function and minimizes that quadratic approximation.

Advantages:

Advantage Reason
Fast local convergence Uses curvature
Scale adaptation Hessian rescales directions

Disadvantages:

Disadvantage Reason
Expensive Requires solving linear systems
Hessian storage Large matrices
Possible instability Hessian may be singular

Large-scale optimization often approximates Newton steps rather than computing them exactly.

134.10 Constrained Optimization

Many optimization problems include constraints.

Equality constraints have the form:

$$ g_i(x)=0. $$

Inequality constraints have the form:

$$ h_j(x)\le0. $$

The feasible set becomes:

$$ C = {x:g_i(x)=0,\ h_j(x)\le0}. $$

Constraints change the geometry of optimization.

The optimum may occur on a boundary rather than in the interior.

134.11 Lagrange Multipliers

Lagrange multipliers solve equality-constrained optimization problems.

Suppose we minimize

$$ f(x) $$

subject to

$$ g(x)=0. $$

Define the Lagrangian:

$$ L(x,\lambda) = f(x)+\lambda g(x). $$

Optimality conditions become:

$$ \nabla_x L(x,\lambda)=0, $$

and

$$ g(x)=0. $$

This yields:

$$ \nabla f(x) = -\lambda \nabla g(x). $$

\nabla f(x)=-\lambda \nabla g(x)

Geometrically, the objective gradient becomes parallel to the constraint gradient.

134.12 Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker conditions generalize Lagrange multipliers to inequality constraints.

For problems with constraints

$$ h_i(x)\le0, $$

the KKT conditions include:

Condition Meaning
Stationarity Gradient balance
Primal feasibility Constraints satisfied
Dual feasibility Multipliers nonnegative
Complementary slackness Active constraints identified

The KKT system forms the basis of many modern optimization algorithms.

134.13 Interior-Point Methods

Interior-point methods solve constrained optimization problems by moving through the interior of the feasible region.

Rather than enforcing constraints exactly at each step, they use barrier functions such as:

$$ -\mu\sum_i \log(-h_i(x)). $$

The logarithm approaches infinity near the boundary, preventing the iterates from leaving the feasible region.

Interior-point methods are especially important in:

Problem class Application
Linear programming Large sparse systems
Quadratic programming Control and finance
Semidefinite programming Convex matrix optimization

These methods rely heavily on sparse linear algebra.

134.14 Conjugate Gradient Method

The conjugate gradient method solves large symmetric positive definite systems:

$$ Ax=b. $$

It may also be viewed as minimizing the quadratic function:

$$ f(x) = \frac12x^TAx-b^Tx. $$

f(x)=\frac12x^TAx-b^Tx

Conjugate gradient constructs search directions that are mutually (A)-orthogonal.

Advantages include:

Advantage Reason
No matrix inversion Iterative method
Exploits sparsity Efficient for large systems
Low memory Matrix-vector products only

Conjugate gradient is one of the central algorithms in numerical optimization.

134.15 Stochastic Gradient Descent

Stochastic gradient descent, or SGD, approximates gradients using random subsets of data.

Instead of computing:

$$ \nabla f(x), $$

one computes an estimate:

$$ g_k \approx \nabla f(x_k). $$

The update becomes:

$$ x_{k+1} = x_k-\alpha_k g_k. $$

SGD is fundamental in machine learning because exact gradients may be too expensive for massive datasets.

Noise in the gradient introduces randomness but greatly reduces computational cost.

134.16 Regularization

Optimization problems may be ill-posed or unstable.

Regularization adds penalty terms to stabilize solutions.

Examples include:

Method Objective
Ridge regression (|Ax-b|^2+\lambda|x|^2)
Lasso (|Ax-b|^2+\lambda|x|_1)

Ridge regression penalizes large Euclidean norm.

Lasso promotes sparsity.

Regularization changes both the geometry and numerical conditioning of optimization problems.

134.17 Duality

Many optimization problems possess dual formulations.

The primal problem:

$$ \text{minimize } f(x) $$

may correspond to a dual problem involving multipliers or supporting hyperplanes.

Duality provides:

Benefit Interpretation
Lower bounds Optimality certification
Sensitivity analysis Constraint interpretation
Alternative algorithms Easier computation

Strong duality occurs when primal and dual optimal values coincide.

Convexity is closely tied to duality theory.

134.18 Numerical Stability

Optimization algorithms operate in finite precision arithmetic.

Important issues include:

Issue Effect
Roundoff error Loss of accuracy
Ill-conditioning Amplified perturbations
Cancellation Numerical instability
Overflow/underflow Computational failure

Stable algorithms minimize error growth.

Condition numbers are especially important because they measure sensitivity of solutions to perturbations.

134.19 Sparse Optimization

Many optimization problems involve sparse matrices.

A sparse matrix has mostly zero entries.

Sparse linear algebra allows:

Advantage Reason
Lower memory usage Store nonzeros only
Faster computation Skip zero operations
Scalable algorithms Large systems feasible

Sparse optimization is fundamental in:

Area Typical problem
PDE discretization Sparse linear systems
Network optimization Graph sparsity
Machine learning Sparse features
Signal processing Compressed sensing

134.20 Optimization and Linear Algebra

Nearly every optimization method relies on linear algebra.

Optimization concept Linear algebra role
Gradient Vector
Hessian Matrix
Constraints Linear systems
Newton step Solve matrix equation
Least squares Orthogonal projection
Convexity Positive semidefinite Hessian
Duality Linear functionals
Iterative methods Krylov subspaces

Modern optimization is fundamentally computational linear algebra.

134.21 Summary

Numerical optimization studies computational methods for minimizing or maximizing functions.

The central concepts are:

Concept Meaning
Objective function Quantity being optimized
Feasible set Allowed region
Convexity Guarantees global optimality
Gradient First-order change
Hessian Second-order curvature
Newton method Quadratic local approximation
Least squares Error minimization
Conjugate gradient Iterative quadratic solver
Interior-point method Barrier-based constrained optimization
Stochastic gradient Randomized optimization
Duality Complementary optimization formulation
Regularization Stabilization through penalties

Numerical optimization combines geometry, calculus, probability, and linear algebra into algorithms for solving large computational problems.