Chapter 120. Machine Learning

Chapter 120. Machine Learning

Machine learning studies algorithms that improve their behavior from data.

A machine learning system receives examples, extracts structure, and produces predictions, classifications, rankings, embeddings, or decisions. Linear algebra is central because data are represented as vectors and matrices, models are often linear or locally linear, and training is usually an optimization problem over parameters.

Many standard learning methods can be written in the form

$$ \text{data matrix} \to \text{linear map} \to \text{loss function} \to \text{optimization}. $$

This is why linear algebra appears throughout machine learning: in regression, classification, kernels, embeddings, neural networks, dimensionality reduction, and numerical optimization. Stanford CS229 notes, for example, develop regularization, kernels, SVMs, and optimization using vector and matrix notation.

120.1 Data as Vectors

A single data point is often represented as a vector

$$ x = \begin{bmatrix} x_1 \ x_2 \ \vdots \ x_d \end{bmatrix} \in \mathbb{R}^d. $$

Each component is a feature.

For example, a house may be represented by features such as size, number of rooms, distance to city center, and age. An image may be represented by pixel values. A document may be represented by word counts or embedding coordinates.

Object Vector representation
Image Pixel vector or feature vector
Text Token vector, count vector, or embedding
Audio Sample vector or spectral feature vector
User Preference vector
Product Attribute or embedding vector
Sensor reading Measurement vector

Once data are vectors, distances, inner products, projections, norms, and linear transformations become available.

120.2 Data Matrices

A dataset with (m) examples and (d) features is stored as a matrix

$$ X = \begin{bmatrix}

  • & x_1^T & - \
  • & x_2^T & - \ & \vdots & \
  • & x_m^T & - \end{bmatrix} \in \mathbb{R}^{m\times d}. $$

Each row is one example. Each column is one feature.

The target values are often stored as a vector

$$ y = \begin{bmatrix} y_1 \ y_2 \ \vdots \ y_m \end{bmatrix}. $$

The pair ((X,y)) is the basic supervised learning dataset.

The design is the same as in linear regression. Machine learning extends this framework to broader models, losses, constraints, and data types.

120.3 Features

A feature is a numerical quantity used by a model.

A feature map converts raw input into a vector:

$$ \phi : \mathcal{X} \to \mathbb{R}^d. $$

For an input object (a), the model uses

$$ x=\phi(a). $$

Feature maps may be simple or complex.

Raw object Possible features
Text Word counts, TF-IDF, embeddings
Image Pixels, patches, learned features
Graph Degrees, neighborhoods, spectral features
User behavior Counts, recency, preferences
Time series Lags, moving averages, Fourier coefficients

Feature design changes the geometry of the learning problem. A model that is linear in one feature space may represent nonlinear behavior in the original input space.

120.4 Linear Prediction

The simplest prediction model is linear:

$$ \hat{y}=w^Tx+b. $$

Here (w\in\mathbb{R}^d) is the weight vector and (b\in\mathbb{R}) is the bias.

The value (w_j) measures how feature (j) contributes to the prediction.

For the full dataset,

$$ \hat{y}=Xw+b\mathbf{1}. $$

If the bias is included as an extra feature equal to (1), then the model becomes

$$ \hat{y}=Xw. $$

This is the same matrix form used in least squares.

120.5 Loss Functions

A loss function measures prediction error.

For regression, a common loss is squared error:

$$ \ell(\hat{y},y)=(\hat{y}-y)^2. $$

For a dataset, the empirical risk is

$$ L(w)=\frac{1}{m}\sum_{i=1}^m (w^Tx_i-y_i)^2. $$

In matrix form,

$$ L(w)=\frac{1}{m}|Xw-y|^2. $$

Thus linear regression is a least squares problem.

Other learning problems use different losses.

Problem Common loss
Regression Squared error
Binary classification Logistic loss
Margin classification Hinge loss
Multiclass classification Cross-entropy
Ranking Pairwise ranking loss
Embedding learning Contrastive loss

The choice of loss changes the optimization problem.

120.6 Training

Training means choosing model parameters to minimize loss.

For a linear model, training solves

$$ \min_w L(w). $$

For squared loss,

$$ \min_w |Xw-y|^2. $$

This is ordinary least squares.

If (X^TX) is invertible, the solution is

$$ \hat{w}=(X^TX)^{-1}X^Ty. $$

In larger or more complex models, training is usually done by iterative optimization, such as gradient descent.

120.7 Gradient Descent

Gradient descent updates parameters by moving opposite the gradient:

$$ w_{k+1}=w_k-\eta \nabla L(w_k). $$

Here (\eta>0) is the learning rate.

For squared loss,

$$ L(w)=\frac{1}{m}|Xw-y|^2, $$

the gradient is

$$ \nabla L(w)=\frac{2}{m}X^T(Xw-y). $$

Thus each update uses matrix-vector products with (X) and (X^T).

This is one reason large-scale machine learning depends heavily on efficient linear algebra kernels.

120.8 Regularization

Regularization adds a penalty to the training objective.

A common example is ridge regularization:

$$ \min_w |Xw-y|^2+\lambda|w|^2. $$

Here (\lambda>0) controls the penalty strength.

The solution satisfies

$$ (X^TX+\lambda I)w=X^Ty. $$

Regularization discourages overly large parameter vectors and improves stability when features are correlated or data are limited. CS229 notes describe (\ell_2) regularization as encouraging small parameter norm and, in gradient descent form, as weight decay.

120.9 Classification

In binary classification, the target is usually

$$ y\in{-1,+1} $$

or

$$ y\in{0,1}. $$

A linear classifier computes a score

$$ s=w^Tx+b. $$

The predicted class may be

$$ \hat{y}=\operatorname{sign}(s). $$

Geometrically, the equation

$$ w^Tx+b=0 $$

defines a hyperplane.

This hyperplane separates feature space into two half-spaces. The vector (w) is normal to the separating hyperplane.

Classification is therefore a geometric problem in vector space.

120.10 Logistic Regression

Logistic regression converts a linear score into a probability.

The score is

$$ s=w^Tx+b. $$

The probability of class (1) is

$$ p=\sigma(s)=\frac{1}{1+e^{-s}}. $$

The sigmoid function maps real numbers to values between (0) and (1).

Training usually minimizes cross-entropy loss:

$$ L(w) = -\sum_{i=1}^m \left[ y_i\log p_i+(1-y_i)\log(1-p_i) \right]. $$

Although the model is linear in its score, the probability is nonlinear in the parameters.

The training objective is convex for ordinary logistic regression.

120.11 Support Vector Machines

A support vector machine seeks a separating hyperplane with a large margin.

For binary labels

$$ y_i\in{-1,+1}, $$

a linear classifier uses

$$ f(x)=w^Tx+b. $$

The margin condition is

$$ y_i(w^Tx_i+b)\geq 1. $$

The hard-margin SVM solves

$$ \min_{w,b}\frac{1}{2}|w|^2 $$

subject to the margin constraints.

The support vectors are training points that lie on or inside the margin boundary. SVMs are commonly developed through optimization, kernels, and inner products, and the kernel trick replaces explicit high-dimensional feature maps by kernel evaluations.

120.12 Kernels

A kernel is a function

$$ K(x,z) $$

that behaves like an inner product in some feature space:

$$ K(x,z)=\langle \phi(x),\phi(z)\rangle. $$

The kernel trick allows algorithms to use inner products in a high-dimensional feature space without explicitly constructing the feature vectors.

Common kernels include:

Kernel Formula
Linear (K(x,z)=x^Tz)
Polynomial (K(x,z)=(x^Tz+c)^p)
Gaussian RBF (K(x,z)=\exp(-\gamma|x-z|^2))

Kernel methods show that linear algebra can operate implicitly through Gram matrices.

The Gram matrix has entries

$$ G_{ij}=K(x_i,x_j). $$

120.13 Embeddings

An embedding maps an object into a vector space.

For example, a word, document, image, user, or item may be mapped to

$$ z\in\mathbb{R}^d. $$

The goal is that geometry in the embedding space reflects semantic or task-relevant structure.

Similar objects should have nearby vectors. Dissimilar objects should have distant or differently oriented vectors.

Common similarity measures include:

Measure Formula
Dot product (x^Tz)
Euclidean distance (|x-z|)
Cosine similarity (\frac{x^Tz}{|x||z|})

Embeddings turn complex objects into points in a learned vector space.

120.14 Neural Networks

A neural network is a composition of linear maps and nonlinear activation functions.

A single layer has the form

$$ h=\sigma(Wx+b). $$

Here (W) is a weight matrix, (b) is a bias vector, and (\sigma) is applied componentwise.

A multilayer network has the form

$$ f(x) = W_L\sigma(W_{L-1}\sigma(\cdots\sigma(W_1x+b_1)))+b_L. $$

The matrices (W_1,\ldots,W_L) contain the learned parameters.

Without nonlinear activation functions, the composition would collapse to one linear map. Nonlinearity is what allows neural networks to represent complex functions.

Still, each layer is built from matrix multiplication.

120.15 Backpropagation

Backpropagation computes gradients of a loss with respect to all parameters in a neural network.

It applies the chain rule through the computational graph.

For a layer

$$ h=\sigma(Wx+b), $$

the gradient with respect to (W) has an outer-product structure:

$$ \frac{\partial L}{\partial W} = \delta x^T, $$

where (\delta) is the backpropagated error vector for the layer.

Thus training neural networks repeatedly uses matrix products, transposes, Jacobian-vector products, and outer products.

MIT notes on matrix calculus for machine learning emphasize this matrix-based calculus viewpoint for differentiating vector and matrix expressions used in learning systems.

120.16 Dimensionality Reduction

High-dimensional data are often compressed into lower-dimensional representations.

A linear dimensionality reduction map has the form

$$ z=W^Tx, $$

where

$$ W\in\mathbb{R}^{d\times r}, \qquad r<d. $$

The vector (z\in\mathbb{R}^r) is a lower-dimensional representation of (x).

Good dimensionality reduction preserves important structure: variance, distance, neighborhood relations, class separation, or task information.

Linear algebra supplies the central methods, especially eigenvalue decompositions and singular value decompositions.

120.17 Principal Component Analysis

Principal component analysis, or PCA, finds orthogonal directions of maximum variance.

Given centered data matrix (X), the sample covariance matrix is proportional to

$$ X^TX. $$

The principal components are eigenvectors of (X^TX).

Equivalently, PCA can be computed by the singular value decomposition

$$ X=U\Sigma V^T. $$

The columns of (V) give principal directions.

Keeping the first (r) directions gives a rank-(r) approximation of the data.

PCA is therefore a direct application of eigenvalues, orthogonality, projection, and SVD.

120.18 Matrix Factorization

Many machine learning problems use matrix factorization.

Suppose (R\in\mathbb{R}^{m\times n}) stores user-item ratings.

A low-rank model approximates

$$ R\approx UV^T, $$

where

$$ U\in\mathbb{R}^{m\times r}, \qquad V\in\mathbb{R}^{n\times r}. $$

The row (u_i) represents user (i). The row (v_j) represents item (j).

The predicted rating is

$$ \hat{R}_{ij}=u_i^Tv_j. $$

This is the basis of many recommender systems.

The model assumes that observed preferences are governed by a smaller number of latent factors.

120.19 Attention

Attention mechanisms compare query, key, and value vectors.

For matrices

$$ Q,\quad K,\quad V, $$

scaled dot-product attention computes

$$ \operatorname{Attention}(Q,K,V) = \operatorname{softmax} \left( \frac{QK^T}{\sqrt{d}} \right)V. $$

The matrix

$$ QK^T $$

contains pairwise similarities between queries and keys.

The softmax converts similarities into weights.

Multiplication by (V) forms weighted sums of value vectors.

Attention is therefore a structured sequence of matrix multiplication, normalization, and weighted averaging.

120.20 Covariance

Covariance measures how features vary together.

For centered data matrix (X), the covariance matrix is

$$ \Sigma=\frac{1}{m}X^TX. $$

The entry (\Sigma_{ij}) measures the relationship between feature (i) and feature (j).

The diagonal entries are variances. The off-diagonal entries are covariances.

Covariance matrices are symmetric positive semidefinite.

They appear in PCA, Gaussian models, whitening, metric learning, Kalman filters, and uncertainty estimation.

120.21 Whitening

Whitening transforms data so that its covariance becomes the identity matrix.

If

$$ \Sigma=V\Lambda V^T $$

is an eigendecomposition of the covariance matrix, then a whitening transform is

$$ W=\Lambda^{-1/2}V^T. $$

For centered data (x), define

$$ z=Wx. $$

Then, ideally,

$$ \operatorname{Cov}(z)=I. $$

Whitening removes scale and correlation from the data.

It is used as preprocessing, in statistical models, and in representation learning.

Many machine learning systems retrieve nearest vectors.

Given a query vector (q), find database vectors (x_i) that maximize

$$ q^Tx_i $$

or minimize

$$ |q-x_i|. $$

This is nearest neighbor search in a vector space.

Applications include semantic search, image retrieval, recommendation, clustering, duplicate detection, and memory retrieval.

At small scale, exact search can compute all distances. At large scale, approximate nearest neighbor methods use indexing structures, quantization, hashing, or graph search.

The underlying computations remain vector operations.

120.23 Generalization

Training error measures performance on the training data. Generalization concerns performance on new data.

Linear algebra enters generalization through model capacity, norm constraints, rank, margins, and regularization.

For example, a classifier with a larger margin is often more robust to small perturbations. A low-rank representation may discard noise. A small norm solution may avoid overly sensitive predictions.

These are geometric ideas.

Machine learning theory studies how these geometric constraints affect prediction on unseen examples.

120.24 Overfitting

Overfitting occurs when a model fits noise or accidental structure in the training set.

A model with too much flexibility may achieve low training loss but poor test performance.

Linear algebraic symptoms include:

Symptom Meaning
Ill-conditioned design matrix Unstable parameters
Very large weights Sensitive predictions
High rank with noisy directions Noise fitted as signal
Small singular values used heavily Poor numerical stability
Near-duplicate features Multicollinearity

Regularization, dimensionality reduction, early stopping, and more data can reduce overfitting.

120.25 Machine Learning and Linear Algebra

The main dictionary is direct.

Machine learning object Linear algebra object
Example Vector
Dataset Matrix
Feature map Vector-valued function
Linear model Dot product
Neural network layer Matrix plus nonlinearity
Training Optimization over vectors and matrices
Loss gradient Vector or matrix derivative
Regularization Norm penalty
Kernel method Gram matrix
Embedding Learned coordinate vector
PCA Eigenvalue problem
Matrix factorization Low-rank approximation
Attention Matrix products and weighted sums
Similarity search Inner products and distances

This table explains why linear algebra is not only a prerequisite for machine learning. It is part of the structure of the field.

120.26 Summary

Machine learning represents data as vectors and collections of data as matrices.

Linear models use dot products. Regression solves least squares problems. Classification separates vector spaces by hyperplanes. Kernels replace explicit feature vectors by inner products. Neural networks compose matrix transformations with nonlinearities. PCA, embeddings, attention, recommender systems, and similarity search all rely on matrix operations.

The central principle is representation plus optimization. Data are placed in vector spaces, models transform those vectors, losses measure error, and algorithms adjust parameters by linear algebraic computation.