Chapter 140. Modern Applications in AI

140.1 Introduction

Modern artificial intelligence is built on linear algebra.

Data is represented as vectors. Batches of data are matrices or tensors. Neural network layers are affine transformations followed by nonlinear functions. Training uses gradients, Jacobians, Hessians, and large-scale matrix operations. In transformer models, attention is expressed through matrix products involving query, key, and value matrices.

The central pattern is:

$$ \text{data} \longrightarrow \text{vectors} \longrightarrow \text{linear maps} \longrightarrow \text{optimization}. $$

This chapter describes how the ideas of linear algebra appear in current AI systems.

140.2 Data as Vectors

AI systems begin by converting objects into vectors.

A word, image, audio segment, document, user profile, protein sequence, or graph node may be represented by a vector

$$ x\in\mathbb{R}^d. $$

The dimension $d$ depends on the model.

Examples:

| Object | Vector representation |
|---|---|
| Word | Embedding vector |
| Image patch | Pixel or feature vector |
| Document | Dense semantic vector |
| User | Preference vector |
| Graph node | Node embedding |
| Audio frame | Feature vector |

Once data is represented as vectors, linear algebra can be used to compare, transform, combine, and optimize it.

140.3 Embeddings

An embedding maps a discrete object into a vector space.

For example, a vocabulary item $w$ may be mapped to

$$ e_w\in\mathbb{R}^d. $$

Words with related meanings often have embeddings that are close under cosine similarity or inner product.

The embedding matrix has the form

$$ E\in\mathbb{R}^{V\times d}, $$

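where $V$ is the vocabulary size and $d$ is the embedding dimension.

A token index selects one row of $E$. Equivalently, the lookup is the product of a one-hot row vector with $E$, so it is a linear operation that is implemented as indexing rather than as a full matrix product.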
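
Row selection and multiplication by a one-hot vector give the same result, which can be checked numerically. The following is a minimal NumPy sketch; the sizes and the names `one_hot`, `row_lookup`, and `matrix_product` are illustrative assumptions, not part of any particular model.

```python
import numpy as np

V, d = 5, 3                      # toy vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))      # embedding matrix, one row per vocabulary item

w = 2                            # hypothetical token index
one_hot = np.zeros(V)
one_hot[w] = 1.0

row_lookup = E[w]                # direct row selection
matrix_product = one_hot @ E     # one-hot row vector times E

print(np.allclose(row_lookup, matrix_product))
```

In practice the indexing form is preferred, since it avoids multiplying by a vector that is almost entirely zero.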

Embeddings are used in language models, recommender systems, image-text models, graph neural networks, and retrieval systems.

140.4 Similarity and Inner Products

Many AI systems compare vectors using inner products.

Given two vectors

$$ x,y\in\mathbb{R}^d, $$

their dot product is

$$ x^Ty. $$

Cosine similarity normalizes by vector lengths:

$$ \cos(x,y) = \frac{x^Ty}{\|x\|\,\|y\|}. $$

A cosine similarity close to 1 means that the two vectors point in nearly the same direction; a value near 0 means they are nearly orthogonal.
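
As a small illustration, the formula can be written directly in NumPy. The function name `cosine_similarity` and the test vectors are assumptions made for this sketch, and the function assumes both inputs are nonzero.

```python
import numpy as np

def cosine_similarity(x, y):
    """Return x.y / (||x|| ||y||) for two nonzero 1-D vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])    # same direction as x, different length
z = np.array([-3.0, 0.0, 1.0])   # orthogonal to x, since x.z = -3 + 0 + 3 = 0

print(cosine_similarity(x, y))
print(cosine_similarity(x, z))
```

Because of the normalization, rescaling either vector by a positive constant does not change the score, which is not true of the raw dot product.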

This is used in:

| Task | Use |
|---|---|
| Search | Find nearby document vectors |
| Recommendation | Compare user and item vectors |
| Classification | Compare feature and class vectors |
| Clustering | Group similar embeddings |
| Retrieval-augmented generation | Retrieve relevant context |

Vector similarity is one of the most common uses of linear algebra in AI.

140.5 Neural Network Layers

A basic neural network layer has the form

$$ y=\sigma(Wx+b), $$

where:

| Symbol | Meaning |
|---|---|
| $x$ | Input vector |
| $W$ | Weight matrix |
| $b$ | Bias vector |
| $\sigma$ | Nonlinear activation |
| $y$ | Output vector |

The affine part

$$ Wx+b $$

is linear algebra. The activation introduces nonlinearity.

A deep neural network composes many such layers:

$$ x \mapsto h_1=\sigma(W_1x+b_1) \mapsto h_2=\sigma(W_2h_1+b_2) \mapsto \cdots. $$

Thus deep learning alternates linear transformations with nonlinear coordinatewise operations.
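
The composition can be sketched as a short loop. This is a minimal illustration rather than a reference implementation: the choice of ReLU as the activation, the layer sizes, and the names `relu` and `mlp_forward` are all assumptions.

```python
import numpy as np

def relu(v):
    """Coordinatewise nonlinearity max(v, 0)."""
    return np.maximum(v, 0.0)

def mlp_forward(x, layers):
    """Apply h <- relu(W h + b) for each (W, b) pair, in order."""
    h = x
    for W, b in layers:
        h = relu(W @ h + b)
    return h

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # maps R^3 to R^4
    (rng.normal(size=(2, 4)), np.zeros(2)),  # maps R^4 to R^2
]
x = rng.normal(size=3)
y = mlp_forward(x, layers)
print(y.shape)
```

The only constraint between consecutive layers is that the output dimension of one weight matrix matches the input dimension of the next.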

140.6 Batches and Matrix Multiplication

Training uses batches of examples.

If a batch contains $B$ input vectors of dimension $d$, they are stored as a matrix

$$ X\in\mathbb{R}^{B\times d}. $$

A linear layer applied to the whole batch is

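$$ Y=XW+\mathbf{1}b^T, $$

where $W\in\mathbb{R}^{d\times m}$ is a weight matrix and $\mathbf{1}b^T$ adds the bias vector $b\in\mathbb{R}^m$ to every row. Each example is now a row of $X$, so the weight matrix multiplies from the right; it is the transpose of the matrix used in the single-vector form $Wx+b$.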

This turns many vector operations into one matrix multiplication.
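
A short check that the batched product agrees with applying the layer one example at a time. The sizes are arbitrary, and the bias is added through NumPy broadcasting, which is one common way to realize the broadcast term; the name `Y_loop` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d, m = 4, 3, 2                 # batch size, input dimension, output dimension (assumed)
X = rng.normal(size=(B, d))       # one example per row
W = rng.normal(size=(d, m))
b = rng.normal(size=m)

Y = X @ W + b                     # b is broadcast across the B rows

# Per-example version: each row is W^T x_i + b.
Y_loop = np.stack([W.T @ X[i] + b for i in range(B)])

print(np.allclose(Y, Y_loop))
```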

Matrix multiplication is the computational core of neural network training and inference. Modern hardware accelerators are designed around fast dense matrix and tensor operations.

140.7 Loss Functions and Gradients

Training adjusts weights to minimize a loss function.

Let

$$ \theta $$

denote all model parameters. Training solves approximately:

$$ \min_\theta L(\theta). $$

The gradient

$$ \nabla_\theta L $$

points in the direction of greatest local increase. Optimization algorithms move in the opposite direction.

A typical update is:

$$ \theta_{k+1} = \theta_k-\alpha_k\nabla_\theta L(\theta_k). $$

This is gradient descent in parameter space.

In modern AI, $\theta$ may contain millions or billions of parameters, but the principle remains ordinary vector calculus and linear algebra.
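
To make the update rule concrete, here is a minimal sketch on a least-squares loss $L(\theta)=\frac{1}{2n}\|X\theta-y\|^2$, whose gradient is $\frac{1}{n}X^T(X\theta-y)$. The loss, the synthetic data, the constant step size `alpha`, and the iteration count are assumptions chosen for illustration; real training typically uses stochastic mini-batch variants.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                      # noiseless synthetic targets

def loss(theta):
    """Least-squares loss (1 / 2n) * ||X theta - y||^2."""
    r = X @ theta - y
    return 0.5 * np.mean(r ** 2)

def grad(theta):
    """Gradient of the loss: (1 / n) * X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / n

theta = np.zeros(d)
alpha = 0.1                             # constant step size (assumed)
for k in range(500):
    theta = theta - alpha * grad(theta)

print(loss(theta))
```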

140.8 Backpropagation

Backpropagation computes gradients through a composed function.

If a model is a composition

$$ f=f_n\circ f_{n-1}\circ\cdots\circ f_1, $$

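then the chain rule says that the Jacobian of $f$ is the product of the Jacobians of the individual maps, with the outermost map on the left:

$$ J_f = J_{f_n}J_{f_{n-1}}\cdots J_{f_1}, $$

where each factor is evaluated at the corresponding intermediate value.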

Backpropagation applies this rule efficiently without explicitly forming every large Jacobian.

Instead, it propagates vector-Jacobian products backward through the computation graph.
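
The backward pass can be illustrated on a small composition $f(x)=W_2\tanh(W_1x)$. The sketch below is an assumption-laden toy, with hypothetical names `W1`, `W2`, `v`, and `vjp`: it computes $v^TJ_f$ stage by stage and compares it with the product obtained from the explicitly formed Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
v = rng.normal(size=2)           # upstream gradient with respect to the output

# Forward pass, keeping the intermediate activation.
a = np.tanh(W1 @ x)

# Backward pass: one vector-Jacobian product per stage, last stage first.
g = W2.T @ v                     # through the linear map a -> W2 a
g = g * (1.0 - a ** 2)           # through tanh; its Jacobian is diagonal and never formed
vjp = W1.T @ g                   # through the linear map x -> W1 x

# Reference: build the full Jacobian explicitly and multiply.
J = W2 @ np.diag(1.0 - a ** 2) @ W1
vjp_reference = J.T @ v

print(np.allclose(vjp, vjp_reference))
```

The backward pass only ever stores vectors of the size of an activation, whereas the explicit Jacobian has one entry per output-input pair.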

This is why matrix calculus is essential for deep learning.

140.9 Attention

Attention is one of the most important linear-algebraic mechanisms in modern AI.

Given input vectors collected in a matrix

$$ X, $$

a transformer forms query, key, and value matrices:

$$ Q=XW^Q,\qquad K=XW^K,\qquad V=XW^V. $$

The attention score matrix is

$$ QK^T. $$

Scaled dot-product attention is

$$ \operatorname{Attention}(Q,K,V) = \operatorname{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)V. $$

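Here $d_k$ is the dimension of the query and key vectors, and the softmax is applied to each row of the score matrix. The query, key, and value matrices are produced by learned linear projections, and the attention scores are computed by matrix multiplication.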
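
A minimal single-head sketch of the formula in NumPy follows. The dimensions, the random projection matrices, and the names `softmax`, `attention`, `W_Q`, `W_K`, and `W_V` are assumptions for illustration; masking and batching are omitted.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax, shifted by the row maximum for numerical safety."""
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    return P / P.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # n x n score matrix
    weights = softmax(scores)           # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4               # sequence length and dimensions (assumed)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, weights.shape)
```

Each output row is a convex combination of the rows of $V$, with coefficients given by the corresponding row of the weight matrix.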

140.10 Multi-Head Attention

Multi-head attention uses several attention operations in parallel.

Each head has its own projection matrices:

$$ W_i^Q,\qquad W_i^K,\qquad W_i^V. $$

Each head computes attention in a different learned subspace.

The head outputs $H_1,\ldots,H_h$ are concatenated and multiplied by another matrix:

$$ \operatorname{MultiHead}(X) = \operatorname{Concat}(H_1,\ldots,H_h)W^O. $$

This allows the model to represent multiple kinds of relationships at once.

Some heads may track local syntax. Others may track long-range dependencies, positional structure, or semantic relations.
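
The concatenation step can be sketched as follows. This is an illustrative toy under stated assumptions: two heads, a per-head dimension of `d_model // h`, random projections, and hypothetical function names `head` and `multi_head`.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax with the row maximum subtracted first."""
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    return P / P.sum(axis=-1, keepdims=True)

def head(X, W_Q, W_K, W_V):
    """One attention head with its own projection matrices."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, heads, W_O):
    """Concatenate the head outputs along the feature axis, then project."""
    H = [head(X, *params) for params in heads]
    return np.concatenate(H, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
    for _ in range(h)
]
W_O = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(n, d_model))

out = multi_head(X, heads, W_O)
print(out.shape)
```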

140.11 Low-Rank Structure in Attention

The attention matrix has size roughly

$$ n\times n, $$

where $n$ is the sequence length.

This can be expensive for long sequences.

Many efficient transformer methods exploit linear-algebraic structure to reduce this cost. One common idea is low-rank approximation. Linformer, for example, approximates the self-attention matrix by a low-rank matrix, which reduces the cost in sequence length from quadratic to linear when its approximation assumptions hold.

The general principle is:

$$ \text{large dense matrix} \approx \text{smaller structured factors}. $$

This connects transformer efficiency with matrix approximation.

140.12 State Space Models

Some modern sequence models use state space equations instead of full attention.

A linear state space model has the form

$$ h_{t+1}=Ah_t+Bx_t, $$

$$ y_t=Ch_t+Dx_t. $$

Here:

| Symbol | Meaning |
|---|---|
| $h_t$ | Hidden state |
| $x_t$ | Input |
| $y_t$ | Output |
| $A,B,C,D$ | Learned matrices |

State space models use recurrence, convolution, and structured matrices to process long sequences.
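
The link between recurrence and convolution can be seen by unrolling the state equation from $h_0=0$, which gives $y_t=Dx_t+\sum_{s<t}CA^{t-1-s}Bx_s$. The sketch below compares the two forms on random matrices; the dimensions, the zero initial state, and the names `ssm_recurrent` and `ssm_convolution` are assumptions, and the convolution is written naively for clarity rather than speed.

```python
import numpy as np

def ssm_recurrent(A, B, C, D, xs):
    """Run h_{t+1} = A h_t + B x_t and y_t = C h_t + D x_t from h_0 = 0."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        ys.append(C @ h + D @ x)
        h = A @ h + B @ x
    return np.stack(ys)

def ssm_convolution(A, B, C, D, xs):
    """Same outputs from the unrolled sum over past inputs."""
    ys = []
    for t in range(len(xs)):
        y = D @ xs[t]
        for s in range(t):
            y = y + C @ np.linalg.matrix_power(A, t - 1 - s) @ B @ xs[s]
        ys.append(y)
    return np.stack(ys)

rng = np.random.default_rng(0)
state_dim, input_dim, output_dim, T = 3, 2, 1, 6
A = 0.5 * rng.normal(size=(state_dim, state_dim))
B = rng.normal(size=(state_dim, input_dim))
C = rng.normal(size=(output_dim, state_dim))
D = rng.normal(size=(output_dim, input_dim))
xs = rng.normal(size=(T, input_dim))

ys_rec = ssm_recurrent(A, B, C, D, xs)
ys_conv = ssm_convolution(A, B, C, D, xs)
print(np.allclose(ys_rec, ys_conv))
```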

Mamba is one modern architecture based on selective state spaces and designed for efficient sequence modeling.

140.13 Singular Value Decomposition in AI

The singular value decomposition writes a matrix as

$$ A=U\Sigma V^T. $$

SVD appears in AI through:

| Use | Role |
|---|---|
| Dimensionality reduction | Keep leading singular vectors |
| Compression | Approximate weight matrices |
| Denoising | Remove small singular components |
| Latent semantic analysis | Factor term-document matrices |
| Model analysis | Study learned representations |

Low-rank approximation is especially important when models are large.

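If a weight matrix $W\in\mathbb{R}^{m\times n}$ has low effective rank $r$, it may be approximated by

$$ W\approx UV^T, $$

where $U\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{n\times r}$ with $r\ll\min(m,n)$.

This reduces storage from $mn$ to $r(m+n)$ entries, and the cost of a matrix-vector product by the same ratio.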
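
A truncated SVD gives one way to build such factors. The sketch below is illustrative only: the matrix is synthetic (rank 5 plus small noise), the sizes are arbitrary, and the singular values are absorbed into the left factor, which is one of several equivalent conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 5
W_clean = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # exactly rank r
W = W_clean + 0.01 * rng.normal(size=(m, n))                  # small full-rank noise

U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]        # m x r, singular values absorbed into U
V = Vt[:r].T                     # n x r
W_approx = U @ V.T

rel_error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
params_full = m * n
params_low_rank = r * (m + n)
print(rel_error, params_full, params_low_rank)
```

The Frobenius-norm error of the truncation equals the root sum of squares of the discarded singular values, so inspecting `s` shows in advance how much accuracy a given rank will cost.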

140.14 Principal Component Analysis

Principal component analysis, or PCA, finds directions of maximal variance.

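Given a centered data matrix

$$ X\in\mathbb{R}^{n\times d}, $$

whose rows are the $n$ samples, the covariance matrix is

$$ \frac{1}{n}X^TX. $$

Some texts divide by $n-1$ instead of $n$; this rescales the eigenvalues but does not change the directions. The principal components are the eigenvectors of this covariance matrix, ordered by decreasing eigenvalue. Equivalently, they are the right singular vectors of $X$.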
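
A small numerical sketch of PCA on synthetic two-dimensional data stretched along the direction $(1,1)/\sqrt{2}$. The data, the sample count, and the variable names are assumptions; the example computes the components from the covariance matrix and compares them with the singular value decomposition of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Independent coordinates with standard deviations 3.0 and 0.3 ...
latent = rng.normal(size=(n, 2)) * np.array([3.0, 0.3])
# ... rotated so the high-variance axis points along (1, 1) / sqrt(2).
R = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
X = latent @ R.T
X = X - X.mean(axis=0)                   # center the data

cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
eigvals = eigvals[::-1]                  # reorder to descending
components = eigvecs[:, ::-1]            # columns are principal directions

# The same directions from the SVD of the data matrix.
_, s, Vt = np.linalg.svd(X, full_matrices=False)

print(components[:, 0], eigvals)
```

Eigenvectors are only determined up to sign, so comparisons between the two routes should use absolute values of inner products.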

PCA is used for:

| Task | Purpose |
|---|---|
| Visualization | Reduce to 2 or 3 dimensions |
| Preprocessing | Remove redundant dimensions |
| Denoising | Keep dominant components |
| Representation analysis | Inspect embedding geometry |

PCA is one of the classical bridges between linear algebra and data analysis.

140.15 Matrix Factorization for Recommendation

Recommender systems often use matrix factorization.

Let

$$ R\in\mathbb{R}^{m\times n} $$

be a user-item rating matrix.

The goal is to approximate

$$ R\approx UV^T, $$

where:

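| Matrix | Meaning |
|---|---|
| $U$ | User factors |
| $V$ | Item factors |

The predicted rating for user $i$ and item $j$ is

$$ u_i^Tv_j, $$

where $u_i$ is row $i$ of $U$ and $v_j$ is row $j$ of $V$, both written as column vectors.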

This model says that users and items live in the same latent vector space.

Recommendation becomes inner-product prediction.
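
One simple way to fit the factors is gradient descent on the squared error over observed entries. The sketch below uses a synthetic rank-3 rating matrix with about 70% of entries observed; the loss, the learning rate `lr`, the step count, and the absence of regularization are all assumptions made for illustration, and production systems use more elaborate methods.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 15, 3
R = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # synthetic rank-3 "ratings"
mask = rng.random(size=(m, n)) < 0.7                     # True where a rating is observed

def observed_rmse(U, V):
    """Root-mean-square error of U V^T on the observed entries only."""
    err = (U @ V.T - R)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

U = 0.1 * rng.normal(size=(m, r))       # user factors, one row per user
V = 0.1 * rng.normal(size=(n, r))       # item factors, one row per item
initial_rmse = observed_rmse(U, V)

lr = 0.01
for step in range(5000):
    err = mask * (U @ V.T - R)          # residual, zeroed on unobserved entries
    # Simultaneous update: both gradients use the factors from before the step.
    U, V = U - lr * err @ V, V - lr * err.T @ U

final_rmse = observed_rmse(U, V)
predicted = U[0] @ V[0]                 # predicted rating for user 0 and item 0
print(initial_rmse, final_rmse)
```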

140.16 Graph Neural Networks

Graphs are common in AI: social networks, molecules, knowledge graphs, citation networks, and recommendation systems.

A graph neural network updates node features using neighboring nodes.

A simple message-passing layer has the form

$$ H_{k+1} = \sigma(\widetilde{A}H_kW_k), $$

where:

| Symbol | Meaning |
|---|---|
| $\widetilde{A}$ | Normalized adjacency matrix |
| $H_k$ | Node feature matrix |
| $W_k$ | Weight matrix |
| $\sigma$ | Activation |

The adjacency matrix determines how information flows across the graph.

This layer combines ideas from spectral graph theory with neural networks.
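
A minimal sketch of one such layer on a four-node path graph. The text does not fix a normalization, so the symmetric form $D^{-1/2}(A+I)D^{-1/2}$ with self-loops is an assumption here, as are the ReLU activation and the names `normalized_adjacency` and `message_passing_layer`.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def message_passing_layer(A_norm, H, W):
    """One layer H_next = relu(A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Path graph on four nodes: 0 - 1 - 2 - 3.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))             # three features per node
W0 = rng.normal(size=(3, 2))             # maps three features to two
A_norm = normalized_adjacency(A)
H1 = message_passing_layer(A_norm, H0, W0)
print(H1.shape)
```

After one layer each node sees only its direct neighbors and itself; stacking $k$ layers lets information travel $k$ edges.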

140.17 Generative Models

Generative AI models produce new samples.

Linear algebra appears in several forms:

| Model type | Linear algebra role |
|---|---|
| Language models | Token embeddings and attention |
| Diffusion models | Noise vectors and denoising networks |
| Image generators | Latent spaces and convolutional layers |
| Autoencoders | Encoder and decoder maps |
| GANs | Generator and discriminator matrices |

Latent vector spaces are central. A model often maps a vector

$$ z\in\mathbb{R}^d $$

to a generated object.

Interpolating between latent vectors can produce smooth changes in generated outputs.

140.18 Retrieval-Augmented Generation

Retrieval-augmented generation combines search with generation.

Documents are embedded as vectors:

$$ d_1,\ldots,d_N. $$

A query is embedded as

$$ q. $$

Retrieval selects documents with large similarity scores:

$$ q^Td_i. $$

The selected documents are then passed to a language model as context.

Thus RAG systems depend heavily on:

| Component | Linear algebra operation |
|---|---|
| Embedding model | Vector representation |
| Vector database | Nearest-neighbor search |
| Similarity scoring | Dot products or cosine similarity |
| Reranking | Matrix and vector scoring |
| Generation | Transformer inference |

The retrieval step is essentially large-scale vector search.
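
The scoring and selection step can be sketched as a brute-force search over all document vectors. Everything here is synthetic and assumed for illustration: random unit vectors stand in for real embeddings, the query is a slightly perturbed copy of document 42, and `top_k` is a hypothetical helper. Vector databases typically replace the exhaustive scan with approximate nearest-neighbor indexes.

```python
import numpy as np

def top_k(query, docs, k):
    """Indices and scores of the k rows of docs with the largest inner product."""
    scores = docs @ query                 # one dot product per document
    idx = np.argsort(-scores)[:k]         # sort by descending score
    return idx, scores[idx]

rng = np.random.default_rng(0)
N, d = 1000, 16
docs = rng.normal(size=(N, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit rows: dot product = cosine

q = docs[42] + 0.05 * rng.normal(size=d)              # query near document 42
q /= np.linalg.norm(q)

idx, scores = top_k(q, docs, k=3)
print(idx, scores)
```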

140.19 Model Compression

Large AI models are expensive to store and run.

Linear algebra supports compression through:

| Method | Linear algebra idea |
|---|---|
| Low-rank factorization | Replace $W$ by $UV^T$ |
| Pruning | Remove small or unimportant weights |
| Quantization | Store lower-precision values |
| Sparse matrices | Exploit zeros |
| Distillation | Approximate one function by another |

Low-rank methods explicitly use matrix factorization. Quantization changes the scalar representation. Sparse methods change the matrix storage pattern.

These techniques reduce memory bandwidth and computational cost.
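
As one concrete instance of changing the scalar representation, the sketch below applies symmetric per-tensor 8-bit quantization to a random matrix. This particular scheme, the function names `quantize_int8` and `dequantize`, and the matrix size are assumptions; practical systems often use per-channel scales or other formats.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor quantization; assumes W has at least one nonzero entry."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to floating point."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # float64 weights
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)

max_error = np.abs(W - W_hat).max()
print(max_error, scale / 2, W.nbytes, q.nbytes)
```

Rounding to the nearest integer bounds the per-entry error by half the scale, so a single large outlier in the matrix coarsens the grid for every other entry.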

140.20 Hardware and Tensor Algebra

AI hardware is optimized for tensor operations.

A tensor program is usually a sequence of operations such as:

$$ C = AB, $$

$$ Y = XW+b, $$

$$ QK^T, $$

and reductions such as sums, norms, and softmax.

Performance depends on:

| Factor | Linear algebra issue |
|---|---|
| Matrix shape | Arithmetic intensity |
| Memory layout | Data movement |
| Precision | Numerical error |
| Blocking | Cache and accelerator use |
| Sparsity | Irregular computation |

Even when the model is described statistically, the execution is numerical linear algebra.

140.21 AI for Linear Algebra

AI is also being used to discover or accelerate linear algebra algorithms.

One example is AlphaTensor, which used reinforcement learning to search for matrix multiplication algorithms with fewer scalar multiplications. Matrix multiplication is a core operation in linear algebra and machine learning, so algorithmic improvements can matter at large scale.

This reverses the usual relationship.

Linear algebra supports AI, and AI can search for better linear algebra procedures.

140.22 Numerical Stability

Modern AI uses finite precision arithmetic.

Common formats include 32-bit floating point, 16-bit floating point, bfloat16, and lower-precision quantized formats.

Numerical issues include:

| Issue | Effect |
|---|---|
| Overflow | Values exceed representable range |
| Underflow | Values become too small |
| Roundoff | Accumulated arithmetic error |
| Ill-conditioning | Small perturbations become large |
| Instability in softmax | Large exponentials |

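Stable implementations often subtract the maximum input from every input before applying softmax. The shift leaves the result unchanged, because softmax is invariant to adding a constant to all inputs, but it keeps the exponentials in a safe numerical range. This is standard in attention implementations.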
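
A short demonstration of the difference in 64-bit floating point. The input values are chosen to force overflow in the naive formula, and the name `softmax_stable` is an assumption of this sketch.

```python
import numpy as np

def softmax_stable(z):
    """Softmax of a 1-D array, with the maximum subtracted before exponentiating."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])

# Naive formula: exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()

p = softmax_stable(z)
print(naive)
print(p)
```

After the shift the largest exponent is exactly zero, so every exponential lies in the interval $(0,1]$ and the denominator is at least 1.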

140.23 Summary

Modern AI is applied linear algebra at large scale.

The central ideas are:

| Concept | AI role |
|---|---|
| Vectors | Represent data and parameters |
| Matrices | Represent learned transformations |
| Tensors | Store batches, activations, and weights |
| Inner products | Similarity and attention scores |
| Matrix multiplication | Core computation |
| Gradients | Training signal |
| Jacobians | Chain rule and backpropagation |
| SVD | Compression and dimensionality reduction |
| Eigenvectors | PCA, graph learning, spectral methods |
| Low-rank approximation | Efficient models |
| Sparse matrices | Efficient storage and computation |
| State space matrices | Long-sequence modeling |
| Vector search | Retrieval and recommendation |

AI systems may appear complex at the application level, but their computational core is a small set of linear-algebraic operations repeated at very large scale.