Inference Optimization

Training produces model parameters. Inference uses those parameters to generate predictions.

Inference optimization studies how to make model execution faster, cheaper, smaller, and more memory-efficient while preserving acceptable output quality.

For small models, naive inference may be sufficient. For foundation models, the cumulative cost of inference often exceeds the cost of training because deployed systems may serve millions or billions of requests.

A language model trained once may perform inference continuously for years.

Inference systems therefore optimize:

  • latency
  • throughput
  • memory usage
  • energy efficiency
  • hardware utilization
  • serving cost

Training Versus Inference

Training and inference have different computational characteristics.

| Property | Training | Inference |
| --- | --- | --- |
| Forward pass | Yes | Yes |
| Backward pass | Yes | No |
| Gradient storage | Required | Not needed |
| Optimizer state | Required | Not needed |
| Numerical precision | Often mixed precision | Often lower precision |
| Latency sensitivity | Usually low | Often critical |
| Throughput focus | Tokens/images per second | Requests per second |

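Autoregressive decoding at small batch sizes is usually memory-bandwidth constrained rather than compute constrained, especially for large transformers, because every weight must be read from memory to produce each token. Prompt processing and large-batch inference are more often compute constrained.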
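
The bandwidth limit can be made concrete with a rough lower bound: if every weight is read once per generated token, the step cannot finish faster than the time needed to stream the weights from memory. The sketch below assumes a hypothetical 7-billion-parameter model in fp16 and a hypothetical device with 1 TB/s of memory bandwidth; the function name and all numbers are illustrative.

```python
def min_decode_latency_s(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Lower bound on per-token decode time if all weights are read once per step."""
    return param_count * bytes_per_param / mem_bandwidth_bytes_per_s


# Hypothetical figures, chosen only for illustration.
latency = min_decode_latency_s(
    param_count=7e9,                  # assumed model size
    bytes_per_param=2,                # fp16
    mem_bandwidth_bytes_per_s=1e12,   # assumed 1 TB/s
)
print(f"{latency * 1e3:.1f} ms per token at batch size 1")
```

Under these assumptions the bound is about 14 ms per token regardless of how much arithmetic throughput the device has, which is why reducing bytes read per token (batching, quantization) matters so much for decoding.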

Inference Workloads

Inference workloads vary substantially.

| Workload | Example |
| --- | --- |
| Batch inference | Offline embedding generation |
| Real-time inference | Chat applications |
| Streaming generation | Token-by-token LLM decoding |
| Edge inference | Mobile or embedded devices |
| Interactive multimodal systems | Vision-language assistants |

Different workloads require different optimizations.

For example:

| System | Important metric |
| --- | --- |
| Real-time chatbot | Low latency |
| Embedding pipeline | High throughput |
| Mobile model | Low memory and energy |
| Datacenter serving | Cost efficiency |

Autoregressive Decoding

Large language models usually generate tokens autoregressively.

Given previous tokens:

$$ [t_1, t_2, \ldots, t_n], $$

the model predicts:

$$ p(t_{n+1}\mid t_{\le n}). $$

Then the next token is appended and the process repeats.

This sequential dependency limits parallelism because token $t_{n+1}$ must be generated before predicting $t_{n+2}$.

Training parallelizes across sequence positions. Inference cannot fully do this during generation.

Autoregressive decoding is therefore a major inference bottleneck.
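
The sequential dependency is easiest to see in a decoding loop. The sketch below is a minimal greedy loop in which `toy_next_token` is a hypothetical stand-in for a real model's forward pass plus argmax; it is not any particular library's API.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model: maps a token prefix to the next token id."""
    return (sum(tokens) * 3 + len(tokens)) % 10


def generate(prompt, max_new_tokens, next_token=toy_next_token, eos=None):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Each call needs every token produced so far, so the steps
        # cannot run in parallel with one another.
        nxt = next_token(tokens)
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens


out = generate([1, 2, 3], max_new_tokens=4)
print(out)
```

Each iteration consumes the output of the previous one, so generating $N$ tokens takes $N$ dependent model calls.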

KV Cache

Recomputing the attention keys and values of every earlier token at each generation step would be extremely inefficient.

Suppose a sequence has length $T$. Naively recomputing all attention states at every generation step would repeatedly process earlier tokens.

Instead, inference systems use a key-value cache, usually called a KV cache.

For each transformer layer:

| Stored tensor | Meaning |
| --- | --- |
| Keys | Attention key projections |
| Values | Attention value projections |

At generation step $t$, only the newest token requires new computation. Earlier keys and values are reused.

Without KV caching, generation cost grows roughly as:

$$ O(T^2) $$

per generated token.

With caching, each step computes attention only between the newest token and the cached keys and values, which costs roughly $O(T)$ per generated token.

KV caching is essential for efficient transformer serving.
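
A minimal single-head sketch of the idea follows. The `project` function is a hypothetical stand-in for learned key, value, and query projections, and the vectors are tiny so the example runs in plain Python. It checks that the cached path gives the same outputs as recomputing every key and value at each step.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over lists of key and value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def project(x, factor):
    """Hypothetical stand-in for a learned linear projection."""
    return [factor * xi for xi in x]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Only the newest token is projected; earlier keys and values are reused.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        return attend(project(x, 1.0), self.keys, self.values)


def step_without_cache(prefix):
    # Recomputes keys and values for the entire prefix on every step.
    keys = [project(x, 0.5) for x in prefix]
    values = [project(x, 2.0) for x in prefix]
    return attend(project(prefix[-1], 1.0), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached = [cache.step(x) for x in tokens]
uncached = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

The cached version performs one projection per step, while the uncached version performs as many as there are tokens in the prefix.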

Memory Cost of KV Caches

KV caches consume substantial memory.

Approximate KV cache memory:

$$ \text{memory} \propto L \times T \times H \times D, $$

where:

| Symbol | Meaning |
| --- | --- |
| $L$ | Number of layers |
| $T$ | Sequence length |
| $H$ | Attention heads |
| $D$ | Head dimension |

Long-context inference therefore becomes memory-intensive.

Example pressures include:

  • many concurrent users
  • long conversations
  • retrieval-augmented prompts
  • large batch serving

With long contexts or many concurrent requests, an inference system can spend more memory on KV caches than on model parameters.
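
As a worked example of the proportionality above, the sketch below turns it into bytes. The factor of 2 (one tensor for keys, one for values), the bytes per stored value, and the batch dimension are the constants hidden by the $\propto$ sign. The model shape is hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(layers, seq_len, heads, head_dim, bytes_per_value=2, batch=1):
    # The leading 2 accounts for storing both keys and values.
    return 2 * layers * seq_len * heads * head_dim * bytes_per_value * batch


# Hypothetical shape: 32 layers, 32 heads of dimension 128, fp16 cache.
per_sequence = kv_cache_bytes(layers=32, seq_len=4096, heads=32, head_dim=128)
print(per_sequence / 2**30, "GiB per 4096-token sequence")

# The same shape with 16 concurrent sequences.
concurrent = kv_cache_bytes(layers=32, seq_len=4096, heads=32, head_dim=128, batch=16)
print(concurrent / 2**30, "GiB for 16 concurrent sequences")
```

Under these assumptions a single 4096-token sequence needs 2 GiB of cache, and 16 concurrent sequences need 32 GiB.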

Quantization

Quantization reduces numerical precision to lower memory and compute cost.

Instead of storing parameters in fp16 or fp32, systems may use:

| Format | Bits |
| --- | --- |
| fp32 | 32 |
| fp16 | 16 |
| bf16 | 16 |
| int8 | 8 |
| int4 | 4 |

A quantized parameter approximation:

$$ W \approx s(q - z), $$

where:

| Symbol | Meaning |
| --- | --- |
| $q$ | Quantized integer |
| $s$ | Scale |
| $z$ | Zero point |

Quantization reduces:

  • memory footprint
  • bandwidth usage
  • inference latency

A 4-bit model may require roughly one-quarter the parameter memory of a 16-bit model.
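
The approximation $W \approx s(q - z)$ can be sketched in a few lines. The version below is a simple per-tensor affine scheme that picks the scale from the minimum and maximum weight. This is one common choice and an assumption here, not the only way to set $s$ and $z$.

```python
def quantize(weights, bits=8):
    """Map floats to unsigned integers so that w is approximately scale * (q - zero_point)."""
    qmin, qmax = 0, 2**bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        raise ValueError("this sketch needs at least two distinct weight values")
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)   # integer that represents 0.0
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights, bits=8)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

The reconstruction error per weight is on the order of half a quantization step, so fewer bits (a larger `scale`) means a larger error.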

Quantization Tradeoffs

Quantization introduces approximation error.

Tradeoffs include:

| Advantage | Cost |
| --- | --- |
| Lower memory | Lower numerical precision |
| Faster inference | Possible accuracy degradation |
| Larger batch serving | More implementation complexity |

Some layers are more sensitive than others.

Common approaches include:

| Method | Idea |
| --- | --- |
| Post-training quantization | Quantize after training |
| Quantization-aware training | Simulate quantization during training |
| Mixed-precision quantization | Different layers use different precision |

Modern language models can often tolerate surprisingly aggressive quantization.

Weight-Only Quantization

At small batch sizes and short contexts, weights dominate memory usage in many transformer systems.

Weight-only quantization stores weights in lower precision while keeping activations in higher precision.

Example:

| Tensor type | Precision |
| --- | --- |
| Weights | int4 |
| Activations | fp16 |
| KV cache | fp16 |

This approach is attractive because it simplifies implementation while greatly reducing parameter memory.
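
A short calculation shows the size of the saving. The sketch assumes a hypothetical 7-billion-parameter model and a group-wise scheme in which every group of 128 weights shares one 16-bit scale. The group size and scale width are illustrative assumptions, and zero points are ignored.

```python
def weight_bytes(param_count, bits_per_weight, group_size=None, scale_bits=16):
    """Parameter memory in bytes, optionally counting one shared scale per group."""
    bits = bits_per_weight
    if group_size is not None:
        bits += scale_bits / group_size   # amortized cost of the per-group scale
    return param_count * bits / 8


params = 7_000_000_000                     # hypothetical model size
fp16_bytes = weight_bytes(params, 16)
int4_bytes = weight_bytes(params, 4, group_size=128)
ratio = int4_bytes / fp16_bytes
print(f"fp16: {fp16_bytes / 1e9:.2f} GB, int4: {int4_bytes / 1e9:.2f} GB, ratio {ratio:.3f}")
```

With these assumptions the effective cost is 4.125 bits per weight, so the int4 weights take slightly more than one-quarter of the fp16 footprint.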

Activation Quantization

Activation quantization reduces precision of intermediate tensors during inference.

This further reduces memory and bandwidth, but activations are often more sensitive than weights.

Challenges include:

  • outlier activations
  • varying tensor distributions
  • dynamic ranges changing during inference

Activation quantization is especially difficult for large transformers, where a small number of channels can carry outlier values much larger than the rest.

Operator Fusion

Modern neural networks contain many small operations:

  • matrix multiplication
  • bias addition
  • normalization
  • activation functions

Naively executing each operation separately creates overhead from:

  • kernel launches
  • memory reads and writes
  • synchronization

Operator fusion combines multiple operations into one kernel.

Example:

$$ y = \text{GELU}(Wx + b) $$

may be fused into one execution unit instead of separate:

  1. matrix multiplication
  2. bias addition
  3. activation

Fusion improves:

| Benefit | Reason |
| --- | --- |
| Throughput | Less overhead |
| Memory efficiency | Fewer intermediate tensors |
| Cache locality | Better reuse |

Inference compilers rely heavily on fusion.
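
The structure of the GELU example can be illustrated in plain Python. This is only a sketch of the idea: real fusion happens inside compiled GPU or CPU kernels, and the savings come from avoided kernel launches and memory traffic, which pure Python does not model. The unfused version materializes two intermediate lists, while the fused version computes each output element in one pass.

```python
import math


def gelu(v):
    # Exact GELU using the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def unfused(W, x, b):
    matmul = [sum(w * xi for w, xi in zip(row, x)) for row in W]   # intermediate 1
    biased = [m + bi for m, bi in zip(matmul, b)]                  # intermediate 2
    return [gelu(v) for v in biased]


def fused(W, x, b):
    # Matrix-vector product, bias, and activation per output element,
    # with no intermediate lists written out.
    return [gelu(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]


W = [[1.0, -2.0], [0.5, 0.25]]
x = [3.0, 1.0]
b = [0.1, -0.2]
y_unfused = unfused(W, x, b)
y_fused = fused(W, x, b)
```

Both functions perform the same arithmetic in the same order, so their results match. Only the amount of intermediate storage differs.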

Compilation and Graph Optimization

Eager execution is flexible but may introduce overhead.

Inference systems often convert models into optimized computation graphs.

Common graph optimizations include:

| Optimization | Purpose |
| --- | --- |
| Operator fusion | Reduce overhead |
| Constant folding | Precompute constants |
| Dead code elimination | Remove unused operations |
| Kernel selection | Choose optimized implementations |
| Layout optimization | Improve memory access |
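
Two of these passes, constant folding and dead code elimination, are simple enough to sketch on a toy graph. The graph representation below (a dict from node name to a tuple) is a hypothetical format invented for this example, not the intermediate representation of any real compiler.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(graph):
    """Replace ops whose inputs are all constants with a single constant node.

    Nodes are ("input",), ("const", value), or (op, lhs_name, rhs_name),
    and are assumed to be listed in topological order.
    """
    out = {}
    for name, node in graph.items():
        if node[0] in OPS:
            lhs, rhs = out[node[1]], out[node[2]]
            if lhs[0] == "const" and rhs[0] == "const":
                node = ("const", OPS[node[0]](lhs[1], rhs[1]))
        out[name] = node
    return out


def eliminate_dead_code(graph, outputs):
    """Keep only nodes reachable from the requested outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        node = graph[name]
        if node[0] in OPS:
            stack.extend(node[1:])
    return {name: node for name, node in graph.items() if name in live}


graph = {
    "x": ("input",),
    "two": ("const", 2.0),
    "three": ("const", 3.0),
    "six": ("mul", "two", "three"),   # both inputs constant: foldable
    "y": ("mul", "x", "six"),
    "unused": ("add", "x", "two"),    # never reaches the output
}
optimized = eliminate_dead_code(fold_constants(graph), outputs=["y"])
print(optimized)
```

After both passes the six-node graph shrinks to three nodes: the input, the precomputed constant, and the single remaining multiplication.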

Common inference runtimes include:

| Runtime | Use |
| --- | --- |
| TorchScript | PyTorch graph execution |
| TensorRT | NVIDIA inference optimization |
| ONNX Runtime | Portable graph execution |
| TVM | Compiler optimization |
| XLA | Accelerated graph compilation |

Batch Inference

Inference systems often combine requests into batches.

Instead of processing one example:

$$ x_1, $$

the system processes:

$$ [x_1, x_2, \ldots, x_B]. $$

Batching improves hardware utilization because GPUs are optimized for large tensor operations.

Advantages:

| Benefit | Reason |
| --- | --- |
| Higher throughput | Better GPU occupancy |
| Better amortization | Shared kernel overhead |
| Improved efficiency | Larger matrix multiplications |

Disadvantages:

| Problem | Explanation |
| --- | --- |
| Higher latency | Requests wait for batching |
| Uneven sequence lengths | Padding inefficiency |
| Scheduling complexity | Dynamic request arrival |

Serving systems must balance throughput against latency.

Continuous Batching

Traditional (static) batching assembles a batch and then runs it until every request in it has finished.

Continuous batching dynamically inserts and removes requests during generation.

This is especially important for LLM serving because different requests finish at different times.

Example:

| Request | Length |
| --- | --- |
| A | 20 tokens |
| B | 300 tokens |
| C | 50 tokens |

Without continuous batching, the slot held by a short request sits idle until the longest request in its batch finishes, and newly arrived requests wait behind that batch.

Continuous batching keeps the GPU busy while minimizing wasted slots.

Modern LLM serving systems heavily rely on this technique.
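
The three requests above can be run through a small simulation. The sketch assumes a server with two decoding slots, one token per request per step, and all three requests present at time zero. The scheduler functions are hypothetical and far simpler than a production scheduler.

```python
from collections import deque


def static_batching(lengths, slots):
    """Finish step of each request when a batch must fully drain before the next starts."""
    finish, clock = {}, 0
    names = list(lengths)
    for i in range(0, len(names), slots):
        batch = names[i : i + slots]
        for name in batch:
            finish[name] = clock + lengths[name]
        clock += max(lengths[name] for name in batch)   # wait for the longest request
    return finish


def continuous_batching(lengths, slots):
    """Finish step of each request when freed slots are refilled immediately."""
    finish, clock = {}, 0
    waiting = deque(lengths)
    running = {}   # request name -> tokens still to generate
    while waiting or running:
        while waiting and len(running) < slots:          # fill free slots
            name = waiting.popleft()
            running[name] = lengths[name]
        clock += 1                                       # one decode step for all running
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                finish[name] = clock
                del running[name]
    return finish


lengths = {"A": 20, "B": 300, "C": 50}
static = static_batching(lengths, slots=2)
continuous = continuous_batching(lengths, slots=2)
print(static)
print(continuous)
```

Under these assumptions request C finishes at step 350 with static batching, because it cannot start until B is done, but at step 70 with continuous batching, because it takes over A's slot as soon as A finishes.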

Speculative Decoding

Autoregressive generation is sequential and slow.

Speculative decoding accelerates generation using a smaller draft model.

Workflow:

  1. small model predicts several candidate tokens
  2. large model verifies them
  3. accepted tokens are committed
  4. the first rejected token is replaced by the large model's own prediction, and the draft tokens after it are discarded

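If the draft model agrees with the large model often enough, several tokens are committed per large-model pass and generation latency drops substantially.

The idea exploits the fact that the large model can verify several proposed tokens in one parallel forward pass, which is cheaper than generating them one at a time.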
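
The workflow can be sketched for the greedy case, where a draft token is accepted only if it equals the large model's own greedy choice. Both models below are hypothetical toy functions. Verification is written as a loop for clarity, whereas a real system would score all proposed positions in one batched forward pass. Sampling-based variants use a probabilistic acceptance rule that is not shown here.

```python
def target_model(tokens):
    """Hypothetical large model: greedy next token for a prefix."""
    return (sum(tokens) * 3 + len(tokens)) % 10


def draft_model(tokens):
    """Hypothetical small model: agrees with the target except on some prefixes."""
    return target_model(tokens) if len(tokens) % 3 else 0


def greedy_generate(prompt, model, num_new):
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(tokens))
    return tokens


def speculative_generate(prompt, target, draft, num_new, k=4):
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens sequentially.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each proposed position.
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # 3 and 4. Commit the matching prefix, substitute the target's
                # token at the first mismatch, and drop the rest of the draft.
                tokens += proposal[:i] + [expected]
                break
        else:
            tokens += proposal   # every proposed token was accepted
    return tokens[:goal]


reference = greedy_generate([1, 2, 3], target_model, num_new=12)
speculative = speculative_generate([1, 2, 3], target_model, draft_model, num_new=12)
```

In this greedy setting the speculative output is identical to ordinary greedy decoding with the large model. The draft model affects only how many tokens are committed per verification round, not which tokens are produced.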

Mixture-of-Experts Inference

Mixture-of-experts models activate only part of the network for each token.

Instead of computing all experts:

$$ f(x) = \sum_{i=1}^{N} g_i(x)E_i(x), $$

only a subset is selected, typically the top $k$ experts by gate value, so that $g_i(x) = 0$ for the rest.

This reduces computation per token while increasing total parameter count.

However, MoE inference introduces routing complexity:

  • token dispatch
  • load balancing
  • expert communication

Efficient MoE serving is therefore partly a systems problem.
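
A minimal top-$k$ routing sketch for a single token follows. The gate logits are passed in directly rather than computed by a learned router, and the experts are toy scalar functions. Both are simplifying assumptions, as is renormalizing the gate weights with a softmax over only the selected experts.

```python
import math


def top_k_moe(x, gate_logits, experts, k=2):
    """Evaluate only the k experts with the largest gate logits."""
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exp = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exp.values())
    # Experts outside `top` are never called: their gate weight is zero.
    output = sum(exp[i] / total * experts[i](x) for i in top)
    return output, top


calls = []


def make_expert(index, fn):
    def expert(x):
        calls.append(index)   # record which experts actually ran
        return fn(x)
    return expert


experts = [
    make_expert(0, lambda x: x),
    make_expert(1, lambda x: 2 * x),
    make_expert(2, lambda x: -x),
    make_expert(3, lambda x: x * x),
]
y, selected = top_k_moe(3.0, gate_logits=[0.0, 1.0, -1.0, 1.0], experts=experts, k=2)
print(y, selected, calls)
```

Only two of the four experts run for this token, even though all four sets of parameters must be resident somewhere in the serving system.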

Attention Optimization

Attention becomes expensive for long contexts.

Standard attention complexity:

$$ O(T^2), $$

where $T$ is sequence length.

Long-context systems therefore use:

| Method | Idea |
| --- | --- |
| FlashAttention | Memory-efficient kernels |
| Sliding-window attention | Local attention regions |
| Sparse attention | Ignore many token pairs |
| Linear attention | Approximate softmax attention |
| Paged attention | Efficient KV cache management |

FlashAttention became especially important because it reduces memory traffic while preserving exact attention computation.
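
Exact attention can be computed block by block because softmax can be accumulated with a running maximum and a running normalizer, often called online softmax. The sketch below shows that numerical idea for one query in plain Python. It is not the FlashAttention kernel itself, which additionally tiles the computation to fit on-chip memory.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: all scores are materialized at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_blockwise(q, keys, values, block=2):
    """Same result, processing keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        ks = keys[start : start + block]
        vs = values[start : start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0], [3.0, 0.5], [0.0, -2.0]]
values = [[1.0, 2.0], [0.0, 1.0], [4.0, -1.0], [2.0, 2.0], [-3.0, 0.5]]
full = attention_reference(q, keys, values)
blocked = attention_blockwise(q, keys, values, block=2)
```

The blockwise version never holds more than one block of scores at a time, yet it matches the reference up to floating-point rounding.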

Paged Attention

Large language model serving often suffers from KV cache fragmentation.

Paged attention organizes KV memory into blocks or pages, similar to virtual memory systems.

Benefits include:

| Benefit | Explanation |
| --- | --- |
| Better memory utilization | Reduced fragmentation |
| Flexible request scheduling | Easier dynamic batching |
| Efficient cache reuse | Improved serving throughput |

Paged attention systems became important for large-scale multi-user inference servers.
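
The bookkeeping behind this idea can be sketched as a block allocator with a per-request block table, analogous to a page table. The class below is a hypothetical illustration that tracks only which physical block holds which token position. It stores no actual keys or values and is not the data structure of any specific serving system.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; returns (physical block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]   # needs two blocks
slots_b = [alloc.append_token("b")]                     # needs one block
free_before = len(alloc.free_blocks)
alloc.release("a")
free_after = len(alloc.free_blocks)
```

Because blocks are allocated on demand and need not be contiguous, a request wastes at most one partially filled block, and blocks freed by a finished request are immediately reusable by any other request.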

CPU, GPU, and Accelerator Inference

Inference hardware varies widely.

| Hardware | Strength |
| --- | --- |
| CPU | Flexibility, low-volume serving |
| GPU | High throughput |
| TPU | Large-scale serving |
| Edge accelerators | Low power |
| Mobile NPUs | On-device inference |

The optimal deployment depends on:

  • latency requirements
  • throughput requirements
  • memory constraints
  • deployment environment
  • cost targets

A small quantized model may run efficiently on a phone. A frontier language model may require dozens of GPUs for interactive serving.

Edge Inference

Edge inference runs models close to the user:

  • phones
  • browsers
  • robots
  • embedded devices
  • autonomous systems

Advantages:

| Benefit | Reason |
| --- | --- |
| Lower latency | No network round-trip |
| Better privacy | Data stays local |
| Offline capability | No server required |

Constraints:

| Constraint | Problem |
| --- | --- |
| Limited memory | Small devices |
| Power consumption | Battery limits |
| Thermal limits | Sustained compute restrictions |

Edge systems therefore rely heavily on:

  • quantization
  • pruning
  • compact architectures
  • hardware-specific optimization

Serving Systems

Modern inference serving systems coordinate:

  • batching
  • scheduling
  • memory management
  • caching
  • load balancing
  • request routing

Common serving frameworks include:

| Framework | Use |
| --- | --- |
| TorchServe | PyTorch deployment |
| Triton Inference Server | Multi-model serving |
| vLLM | Efficient LLM serving |
| TensorRT-LLM | NVIDIA optimized LLM inference |
| Ray Serve | Distributed serving |

Serving infrastructure often becomes a major engineering domain separate from model training.

Cost as the Main Constraint

At scale, inference cost dominates deployment economics.

A model serving millions of users may process enormous token volumes daily.

Key cost drivers include:

| Driver | Impact |
| --- | --- |
| Parameter count | Memory and compute |
| Context length | Attention cost |
| Output length | Sequential decoding cost |
| Concurrent users | KV cache memory |
| Precision | Hardware efficiency |

Inference optimization therefore directly affects commercial viability.
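
A back-of-the-envelope calculation shows how throughput translates into cost. The hourly price and token rates below are hypothetical placeholders, not measurements of any real system.

```python
def cost_per_million_tokens(accelerator_cost_per_hour, tokens_per_second):
    """Serving cost per million generated tokens for one accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return accelerator_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical: the same $2.00/hour accelerator before and after an
# optimization that raises aggregate throughput from 1000 to 2500 tokens/s.
baseline = cost_per_million_tokens(2.00, 1000)
optimized_cost = cost_per_million_tokens(2.00, 2500)
print(f"baseline ${baseline:.3f}, optimized ${optimized_cost:.3f} per million tokens")
```

Under these assumptions a 2.5x throughput gain cuts the cost per million tokens from roughly 56 cents to roughly 22 cents, which compounds quickly at large token volumes.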

The Central Tradeoff

Inference optimization balances:

$$ \text{quality} \leftrightarrow \text{latency} \leftrightarrow \text{throughput} \leftrightarrow \text{memory} \leftrightarrow \text{cost}. $$

Improving one dimension often worsens another.

Examples:

| Optimization | Possible downside |
| --- | --- |
| Lower precision | Accuracy degradation |
| Larger batches | Higher latency |
| Longer context | Higher memory use |
| Smaller models | Reduced capability |
| Aggressive caching | More memory consumption |

Inference engineering is therefore largely an optimization problem under hardware and economic constraints.

From Models to Systems

A trained neural network is only one part of a production AI system.

Real-world deployment also requires:

  • runtime compilers
  • schedulers
  • distributed caches
  • request routers
  • memory managers
  • observability systems
  • autoscaling infrastructure

As model size increased, inference optimization evolved from a minor deployment detail into one of the central engineering problems in modern AI systems.