Training Foundation Models

Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks. Examples include large language models, multimodal transformers, vision foundation models, audio-language systems, and general-purpose embedding models.

Training these systems requires coordinated advances in:

  • optimization
  • distributed systems
  • data engineering
  • numerical stability
  • infrastructure reliability
  • hardware utilization

Foundation model training differs from ordinary deep learning mainly in scale. The underlying mathematical principles remain similar, but the operational constraints become much more severe.

A small model may train on one GPU for hours. A foundation model may require thousands of accelerators running continuously for weeks or months.

What Defines a Foundation Model

A foundation model typically has several properties:

| Property | Description |
| --- | --- |
| Large parameter count | Millions to trillions of parameters |
| Broad pretraining data | Diverse internet-scale datasets |
| General-purpose representations | Useful across many tasks |
| Transferability | Fine-tuned or prompted for downstream use |
| Emergent capabilities | Behaviors not explicitly supervised |

Most modern foundation models are transformer-based because transformers scale efficiently with data and compute.

Common foundation model categories include:

| Type | Example tasks |
| --- | --- |
| Language models | Text generation, reasoning, coding |
| Vision models | Classification, segmentation, detection |
| Multimodal models | Vision-language understanding |
| Audio models | Speech recognition, synthesis |
| Embedding models | Retrieval and semantic search |

Scaling Laws

One of the central discoveries in modern deep learning is that model performance often follows predictable scaling behavior.

Performance depends on:

  • parameter count
  • training data size
  • compute budget

Empirical scaling laws often resemble power-law relationships:

$$ L(N) = A N^{-\alpha} + C, $$

where:

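| Symbol | Meaning |
| --- | --- |
| $L(N)$ | Loss |
| $N$ | Scale variable (parameter count, training data size, or compute) |
| $A, C$ | Fitted constants; $C$ is the loss approached as $N \to \infty$ |
| $\alpha$ | Scaling exponent |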

Increasing model size, data, or compute generally improves performance, though with diminishing returns.
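
The diminishing returns implied by the power law can be seen by evaluating it at a few scales. The following is a minimal sketch; the constants `A`, `ALPHA`, and `C` are hypothetical values chosen only for illustration and are not fitted to any real model.

```python
def power_law_loss(n: float, a: float, alpha: float, c: float) -> float:
    """Evaluate L(N) = A * N**(-alpha) + C."""
    return a * n ** (-alpha) + c

# Hypothetical constants, for illustration only.
A, ALPHA, C = 10.0, 0.1, 1.5

previous = None
for n in (1e6, 1e7, 1e8, 1e9, 1e10):
    loss = power_law_loss(n, A, ALPHA, C)
    # Each tenfold increase in N buys a smaller absolute improvement.
    gain = "" if previous is None else f"  (improvement {previous - loss:.3f})"
    print(f"N = {n:.0e}  loss = {loss:.3f}{gain}")
    previous = loss
```

Each tenfold increase in the scale variable lowers the loss by less than the previous one, and the loss never drops below `C`.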

Scaling laws influenced modern foundation model design because they suggested that larger models trained on more data would continue improving predictably.

This shifted research from hand-designed architectures toward large-scale optimization and infrastructure.

Compute-Optimal Training

Training budgets are finite. A key question becomes how to allocate compute between:

  • larger models
  • more training tokens
  • longer training duration

Suppose:

| Variable | Meaning |
| --- | --- |
| $P$ | Parameter count |
| $T$ | Training tokens |
| $C$ | Total compute |

Approximate transformer training cost scales as:

$$ C \propto P T. $$

If the model is too large for the available data, parameters are undertrained. If the dataset is too large for the model, capacity may be insufficient.

Modern training recipes attempt to balance model size and data volume to maximize performance for a fixed compute budget.

This idea is often called compute-optimal training.
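
As a rough illustration of the allocation problem, the sketch below splits a fixed FLOP budget between parameters and tokens. It relies on two assumptions that are not stated in the text: the common rule of thumb that a dense transformer needs about 6 FLOPs per parameter per token, and a hypothetical target ratio of tokens per parameter. The function name `allocate` is illustrative.

```python
import math

# Assumption: ~6 FLOPs per parameter per training token (forward plus backward).
FLOPS_PER_PARAM_TOKEN = 6.0

def allocate(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens-per-parameter ratio.

    Solves C = 6 * P * T with T = r * P, giving P = sqrt(C / (6 * r)).
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Hypothetical budget and ratio, for illustration only.
params, tokens = allocate(1e23, tokens_per_param=20.0)
print(f"parameters ~ {params:.2e}, tokens ~ {tokens:.2e}")
```

Changing the ratio moves the same budget toward a larger, less-trained model or a smaller model trained on more tokens; which ratio is best is an empirical question.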

Token-Based Training

Large language models are usually trained in terms of tokens rather than epochs.

A token is a subword unit produced by tokenization.

Example:

"foundation models are powerful"

might tokenize into:

["foundation", " models", " are", " powerful"]

Training progress is often measured by the number of tokens processed rather than by the number of epochs.

For example:

| Model | Approximate training tokens |
| --- | --- |
| Small language model | Billions |
| Mid-scale LLM | Hundreds of billions |
| Frontier-scale LLM | Trillions |

Unlike classical datasets, internet-scale corpora may not have clean epoch boundaries. Data pipelines therefore stream tokens continuously.

Data Pipelines

Foundation models require enormous datasets.

Data engineering becomes a major component of the system.

Typical stages include:

| Stage | Purpose |
| --- | --- |
| Crawling | Collect raw data |
| Deduplication | Remove repeated content |
| Filtering | Remove low-quality data |
| Language identification | Separate languages |
| Safety filtering | Remove harmful content |
| Tokenization | Convert text to token IDs |
| Sharding | Split data across workers |

Data quality strongly affects model quality.

A smaller high-quality dataset may outperform a much larger noisy dataset.
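
Two of the stages above, deduplication and filtering, can be sketched in a few lines. This is a toy version under stated assumptions: it removes only exact duplicates after whitespace and case normalization, and the word-count threshold `min_words` is a hypothetical stand-in for a real quality filter. Production pipelines typically also use near-duplicate detection and learned quality classifiers.

```python
import hashlib
from typing import Iterable, Iterator

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return " ".join(text.lower().split())

def clean_stream(docs: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield documents that pass a length filter and have not been seen before."""
    seen: set[bytes] = set()
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:  # toy quality filter (hypothetical threshold)
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).digest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        yield doc

raw = [
    "Foundation models are large.",
    "foundation   models are LARGE.",  # duplicate after normalization
    "ok",                              # too short
    "Data quality matters a lot.",
]
cleaned = list(clean_stream(raw))
print(cleaned)
```

Because the function is a generator, it processes documents one at a time and keeps only the set of hashes in memory.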

Streaming Datasets

Large datasets are rarely stored as one monolithic file.

Instead, they are sharded into many files:

shard_00000.bin
shard_00001.bin
shard_00002.bin
...

Workers stream shards in parallel.

Advantages include:

| Benefit | Reason |
| --- | --- |
| Parallel reading | Multiple workers load simultaneously |
| Fault tolerance | Corruption affects only one shard |
| Distributed access | Nodes read different shards |
| Incremental updates | New shards can be added |

Streaming avoids loading the entire dataset into memory.
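
One simple way to divide shards among workers is round-robin assignment, sketched below. The assignment scheme, the function names, and the in-memory `FAKE_STORAGE` are all assumptions made for illustration; a real loader would read binary shards from disk or object storage and would usually shuffle as well.

```python
from typing import Callable, Iterable, Iterator

def shards_for_worker(num_shards: int, worker: int, num_workers: int) -> list[str]:
    """Round-robin assignment: worker w reads shards w, w + num_workers, ..."""
    return [f"shard_{i:05d}.bin" for i in range(worker, num_shards, num_workers)]

def stream_tokens(paths: Iterable[str],
                  read_shard: Callable[[str], Iterable[int]]) -> Iterator[int]:
    """Yield tokens shard by shard, so only one shard is held at a time."""
    for path in paths:
        yield from read_shard(path)

# Hypothetical in-memory stand-in for shard files.
FAKE_STORAGE = {f"shard_{i:05d}.bin": [i * 10, i * 10 + 1] for i in range(6)}

def fake_read_shard(path: str) -> list[int]:
    return FAKE_STORAGE[path]

worker_1 = list(stream_tokens(shards_for_worker(6, worker=1, num_workers=2),
                              fake_read_shard))
print(worker_1)
```

Every shard is assigned to exactly one worker, so the workers together cover the dataset without overlap.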

Transformer Training

Most foundation models are transformers.

A simplified decoder-only transformer computes:

$$ x \rightarrow \text{Embedding} \rightarrow \text{Transformer Blocks} \rightarrow \text{Output Projection}. $$

Each transformer block contains:

  • self-attention
  • feedforward networks
  • residual connections
  • normalization layers

Training is autoregressive.

Given tokens:

$$ [t_1, t_2, \ldots, t_n], $$

at each position $i$ the model predicts

$$ t_{i+1} $$

from the tokens $t_1, \ldots, t_i$.

The objective is usually next-token prediction:

$$ L = - \sum_{i=1}^{n} \log p_\theta(t_i \mid t_{<i}). $$
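
The objective can be computed directly from per-position logits. The sketch below uses plain Python lists to stay self-contained; the function names are illustrative, and a real training loop would use a tensor library and usually average the loss over tokens rather than sum it.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits_per_position: list[list[float]], targets: list[int]) -> float:
    """Sum over positions of -log p(target token | preceding tokens)."""
    return -sum(log_softmax(logits)[t]
                for logits, t in zip(logits_per_position, targets))

# Toy example: vocabulary of 4 tokens, 3 positions, uniform predictions.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = next_token_loss(uniform, targets=[2, 0, 1])
print(loss, 3 * math.log(4))
```

A model that assigns uniform probability over a vocabulary of size V incurs a loss of log V per token, which is a useful sanity check at initialization.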

Mixed Precision Training

Foundation model training almost always uses mixed precision.

Instead of float32 everywhere, systems use:

| Format | Common use |
| --- | --- |
| fp16 | Earlier mixed precision systems |
| bf16 | Modern large-scale training |
| fp32 | Master weights or sensitive operations |

Mixed precision reduces:

  • memory usage
  • communication volume
  • training time

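Bfloat16 became especially important because it keeps the 8-bit exponent of float32, giving it the same dynamic range at the cost of mantissa precision. This avoids most of the overflow and underflow that affect fp16 and improves numerical stability.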

A typical configuration:

| Tensor type | Precision |
| --- | --- |
| Activations | bf16 |
| Gradients | bf16 |
| Matrix multiplications | bf16 |
| Optimizer state | fp32 |
| Master weights | fp32 |
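
The range difference between fp16 and bf16 can be demonstrated with the standard library alone. In this sketch, bf16 is emulated by truncating a float32 to its top 16 bits; this is a simplification, since real hardware usually rounds to nearest. The helper names `to_bf16` and `to_fp16` are illustrative.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; out-of-range values become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

for value in (70000.0, 1e-10, 1.001):
    print(f"{value!r}: fp16 -> {to_fp16(value)!r}, bf16 -> {to_bf16(value)!r}")
```

Large and tiny values survive in bf16 but overflow or flush to zero in fp16, while a value close to 1 is represented more precisely in fp16. This is the trade-off between exponent bits and mantissa bits.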

Parallelism Strategies

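Plain data parallelism replicates the full model on every device, so it is not enough on its own once the model state no longer fits in a single accelerator's memory.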

Modern systems combine multiple parallelism dimensions.

| Parallelism | Purpose |
| --- | --- |
| Data parallelism | Scale training throughput |
| Tensor parallelism | Split large matrix operations |
| Pipeline parallelism | Split sequential layers |
| Sharded optimizers | Reduce replicated optimizer state |
| Expert parallelism | Route sparse experts across devices |

Large training systems often organize GPUs into groups.

Example:

| Parallelism dimension | Size |
| --- | --- |
| Data parallel | 16 |
| Tensor parallel | 8 |
| Pipeline parallel | 4 |

Total GPUs:

$$ 16 \times 8 \times 4 = 512. $$

Each GPU participates in several communication groups simultaneously.
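
To make the group structure concrete, the sketch below maps each of the 512 global ranks to a coordinate in the three parallelism dimensions. The ordering, with tensor-parallel ranks adjacent and data parallel outermost, is an assumption chosen for illustration; real frameworks let this layout be configured.

```python
DP, TP, PP = 16, 8, 4  # sizes from the example above

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp), with tensor-parallel ranks adjacent."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

def tensor_group(rank: int) -> list[int]:
    """Ranks that share this rank's data- and pipeline-parallel coordinates."""
    base = rank - rank % TP
    return list(range(base, base + TP))

print(rank_to_coords(0), rank_to_coords(511))
print(tensor_group(10))
```

Under this layout each tensor-parallel group is a block of 8 consecutive ranks, which would place the most communication-heavy group on nearby devices if ranks are assigned node by node.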

Memory Optimization

Memory becomes a dominant constraint.

Main memory consumers include:

| Component | Scaling behavior |
| --- | --- |
| Parameters | Proportional to model size |
| Optimizer state | Often 2 to 8 times parameter size |
| Gradients | Similar to parameter size |
| Activations | Depend on batch and sequence length |

Techniques used to reduce memory include:

| Technique | Purpose |
| --- | --- |
| Activation checkpointing | Trade compute for memory |
| Gradient accumulation | Simulate large batches with small micro-batches |
| ZeRO/FSDP | Shard optimizer state, gradients, and parameters |
| Quantization | Lower precision storage |
| Offloading | Move state to CPU or NVMe |

Without these methods, large models may not fit even across many GPUs.
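
A back-of-the-envelope estimate shows why. The sketch below counts bytes of model state per parameter, assuming the mixed precision layout described earlier plus an Adam-style optimizer with two fp32 moments per parameter; the optimizer choice and the 7-billion-parameter example are assumptions, and activations and temporary buffers are ignored.

```python
# Bytes per parameter under the stated assumptions.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def state_gib(num_params: float, shards: int = 1) -> float:
    """Model-state memory per device in GiB, ignoring activations and buffers."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / shards / 2**30

# Hypothetical 7-billion-parameter model.
print(f"unsharded:       {state_gib(7e9):.1f} GiB")
print(f"sharded 64 ways: {state_gib(7e9, shards=64):.2f} GiB per device")
```

At 16 bytes per parameter, the unsharded state of even this mid-sized model exceeds the memory of a single accelerator, before any activations are stored.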

Throughput Optimization

Training cost is dominated by accelerator time.

Suppose a training run uses:

  • 2,000 GPUs
  • $2 per GPU-hour
  • 30 days

Cost in dollars, multiplying GPUs by hours per day, days, and price per GPU-hour:

$$ 2000 \times 24 \times 30 \times 2 = 2{,}880{,}000. $$

Even small inefficiencies become expensive.

Important throughput metrics include:

| Metric | Meaning |
| --- | --- |
| Tokens per second | Language-model throughput |
| FLOPs utilization | Fraction of peak compute used |
| GPU utilization | Accelerator activity |
| Communication overhead | Time spent synchronizing |
| Data pipeline latency | Waiting for input data |
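
FLOPs utilization can be estimated from measured token throughput. The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer; the model size, throughput, device count, and peak FLOP rate are hypothetical numbers for illustration.

```python
def model_flops_utilization(tokens_per_second: float,
                            num_params: float,
                            num_devices: int,
                            peak_flops_per_device: float) -> float:
    """Fraction of aggregate peak FLOP/s spent on model computation."""
    # Assumption: ~6 FLOPs per parameter per token (forward plus backward).
    achieved = 6.0 * num_params * tokens_per_second
    return achieved / (num_devices * peak_flops_per_device)

# Hypothetical run: 7e9 parameters, 1e6 tokens/s, 256 devices at 4e14 FLOP/s peak.
mfu = model_flops_utilization(1e6, 7e9, 256, 4e14)
print(f"utilization ~ {mfu:.1%}")
```

The gap between this number and 100% is where communication overhead, data pipeline stalls, and kernel inefficiency show up.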

High-performance training systems carefully overlap:

  • communication
  • data loading
  • computation
  • checkpointing

Learning Rate Schedules

Foundation models often use carefully tuned learning rate schedules.

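A common pattern:

  1. linear warmup from a small value to the peak learning rate
  2. an optional plateau at the peak
  3. gradual decay, often cosine, toward a minimum learning rate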

Warmup stabilizes early optimization.

Example cosine schedule:

$$ \eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos\left(\frac{\pi t}{T}\right) \right), $$

where $t$ is the number of steps since decay began and $T$ is the total number of decay steps.

genui{"math_block_widget_always_prefetch_v2":{"content":"\eta_t=\eta_{\min}+\frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\left(\frac{\pi t}{T}\right)\right)"}}

Warmup is especially important for large batch training because early gradients can be unstable.
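
Warmup and cosine decay combine into a single schedule function. The following is a minimal sketch assuming linear warmup and no plateau; the function name and the example hyperparameters are illustrative, and it requires `total_steps > warmup_steps`.

```python
import math

def learning_rate(step: int, warmup_steps: int, total_steps: int,
                  lr_max: float, lr_min: float) -> float:
    """Linear warmup to lr_max, then cosine decay to lr_min at total_steps."""
    if step < warmup_steps:
        # Linear ramp that reaches lr_max on the last warmup step.
        return lr_max * (step + 1) / warmup_steps
    decay_steps = total_steps - warmup_steps
    t = min(step - warmup_steps, decay_steps)  # hold lr_min after the schedule ends
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / decay_steps))

# Hypothetical hyperparameters, for illustration only.
for step in (0, 99, 100, 550, 1000):
    print(step, learning_rate(step, 100, 1000, lr_max=3e-4, lr_min=3e-5))
```

The decay branch is the cosine formula above with `t` counted from the end of warmup.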

Gradient Stability

Large models are sensitive to numerical instability.

Common problems include:

| Problem | Symptom |
| --- | --- |
| Exploding gradients | Loss divergence |
| Vanishing gradients | Slow learning |
| Overflow | NaNs |
| Underflow | Zero gradients |
| Activation spikes | Instability |

Stabilization techniques include:

| Technique | Purpose |
| --- | --- |
| Gradient clipping | Limit update magnitude |
| Normalization layers | Stabilize activations |
| Residual connections | Improve gradient flow |
| Careful initialization | Prevent early divergence |
| Adaptive optimizers | Stabilize updates |

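Gradient clipping by norm often uses:

$$ g \leftarrow g \cdot \min\left( 1, \frac{\tau}{\|g\|_2} \right), $$

where $\tau$ is the clipping threshold and $\|g\|_2$ is the Euclidean norm of the gradient, usually computed over all parameters together.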
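
The clipping rule can be written out for a gradient stored as several parameter tensors. This sketch uses nested Python lists to stay self-contained, and the function name is illustrative; in practice a tensor library's built-in clipping utility would be used instead.

```python
import math

def clip_by_global_norm(grads: list[list[float]], tau: float) -> list[list[float]]:
    """Rescale all gradients by min(1, tau / global_norm)."""
    norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Guard against division by zero when every gradient is zero.
    scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads]

grads = [[3.0], [4.0]]  # global norm is 5
print(clip_by_global_norm(grads, tau=1.0))   # rescaled to norm 1
print(clip_by_global_norm(grads, tau=10.0))  # below threshold, unchanged
```

Because every tensor is scaled by the same factor, the direction of the update is preserved and only its magnitude is limited.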

Evaluation During Training

Foundation model evaluation is expensive.

Evaluations may include:

| Evaluation type | Example |
| --- | --- |
| Validation perplexity | Language modeling |
| Benchmark suites | Reasoning and QA |
| Human preference evaluation | Alignment |
| Safety testing | Harmful outputs |
| Retrieval quality | Embedding models |

Frequent evaluation slows training, but infrequent evaluation risks wasting compute on bad runs.

Many systems therefore run lightweight validation frequently and expensive benchmark suites less often.
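
Validation perplexity is one of the cheapest of these checks, since it needs only the per-token log-probabilities the model already produces. A minimal sketch, assuming natural-log probabilities and an illustrative function name:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Exponential of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every held-out token.
print(perplexity([math.log(0.25)] * 10))
```

A perplexity of k means the model is, on average, as uncertain as a uniform choice among k tokens.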

Alignment and Post-Training

Pretraining produces a general-purpose model, but not necessarily a helpful or safe assistant.

Modern systems often add:

| Stage | Purpose |
| --- | --- |
| Supervised fine-tuning | Teach instruction following |
| Preference optimization | Align outputs with preferences |
| RLHF (reinforcement learning from human feedback) | Optimize against a reward signal learned from human feedback |
| Constitutional methods | Rule-guided alignment |
| Safety tuning | Reduce harmful behavior |

The final model is therefore the result of several training stages, not just one pretraining run.

Infrastructure Reliability

Foundation model training depends heavily on infrastructure engineering.

Key requirements include:

| Requirement | Reason |
| --- | --- |
| Fault tolerance | Failures are inevitable |
| Distributed checkpointing | Large model state |
| Monitoring systems | Detect hangs and instability |
| Cluster scheduling | Coordinate resources |
| High-bandwidth networking | Synchronization efficiency |
| Storage throughput | Massive datasets and checkpoints |

At large scale, infrastructure limitations often dominate algorithmic limitations.
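
One small building block of fault tolerance is a checkpoint write that cannot leave a half-written file behind if the process dies mid-save. The sketch below writes to a temporary file and then renames it into place. It is a simplified illustration: JSON stands in for real model state, the function names are hypothetical, and production systems write sharded binary checkpoints across many nodes.

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temporary file in the same directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach storage before renaming
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or the default when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)
```

On restart, a job calls `load_checkpoint` and resumes from the last completed save; a crash during `save_checkpoint` leaves the previous checkpoint intact.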

Environmental and Economic Cost

Foundation model training consumes substantial energy and compute resources.

Costs include:

  • accelerator manufacturing
  • electricity
  • cooling
  • datacenter infrastructure
  • engineering labor

Efficiency improvements therefore matter economically and environmentally.

Important efficiency directions include:

| Direction | Goal |
| --- | --- |
| Better optimizers | Fewer training steps |
| Sparse models | Lower compute |
| Quantization | Lower memory and energy |
| Smaller high-quality datasets | Better data efficiency |
| Efficient architectures | Higher throughput |

Emergent Behavior

As models scale, new capabilities sometimes appear unexpectedly.

Examples may include:

  • in-context learning
  • chain-of-thought reasoning
  • tool use
  • multilingual transfer
  • coding ability

These behaviors are called emergent because they were not explicitly programmed.

However, when measured carefully, for example with continuous metrics rather than all-or-nothing scores, emergence is often gradual rather than sudden.

Understanding why scaling produces these capabilities remains an active research area.

The Central Constraint

Foundation model training is fundamentally constrained by:

$$ \text{compute} \times \text{data} \times \text{memory} \times \text{communication}. $$

Every design decision affects one or more of these factors.

For example:

| Decision | Tradeoff |
| --- | --- |
| Larger model | Better capacity, higher memory |
| Longer context | Better long-range reasoning, more compute and memory |
| More GPUs | More throughput, more communication |
| Larger batches | Better hardware utilization, harder optimization |

Training systems therefore balance mathematical efficiency with systems efficiency.

From Research to Infrastructure

Early deep learning research focused primarily on architecture design. Foundation model training shifted much of the challenge toward systems engineering.

Modern training requires expertise in:

  • optimization theory
  • distributed systems
  • networking
  • compiler systems
  • numerical methods
  • storage infrastructure
  • data engineering

As models scale, the boundary between machine learning research and large-scale systems engineering becomes increasingly blurred.

A modern foundation model is simultaneously:

  • a statistical learning system
  • a distributed computation graph
  • a large-scale numerical optimization problem
  • a data processing pipeline
  • a fault-tolerant infrastructure system