Latent Diffusion

Early diffusion models operated directly in pixel space. A model generated images by iteratively denoising tensors such as

[B, 3, 512, 512]

where $B$ is the batch size and the remaining dimensions represent RGB images.

Although these models produced high-quality outputs, they were computationally expensive. Every denoising step required neural network computation over large high-resolution tensors. Training and inference therefore consumed large amounts of memory, compute, and time.

Latent diffusion addresses this problem by moving the diffusion process into a compressed latent representation. Instead of diffusing pixels, the model diffuses latent tensors produced by an autoencoder.

This idea became foundational in modern text-to-image systems such as Stable Diffusion.

Motivation for Latent Diffusion

Pixel-space diffusion is expensive for several reasons.

First, image tensors are large. A $512\times512$ RGB image contains:

$$ 3\times512\times512 = 786{,}432 $$

values.

Second, diffusion requires many denoising steps. Each step runs a large neural network over the full spatial resolution.

Third, much of the pixel information is locally redundant. Neighboring pixels often contain highly correlated structure.

The key observation is that many image details are compressible. Instead of modeling raw pixels directly, we can learn a lower-dimensional latent space that preserves semantic structure.

The diffusion process then operates on compressed latent representations:

| Space | Example shape |
|---|---|
| Pixel space | [B, 3, 512, 512] |
| Latent space | [B, 4, 64, 64] |

With these shapes, each sample shrinks from 786,432 values to 16,384, a factor of 48, so every denoising step processes far less data.

Autoencoder Compression

Latent diffusion uses an encoder-decoder architecture.

The encoder maps images into latent tensors:

$$ z_0 = \mathcal{E}(x_0). $$

The decoder reconstructs images from latent representations:

$$ \hat{x}_0 = \mathcal{D}(z_0). $$

Here:

| Symbol | Meaning |
|---|---|
| $\mathcal{E}$ | Encoder |
| $\mathcal{D}$ | Decoder |
| $x_0$ | Original image |
| $z_0$ | Latent representation |

The encoder compresses the image into a lower-dimensional representation while preserving visually important information.

The diffusion model operates entirely on $z_0$ and its noised versions, never on $x_0$.
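The encoder and decoder mappings can be illustrated with a toy convolutional pair. This is a minimal sketch, not the autoencoder of any particular system: `ToyEncoder` and `ToyDecoder` are hypothetical names, the layer widths are arbitrary, and small $64\times64$ inputs are used only to keep the example fast. Three stride-2 convolutions give the 8x spatial reduction discussed in this chapter.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Hypothetical encoder: three stride-2 convolutions give 8x downsampling."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, latent_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class ToyDecoder(nn.Module):
    """Hypothetical decoder: three stride-2 transposed convolutions undo the downsampling."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)


torch.manual_seed(0)
encoder, decoder = ToyEncoder(), ToyDecoder()
x0 = torch.randn(2, 3, 64, 64)   # small random "images" keep the sketch fast
z0 = encoder(x0)                 # z_0 = E(x_0), shape [2, 4, 8, 8]
x0_hat = decoder(z0)             # x_hat_0 = D(z_0), shape [2, 3, 64, 64]
print(z0.shape, x0_hat.shape)
```

The networks here are untrained, so the reconstruction is meaningless; the point is only the shape bookkeeping between pixel space and latent space.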

Variational Autoencoder Foundations

Most latent diffusion systems use a variational autoencoder-like structure.

The encoder predicts a latent distribution:

$$ q(z\mid x). $$

Typically:

$$ q(z\mid x) = \mathcal{N} \left( z; \mu(x), \sigma(x)^2 I \right). $$

The latent is sampled using the reparameterization trick:

$$ z = \mu(x) + \sigma(x)\epsilon, \qquad \epsilon\sim\mathcal{N}(0,I). $$

The decoder reconstructs the image:

$$ \hat{x} = \mathcal{D}(z). $$

Training minimizes:

$$ \mathcal{L} = \mathcal{L}_\text{recon} + \lambda D_{\mathrm{KL}} \left( q(z\mid x)\,\|\,p(z) \right). $$

The KL regularization encourages the latent space to remain approximately Gaussian and well-structured.
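The reparameterization trick and the KL term can be written in a few lines. The sketch below makes several assumptions that the text does not fix: the encoder outputs a log-variance rather than $\sigma$ directly, the prior is $p(z)=\mathcal{N}(0,I)$, the reconstruction loss is mean squared error, and the value of `kl_weight` (the $\lambda$ above) is purely illustrative.

```python
import torch
import torch.nn.functional as F


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2 I) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return kl.flatten(start_dim=1).sum(dim=1).mean()


def vae_loss(x, x_hat, mu, logvar, kl_weight=1e-6):
    """L = L_recon + lambda * KL, where kl_weight plays the role of lambda."""
    recon = F.mse_loss(x_hat, x)
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)


torch.manual_seed(0)
mu = torch.zeros(2, 4, 8, 8)
logvar = torch.zeros(2, 4, 8, 8)
z = reparameterize(mu, logvar)
print(z.shape, kl_to_standard_normal(mu, logvar).item())
```

When the predicted distribution already equals the prior (zero mean, unit variance), the KL term vanishes; any deviation in mean or variance makes it positive.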

Diffusion in Latent Space

Once the encoder-decoder system is trained, the diffusion process operates on latent tensors:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon. $$

This equation is identical to pixel-space diffusion. The only difference is that the variables now represent latent tensors rather than pixel tensors.
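The forward noising step is easy to check numerically. The following sketch assumes a linear $\beta$ schedule with 1000 steps, which is one common choice and not something the text prescribes; `make_alpha_bars` and `q_sample` are illustrative helper names, and the index `t` runs from 0 to 999.

```python
import torch


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (an assumed choice)."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)


def q_sample(z0, t, alpha_bars, noise=None):
    """Draw z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(z0)
    # Reshape [B] -> [B, 1, 1, 1] so the coefficients broadcast over C, H, W.
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise


torch.manual_seed(0)
alpha_bars = make_alpha_bars()
z0 = torch.randn(8, 4, 64, 64)          # stand-in for encoded latents
t = torch.randint(0, 1000, (8,))        # one timestep per batch element
z_t = q_sample(z0, t, alpha_bars)
print(z_t.shape)
```

Because the function only sees a tensor, the same code would work unchanged on pixel tensors; only the input shape differs.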

The reverse process learns:

$$ p_\theta(z_{t-1}\mid z_t,c), $$

where $c$ may represent text conditioning or other guidance information.

Sampling proceeds as:

$$ z_T \rightarrow z_{T-1} \rightarrow \cdots \rightarrow z_0. $$

The decoder then converts the final latent into an image:

$$ x_0 = \mathcal{D}(z_0). $$

Compression Ratios

Latent spaces are usually spatially compressed.

For example, an encoder may reduce a $512\times512$ image to a $64\times64$ latent representation.

This corresponds to an 8x reduction along each spatial dimension:

$$ 512 / 64 = 8. $$

The number of spatial positions therefore shrinks by a factor of:

$$ 8\times8 = 64. $$

If the latent tensor uses 4 channels instead of 3 RGB channels, then the total representation size becomes:

$$ 4\times64\times64 = 16{,}384. $$

Compare this with pixel space:

$$ 3\times512\times512 = 786{,}432. $$

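The latent representation is therefore 48 times smaller ($786{,}432 / 16{,}384 = 48$). The factor is 48 rather than 64 because the latent has 4 channels where the image has 3.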

| Representation | Number of values |
|---|---|
| Pixel image | 786,432 |
| Latent tensor | 16,384 |

This reduction makes diffusion much cheaper.
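The arithmetic in this section can be reproduced directly. The helper name `num_values` is just for illustration.

```python
def num_values(channels, height, width):
    """Number of scalar entries in one [C, H, W] tensor."""
    return channels * height * width


pixel_values = num_values(3, 512, 512)
latent_values = num_values(4, 64, 64)

spatial_factor = (512 // 64) ** 2                 # 8 * 8 = 64 times fewer spatial positions
overall_factor = pixel_values / latent_values     # accounts for 4 latent vs 3 RGB channels

print(pixel_values, latent_values, spatial_factor, overall_factor)
# 786432 16384 64 48.0
```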

Why Latent Diffusion Works

A good latent representation separates semantic information from pixel redundancy.

The encoder learns to preserve:

| Preserved information | Examples |
|---|---|
| Object structure | Faces, cars, buildings |
| Spatial layout | Relative positions |
| Global semantics | Scene identity |
| Important textures | Materials and edges |

The encoder discards:

| Reduced information | Examples |
|---|---|
| High-frequency noise | Pixel-level randomness |
| Redundant detail | Similar neighboring pixels |
| Compression-insensitive features | Imperceptible variation |

The diffusion model therefore focuses on modeling semantic structure rather than low-level pixel statistics.

Architecture of Latent Diffusion Models

A latent diffusion system typically contains three major components.

| Component | Purpose |
|---|---|
| Autoencoder | Compress and reconstruct images |
| Diffusion U-Net | Perform denoising in latent space |
| Conditioning model | Encode prompts or other guidance |

The workflow becomes:

$$ x_0 \rightarrow z_0 \rightarrow z_t \rightarrow \hat{z}_0 \rightarrow \hat{x}_0. $$

The diffusion model itself never directly processes full-resolution images.

Cross-Attention Conditioning

Modern latent diffusion systems condition generation on text.

Suppose a text encoder produces embeddings:

$$ c = \mathrm{TextEncoder}(y), $$

where $y$ is the prompt.

The denoising model predicts:

$$ \epsilon_\theta(z_t,t,c). $$

Cross-attention allows latent features to attend to text embeddings.

The attention mechanism computes:

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt{d}} \right)V. $$

genui{"math_block_widget_always_prefetch_v2":{"content":"\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V"}}

Here:

| Symbol | Meaning |
|---|---|
| $Q$ | Queries from latent features |
| $K$ | Keys from text embeddings |
| $V$ | Values from text embeddings |
| $d$ | Dimension of the query and key vectors |

Cross-attention enables the latent image representation to incorporate prompt semantics during denoising.
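A single-head version of this mechanism fits in a short module. This is a sketch under stated assumptions rather than the layer used by any specific model: the class name `CrossAttention`, the projection sizes, the residual connection, and returning the attention weights alongside the output are all illustrative choices. Real systems typically use multiple heads and add normalization layers.

```python
import math
import torch
from torch import nn


class CrossAttention(nn.Module):
    """Single-head cross-attention: latent features query the text embeddings."""

    def __init__(self, latent_dim, text_dim, attn_dim):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, latent_dim)

    def forward(self, latent_feats, text_emb):
        b, c, h, w = latent_feats.shape
        # Flatten the spatial grid: [B, C, H, W] -> [B, H*W, C].
        tokens = latent_feats.flatten(2).transpose(1, 2)
        q = self.to_q(tokens)       # [B, H*W, d]
        k = self.to_k(text_emb)     # [B, T, d]
        v = self.to_v(text_emb)     # [B, T, d]
        scores = q @ k.transpose(1, 2) / math.sqrt(q.shape[-1])   # [B, H*W, T]
        weights = scores.softmax(dim=-1)    # each latent position spreads weight over text tokens
        out = self.to_out(weights @ v)      # [B, H*W, C]
        # Restore the spatial layout and add a residual connection.
        return latent_feats + out.transpose(1, 2).reshape(b, c, h, w), weights


torch.manual_seed(0)
attn = CrossAttention(latent_dim=32, text_dim=768, attn_dim=64)
feats = torch.randn(2, 32, 16, 16)      # intermediate latent feature map
text = torch.randn(2, 77, 768)          # stand-in for text-encoder output
out, weights = attn(feats, text)
print(out.shape, weights.shape)
```

The output keeps the shape of the latent feature map, so the layer can be dropped between convolutional blocks. The attention weights have one row per latent position and one column per text token.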

Stable Diffusion Pipeline

A simplified latent diffusion pipeline looks like this:

  1. Encode text prompt into embeddings
  2. Sample latent Gaussian noise
  3. Iteratively denoise latent representation
  4. Decode latent tensor into image

Mathematically:

$$ c = \mathrm{TextEncoder}(y), $$

$$ z_T\sim\mathcal{N}(0,I), $$

$$ z_{t-1}\sim p_\theta(z_{t-1}\mid z_t,c), $$

$$ x_0 = \mathcal{D}(z_0). $$

The latent tensor evolves gradually from noise into structured semantic content.
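The four pipeline steps can be sketched end to end with placeholder components. Everything named `toy_*` below is a hypothetical stand-in (random text embeddings, a denoiser that always predicts zero noise, and an upsampling "decoder"), so the output is not a meaningful image. The sampler is plain DDPM ancestral sampling with variance $\beta_t$ and a 50-step linear schedule, both assumed for illustration; practical systems often use other samplers.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_latents(denoiser, text_emb, shape, betas):
    """DDPM ancestral sampling in latent space: z_T -> z_{T-1} -> ... -> z_0."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)                               # step 2: z_T ~ N(0, I)
    for step in reversed(range(len(betas))):             # step 3: iterative denoising
        t = torch.full((shape[0],), step, dtype=torch.long)
        eps = denoiser(z, t, text_emb)                   # predicted noise
        mean = (z - betas[step] / (1.0 - alpha_bars[step]).sqrt() * eps) / alphas[step].sqrt()
        # Add fresh noise at every step except the final one.
        noise = torch.randn_like(z) if step > 0 else torch.zeros_like(z)
        z = mean + betas[step].sqrt() * noise
    return z


# Hypothetical stand-ins so the sketch runs end to end.
def toy_text_encoder(prompts):
    return torch.randn(len(prompts), 77, 768)            # step 1: c = TextEncoder(y)


def toy_denoiser(z, t, text_emb):
    return torch.zeros_like(z)                           # untrained placeholder


def toy_decoder(z):
    # Keeps 3 of the 4 channels and upsamples by 8; not a real decoder.
    return F.interpolate(z[:, :3], scale_factor=8, mode="nearest")


torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 50)
c = toy_text_encoder(["a photo of a cat"])
z0 = sample_latents(toy_denoiser, c, (1, 4, 8, 8), betas)
x0 = toy_decoder(z0)                                     # step 4: x_0 = D(z_0)
print(z0.shape, x0.shape)
```

Note that the loop never touches a full-resolution tensor; the only pixel-space operation is the single decoder call at the end.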

Latent Tensor Shapes

In many latent diffusion systems, tensor shapes follow conventions such as:

| Tensor | Shape |
|---|---|
| Image | [B, 3, 512, 512] |
| Latent | [B, 4, 64, 64] |
| Text embeddings | [B, T, D] |

Example:

images = torch.randn(8, 3, 512, 512)

latents = torch.randn(8, 4, 64, 64)

text_embeddings = torch.randn(8, 77, 768)

The diffusion U-Net processes latent tensors rather than pixel tensors.

Training Procedure

Training latent diffusion involves multiple stages.

Stage 1: Train the Autoencoder

The encoder-decoder system learns to reconstruct images.

Losses may include:

| Loss | Purpose |
|---|---|
| Reconstruction loss | Pixel fidelity |
| Perceptual loss | Semantic similarity |
| KL loss | Latent regularization |
| Adversarial loss | Sharper outputs |

Stage 2: Freeze the Autoencoder

After training, the encoder and decoder are fixed.

Stage 3: Train the Diffusion Model

Images are encoded into latents:

$$ z_0=\mathcal{E}(x_0). $$

Noise is added:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon. $$

The diffusion model predicts noise:

$$ \epsilon_\theta(z_t,t,c). $$

The training objective becomes:

$$ \mathcal{L} = \mathbb{E} \left[ \left\| \epsilon - \epsilon_\theta(z_t,t,c) \right\|_2^2 \right]. $$

Latent Scaling

Latent representations may have arbitrary variance depending on encoder training.

To stabilize diffusion training, latent vectors are often rescaled:

$$ z'_0 = s z_0, $$

where $s$ is a constant scaling factor.

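For example, some systems rescale latents to roughly unit standard deviation so that the diffusion noise schedule behaves consistently. Stable Diffusion v1 multiplies its autoencoder latents by a fixed factor of 0.18215 for this purpose.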

Without proper scaling:

| Problem | Consequence |
|---|---|
| Latents too large | Noise becomes too weak |
| Latents too small | Signal disappears too quickly |
| Inconsistent variance | Training instability |
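One simple way to choose $s$ is to measure the standard deviation of a batch of encoded latents and take its reciprocal. The sketch below assumes this particular recipe and uses synthetic latents with a standard deviation of about 5 in place of real encoder outputs; `estimate_scale_factor` is an illustrative name.

```python
import torch


def estimate_scale_factor(latents):
    """Return s = 1 / std(z_0) so that s * z_0 has roughly unit standard deviation."""
    return 1.0 / latents.std()


torch.manual_seed(0)
raw_latents = 5.0 * torch.randn(64, 4, 8, 8)    # stand-in for encoder outputs with std ~ 5
s = estimate_scale_factor(raw_latents)
scaled = s * raw_latents                         # z'_0 = s * z_0, used for diffusion training

print(round(scaled.std().item(), 3))

# Before decoding, undo the scaling so the decoder sees latents at their original scale.
restored = scaled / s
```

In practice the factor would be estimated once on real data and then kept fixed, since the diffusion model is trained against that specific scale.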

Advantages of Latent Diffusion

Latent diffusion provides several major advantages.

| Advantage | Explanation |
|---|---|
| Lower compute cost | Smaller tensors |
| Lower memory usage | Reduced spatial resolution |
| Faster training | Less expensive denoising |
| Faster inference | Smaller U-Net operations |
| Better scalability | Larger images become feasible |
| Semantic modeling | Focus on high-level structure |

This efficiency enabled practical open-source large-scale text-to-image generation.

Limitations of Latent Diffusion

Latent compression also introduces limitations.

| Limitation | Cause |
|---|---|
| Loss of fine detail | Compression bottleneck |
| Reconstruction artifacts | Imperfect decoder |
| Semantic drift | Encoder information loss |
| Decoder dependence | Final quality limited by decoder |
| Compression bias | Latent space may favor certain textures |

The decoder becomes part of the generative pipeline. Even perfect latent denoising cannot exceed decoder reconstruction quality.

Pixel Diffusion Versus Latent Diffusion

| Property | Pixel Diffusion | Latent Diffusion |
|---|---|---|
| Operating space | Pixels | Compressed latents |
| Tensor size | Large | Small |
| Compute cost | High | Lower |
| Memory usage | High | Lower |
| Fine detail modeling | Strong | Decoder-limited |
| Scalability | Harder | Easier |
| Sampling speed | Slower | Faster |

Pixel diffusion may preserve fine textures more directly. Latent diffusion is usually more practical for large-scale systems.

PyTorch Example: Encoding and Diffusion

Suppose an autoencoder is defined as:

encoder = AutoencoderEncoder()
decoder = AutoencoderDecoder()

Encode images:

images = torch.randn(8, 3, 512, 512)

latents = encoder(images)

print(latents.shape)
# torch.Size([8, 4, 64, 64])

Add diffusion noise:

noise = torch.randn_like(latents)

t = torch.randint(0, T, (8,))

alpha_bar_t = extract(alpha_bars, t, latents.shape)

z_t = (
    torch.sqrt(alpha_bar_t) * latents
    +
    torch.sqrt(1 - alpha_bar_t) * noise
)

Predict noise:

pred_noise = unet(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)

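Decode a latent back into an image. This step is not part of the training loss; at sampling time the input would be the denoised latent produced by the reverse process rather than an encoded image:

generated_images = decoder(latents)  # at inference, pass the sampled z_0 here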

This structure is the core workflow of many modern latent diffusion systems.
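The snippets in this section rely on names such as `T`, `alpha_bars`, `extract`, and `unet` that are defined elsewhere. For readers who want something that runs as-is, here is a self-contained sketch of one training step. `ToyDenoiser` is a hypothetical stand-in for the U-Net (it mean-pools the text tokens instead of using cross-attention), the latents and text embeddings are random tensors rather than real encoder outputs, and the linear schedule and AdamW settings are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F


class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for the U-Net: predicts noise from (z_t, t, text embeddings)."""

    def __init__(self, latent_channels=4, text_dim=768, hidden=32, num_steps=1000):
        super().__init__()
        self.time_emb = nn.Embedding(num_steps, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.conv_in = nn.Conv2d(latent_channels, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, latent_channels, 3, padding=1)

    def forward(self, z_t, t, text_emb):
        # Mean-pool the text tokens; a real model would use cross-attention instead.
        cond = self.time_emb(t) + self.text_proj(text_emb.mean(dim=1))   # [B, hidden]
        h = self.conv_in(z_t) + cond[:, :, None, None]
        return self.conv_out(F.silu(h))


torch.manual_seed(0)
T = 1000
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

unet = ToyDenoiser(num_steps=T)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

latents = torch.randn(8, 4, 16, 16)          # stand-in for frozen-encoder outputs z_0
text_embeddings = torch.randn(8, 77, 768)    # stand-in for text-encoder outputs

# Forward process: noise each latent at its own random timestep.
noise = torch.randn_like(latents)
t = torch.randint(0, T, (8,))
a_bar = alpha_bars[t].view(-1, 1, 1, 1)      # broadcast [B] over [B, C, H, W]
z_t = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise

# Noise-prediction objective followed by one optimizer update.
loss = F.mse_loss(unet(z_t, t, text_embeddings), noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

Only the denoiser's parameters are handed to the optimizer, which mirrors the frozen-autoencoder setup described in the training procedure.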

Classifier-Free Guidance in Latent Space

Latent diffusion commonly uses classifier-free guidance.

The model predicts:

$$ \epsilon_\theta(z_t,t,c) $$

and

$$ \epsilon_\theta(z_t,t,\varnothing). $$

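Writing $\epsilon_\text{cond} = \epsilon_\theta(z_t,t,c)$ and $\epsilon_\text{uncond} = \epsilon_\theta(z_t,t,\varnothing)$, the guided prediction becomes:

$$ \hat{\epsilon} = \epsilon_\text{uncond} + s ( \epsilon_\text{cond} - \epsilon_\text{uncond} ). $$

The guidance scale $s$ controls prompt strength: $s = 0$ gives the unconditional prediction, $s = 1$ gives the plain conditional prediction, and $s > 1$ extrapolates beyond it. This $s$ is unrelated to the latent scaling factor introduced earlier.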

| Guidance scale | Effect |
|---|---|
| Small | More diversity |
| Moderate | Better prompt adherence |
| Large | Sharper but less diverse outputs |

Very large guidance scales may produce oversaturated or unstable images.
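The guidance formula is a one-line combination of two model outputs. In the sketch below the two predictions are random tensors standing in for the conditional and unconditional passes of a denoiser, and the scale of 7.5 is only an example value.

```python
import torch


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


torch.manual_seed(0)
eps_uncond = torch.randn(1, 4, 8, 8)    # stand-in for eps_theta(z_t, t, empty prompt)
eps_cond = torch.randn(1, 4, 8, 8)      # stand-in for eps_theta(z_t, t, c)

# scale = 0 recovers the unconditional prediction, scale = 1 the conditional one.
assert torch.allclose(classifier_free_guidance(eps_uncond, eps_cond, 0.0), eps_uncond)
assert torch.allclose(classifier_free_guidance(eps_uncond, eps_cond, 1.0), eps_cond, atol=1e-6)

guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
print(guided.shape)
```

Each sampling step therefore needs two denoiser evaluations, which are often batched together as a single forward pass over a doubled batch.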

Latent Diffusion Beyond Images

The latent diffusion idea generalizes beyond image generation.

Applications include:

| Domain | Latent representation |
|---|---|
| Video generation | Spatiotemporal latent tensors |
| Audio synthesis | Spectrogram or audio latents |
| 3D generation | Geometry or radiance field latents |
| Motion generation | Pose or trajectory latents |
| Molecular generation | Graph or embedding latents |

The key principle remains unchanged:

  1. Learn a compressed representation
  2. Diffuse in latent space
  3. Decode into the original domain

Why Latent Diffusion Became Dominant

Latent diffusion balanced three competing requirements:

| Requirement | Challenge |
|---|---|
| High visual quality | Requires expressive models |
| Large image resolution | Requires large tensors |
| Practical compute cost | Requires efficiency |

Pixel-space diffusion achieved quality but was expensive. GANs were fast but often unstable. Autoregressive image models scaled poorly with resolution.

Latent diffusion provided a practical compromise:

| Feature | Result |
|---|---|
| Compression | Lower compute |
| Diffusion training | Stable optimization |
| Attention conditioning | Strong prompt control |
| U-Net denoising | High-quality structure generation |

This combination made large-scale open text-to-image systems feasible.

Summary

Latent diffusion performs diffusion in a compressed latent representation rather than directly in pixel space.

An encoder maps images into latent tensors:

$$ z_0=\mathcal{E}(x_0). $$

The diffusion process operates on these latent representations:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon. $$

A denoising network learns the reverse process in latent space. After denoising, a decoder reconstructs the final image.

Latent diffusion greatly reduces computational cost while preserving semantic structure. This architecture became foundational in modern text-to-image generation systems because it combines efficient compression, stable diffusion training, and flexible conditioning mechanisms.