Diffusion Transformers

Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships.

However, transformers became increasingly attractive because they scale effectively with model size, support flexible conditioning, and capture long-range dependencies more naturally than convolutional architectures.

Diffusion Transformers, often abbreviated DiTs, replace or augment convolutional U-Nets with transformer-based architectures. Instead of treating images as grids processed by convolutions, DiTs treat latent representations as token sequences processed by self-attention.

This transition mirrors the broader shift from convolutional models to transformers in computer vision and language modeling.

From U-Nets to Transformers

A standard diffusion U-Net processes tensors such as:

[B, C, H, W]

using convolutional layers, residual blocks, and attention modules.

A diffusion transformer instead converts latent tensors into tokens:

$$ z \in \mathbb{R}^{B \times N \times D} $$

where:

| Symbol | Meaning |
| --- | --- |
| $B$ | Batch size |
| $N$ | Number of tokens |
| $D$ | Embedding dimension |

The transformer then processes these tokens using self-attention and feedforward layers.

This changes the denoising problem from spatial convolution to sequence modeling.

Patch Tokenization

Diffusion transformers usually operate on latent patches rather than individual pixels.

Suppose a latent tensor has shape:

[B, C, H, W]

For example:

[B, 4, 64, 64]

The tensor is divided into patches.

If the patch size is:

$$ P \times P, $$

then the number of patches becomes:

$$ N = \frac{H}{P} \cdot \frac{W}{P}. $$

Each patch is flattened and projected into an embedding vector.

For example:

| Latent size | Patch size | Number of tokens |
| --- | --- | --- |
| $64\times64$ | $2\times2$ | 1024 |
| $64\times64$ | $4\times4$ | 256 |
| $32\times32$ | $2\times2$ | 256 |

Patch embeddings transform spatial tensors into transformer token sequences.
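
The splitting and flattening step can be sketched with plain tensor reshapes. This is a minimal illustration, not a reference implementation: the function name `patchify` is hypothetical, and the learned projection to the embedding dimension is left out so that only the patch arithmetic is visible.

```python
import torch

def patchify(latents: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split [B, C, H, W] latents into flattened patches [B, N, C * P * P]."""
    B, C, H, W = latents.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by the patch size"
    # Expose the patch grid: [B, C, H/P, P, W/P, P]
    x = latents.reshape(B, C, H // P, P, W // P, P)
    # Group each patch's values together: [B, H/P, W/P, C, P, P]
    x = x.permute(0, 2, 4, 1, 3, 5)
    # One row per patch, in row-major grid order: [B, N, C * P * P]
    return x.reshape(B, (H // P) * (W // P), C * P * P)

latents = torch.randn(2, 4, 64, 64)
print(patchify(latents, 2).shape)  # torch.Size([2, 1024, 16])
print(patchify(latents, 4).shape)  # torch.Size([2, 256, 64])
```

The token counts match the first two rows of the table above. A linear layer applied to the last axis would then map each flattened patch to a $D$-dimensional embedding.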

Transformer Denoising Objective

The diffusion objective remains unchanged.

Given noisy latent:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, $$

the transformer predicts:

$$ \epsilon_\theta(z_t,t,c). $$

The loss is still:

$$ \mathcal{L} = \mathbb{E} \left[ \| \epsilon - \epsilon_\theta(z_t,t,c) \|_2^2 \right]. $$

The main difference lies in the network architecture, not in the diffusion mathematics.
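
As a rough sketch of the forward noising equation above, the following computes $z_t$ from $z_0$ for a batch of timesteps. The linear beta schedule, the value of `T`, and the helper name `add_noise` are assumptions made for illustration; the text does not specify a schedule.

```python
import torch

def add_noise(z0, noise, alpha_bar, t):
    """z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise."""
    # Gather one coefficient per sample and broadcast over [C, H, W].
    signal_scale = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    noise_scale = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return signal_scale * z0 + noise_scale * noise

# Hypothetical linear beta schedule (an assumption, not from the text).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

z0 = torch.randn(8, 4, 64, 64)        # clean latents
noise = torch.randn_like(z0)          # epsilon, the regression target
t = torch.randint(0, T, (8,))         # one timestep per sample
z_t = add_noise(z0, noise, alpha_bar, t)
```

The same `noise` tensor is what the network output is compared against in the loss, regardless of whether the network is a U-Net or a transformer.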

Self-Attention in Diffusion

Transformers process tokens using self-attention.

The attention mechanism computes:

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt{d}} \right)V. $$

genui{"math_block_widget_always_prefetch_v2":{"content":"\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V"}}

For diffusion transformers:

| Tensor | Meaning |
| --- | --- |
| $Q$ | Query embeddings |
| $K$ | Key embeddings |
| $V$ | Value embeddings |

Every latent token can attend to every other token.

This provides:

| Benefit | Explanation |
| --- | --- |
| Global receptive field | Long-range dependencies modeled directly |
| Flexible conditioning | Easy integration of text tokens |
| Better scaling | Transformer scaling laws often favorable |
| Unified architecture | Similarity to language models |

Unlike convolutions, attention does not rely on local neighborhoods.
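
The attention formula above can be written out directly. This is a single-head sketch with no learned projections; in a real block $Q$, $K$, and $V$ would be separate linear projections of the tokens, whereas here the raw tokens are reused for all three to keep the example short.

```python
import math
import torch

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for tensors shaped [B, N, d]."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # [B, N, N] pairwise scores
    weights = scores.softmax(dim=-1)                 # each row sums to 1
    return weights @ v, weights

tokens = torch.randn(2, 1024, 64)   # e.g. 1024 latent patch tokens
out, weights = attention(tokens, tokens, tokens)
print(out.shape)      # torch.Size([2, 1024, 64])
print(weights.shape)  # torch.Size([2, 1024, 1024])
```

The `[N, N]` weight matrix is what gives every token a view of every other token, and it is also the source of the quadratic cost.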

Positional Embeddings

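Transformers require positional information because self-attention alone is permutation equivariant: reordering the input tokens simply reorders the outputs in the same way, so the layer itself carries no notion of where a token sits.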

Diffusion transformers add positional embeddings:

$$ z'_i = z_i + p_i. $$

The positional vector $p_i$ encodes spatial location.

Common methods include:

| Method | Description |
| --- | --- |
| Learned embeddings | Trainable position vectors |
| Sinusoidal embeddings | Fixed Fourier-like encoding |
| Rotary embeddings | Relative rotational encoding |
| 2D positional embeddings | Separate height and width structure |

Without positional embeddings, the model would not know where image patches belong spatially.
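
One way to combine the sinusoidal and 2D rows of the table is a fixed 2D sine-cosine embedding, sketched below. The function names and the choice to concatenate sine and cosine halves (rather than interleave them) are assumptions for illustration; half of the channels encode the row index and half the column index.

```python
import torch

def sincos_1d(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of integer positions -> [len(positions), dim]."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions.float()[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=1)

def sincos_2d(grid_h: int, grid_w: int, dim: int) -> torch.Tensor:
    """Position table for a row-major patch grid -> [grid_h * grid_w, dim]."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    rows = torch.arange(grid_h).repeat_interleave(grid_w)  # row index of each token
    cols = torch.arange(grid_w).repeat(grid_h)             # column index of each token
    return torch.cat([sincos_1d(rows, dim // 2), sincos_1d(cols, dim // 2)], dim=1)

tokens = torch.randn(8, 1024, 768)   # 32 x 32 grid of patch tokens
pos = sincos_2d(32, 32, 768)         # [1024, 768]
tokens = tokens + pos                # z'_i = z_i + p_i, broadcast over the batch
```

The row-major ordering here assumes tokens were produced by flattening the patch grid row by row, as a strided convolution followed by a flatten would do.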

Timestep Conditioning

The denoising network must know the diffusion timestep.

A timestep embedding:

$$ e_t = \mathrm{Embed}(t) $$

is injected into transformer blocks.

Sinusoidal timestep embeddings are common:

$$ \mathrm{PE}(t,2i) = \sin \left( \frac{t}{10000^{2i/d}} \right), $$

$$ \mathrm{PE}(t,2i+1) = \cos \left( \frac{t}{10000^{2i/d}} \right). $$

These embeddings are passed through learned projection layers before conditioning the transformer.
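
A minimal sketch of this pipeline follows: sinusoidal features computed from the formulas above, then a small learned MLP. The class name `TimestepEmbedder`, the 256-dimensional frequency vector, and the two-layer SiLU MLP are illustrative assumptions rather than a prescribed design.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int, max_period: float = 10000.0):
    """Sinusoidal features for integer timesteps t: [B] -> [B, dim]."""
    half = dim // 2
    # Frequencies 1 / max_period^(i / half), i.e. 1 / 10000^(2i / d).
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    angles = t.float()[:, None] * freqs[None, :]
    # Sine and cosine halves are concatenated instead of interleaved.
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class TimestepEmbedder(nn.Module):
    """Sinusoidal features followed by a learned projection."""

    def __init__(self, hidden_dim: int, freq_dim: int = 256):
        super().__init__()
        self.freq_dim = freq_dim
        self.mlp = nn.Sequential(
            nn.Linear(freq_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.mlp(timestep_embedding(t, self.freq_dim))

embedder = TimestepEmbedder(hidden_dim=768)
e_t = embedder(torch.randint(0, 1000, (8,)))
print(e_t.shape)  # torch.Size([8, 768])
```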

Adaptive Layer Normalization

Many diffusion transformers use adaptive layer normalization, often called AdaLN.

Instead of fixed normalization parameters, timestep and conditioning embeddings modulate activations.

Standard layer normalization is:

$$ \mathrm{LN}(x) = \gamma \frac{x-\mu}{\sigma} + \beta. $$

Adaptive layer normalization replaces $\gamma$ and $\beta$ with conditioning-dependent parameters:

$$ \gamma = \gamma(c,t), \qquad \beta = \beta(c,t). $$

This allows prompt and timestep information to influence the transformer at every layer.

AdaLN became important because it provides stable conditioning without requiring heavy cross-attention everywhere.
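
The idea can be sketched as a layer norm without its own affine parameters, followed by a scale and shift regressed from the conditioning vector. The module name `AdaLayerNorm` and the `1 + scale` parameterization (so that a zero output from the linear layer reduces to plain normalization) are assumptions of this sketch; here `1 + scale` plays the role of $\gamma(c,t)$ and `shift` the role of $\beta(c,t)$.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from a conditioning vector."""

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        # No learned gamma/beta here: they come from the conditioning instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D] tokens, cond: [B, cond_dim] (e.g. timestep + class embedding)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # Broadcast the per-sample modulation over the token axis.
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]

adaln = AdaLayerNorm(hidden_dim=768, cond_dim=768)
x = torch.randn(8, 1024, 768)
cond = torch.randn(8, 768)
print(adaln(x, cond).shape)  # torch.Size([8, 1024, 768])
```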

Diffusion Transformer Block

A diffusion transformer block usually contains:

  1. Layer normalization
  2. Self-attention
  3. Residual connection
  4. Feedforward network
  5. Conditioning modulation

A simplified structure:

$$ x \rightarrow \mathrm{Attention} \rightarrow \mathrm{MLP} \rightarrow x'. $$

Conditioning enters through:

| Conditioning source | Mechanism |
| --- | --- |
| Timestep | AdaLN or embedding injection |
| Text prompt | Cross-attention |
| Class labels | Embedding modulation |
| Image conditioning | Additional tokens |
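
Putting the listed components together, a block might look like the sketch below. It is one plausible arrangement under stated assumptions, not a canonical definition: the class name `DiTBlock`, the pre-norm ordering, the GELU MLP with expansion ratio 4, and the use of a single pooled conditioning vector for modulation are all illustrative choices.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Pre-norm transformer block with conditioning-dependent scale and shift."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        # Predicts (shift, scale) for both sub-layers from the conditioning vector.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 4 * dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, N, D] tokens, cond: [B, D] conditioning embedding
        shift1, scale1, shift2, scale2 = self.modulation(cond).chunk(4, dim=-1)

        # 1-3: modulated layer norm, self-attention, residual connection
        h = self.norm1(x) * (1 + scale1[:, None]) + shift1[:, None]
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out

        # 4-5: modulated layer norm, feedforward network, residual connection
        h = self.norm2(x) * (1 + scale2[:, None]) + shift2[:, None]
        return x + self.mlp(h)

block = DiTBlock(dim=768, num_heads=12)
x = torch.randn(2, 256, 768)
cond = torch.randn(2, 768)
print(block(x, cond).shape)  # torch.Size([2, 256, 768])
```

Because the block maps `[B, N, D]` to `[B, N, D]`, any number of such blocks can be stacked.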

Cross-Attention Conditioning

Text-to-image diffusion transformers usually use cross-attention.

Text embeddings:

$$ c \in \mathbb{R}^{B\times T\times D} $$

interact with image latent tokens.

The transformer computes:

| Tensor | Source |
| --- | --- |
| Queries | Image tokens |
| Keys | Text tokens |
| Values | Text tokens |

This lets image generation depend on language semantics.

Compared with convolutional U-Nets, transformers integrate multimodal conditioning naturally because all modalities become token sequences.
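
The query/key/value assignment in the table can be illustrated with PyTorch's built-in `nn.MultiheadAttention`. The sizes below (1024 image tokens, 77 text tokens, width 768, 12 heads) are hypothetical values chosen for the example, and the text embeddings are assumed to have already been projected to the same width as the image tokens.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

image_tokens = torch.randn(2, 1024, 768)  # [B, N, D] latent patch tokens
text_tokens = torch.randn(2, 77, 768)     # [B, T, D] prompt embeddings

# Queries come from the image; keys and values come from the text.
out, weights = cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)

print(out.shape)      # torch.Size([2, 1024, 768])
print(weights.shape)  # torch.Size([2, 1024, 77])
```

The output has one vector per image token, and the attention matrix is `N x T` rather than `N x N`, so cross-attention to a short prompt is much cheaper than self-attention over the image tokens.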

DiT Architecture

A canonical diffusion transformer architecture typically includes:

| Stage | Purpose |
| --- | --- |
| Patch embedding | Convert latent patches into tokens |
| Positional embedding | Encode spatial structure |
| Transformer blocks | Perform denoising computation |
| Conditioning layers | Inject timestep and text information |
| Output projection | Convert tokens back to latent patches |

The overall process:

$$ z_t \rightarrow \text{tokens} \rightarrow \text{transformer} \rightarrow \hat{\epsilon} $$

The transformer predicts latent noise, which is reshaped back into latent spatial form.
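
The final reshaping step can be sketched as follows. The helper name `unpatchify` and the linear output head are assumptions for illustration; the sketch assumes tokens are ordered row by row over the patch grid and that each token is projected to `C * P * P` values laid out as `[C, P, P]`.

```python
import torch
import torch.nn as nn

def unpatchify(tokens, channels, patch_size, grid_h, grid_w):
    """[B, N, C * P * P] patch tokens -> [B, C, grid_h * P, grid_w * P]."""
    B = tokens.shape[0]
    P = patch_size
    # Recover the patch grid and per-patch layout: [B, H/P, W/P, C, P, P]
    x = tokens.reshape(B, grid_h, grid_w, channels, P, P)
    # Interleave grid and within-patch axes: [B, C, H/P, P, W/P, P]
    x = x.permute(0, 3, 1, 4, 2, 5)
    return x.reshape(B, channels, grid_h * P, grid_w * P)

# Hypothetical output head: one linear layer from width 768 to C * P * P = 16.
out_proj = nn.Linear(768, 4 * 2 * 2)

hidden = torch.randn(8, 1024, 768)  # transformer output tokens
eps_hat = unpatchify(out_proj(hidden), channels=4, patch_size=2, grid_h=32, grid_w=32)
print(eps_hat.shape)  # torch.Size([8, 4, 64, 64])
```

The predicted noise has the same shape as the noisy latent, which is what the mean squared error loss requires.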

Scaling Properties

Transformers often scale better than convolutional U-Nets as model size increases.

Empirically:

| Increase | Effect |
| --- | --- |
| More parameters | Better sample quality |
| More data | Stronger generalization |
| More compute | Improved prompt adherence |
| Larger context | Better compositionality |

This resembles scaling behavior in language models.

Large diffusion transformers may learn richer semantic structure and stronger multimodal alignment than smaller convolutional models.

Computational Complexity

Transformers also introduce challenges.

Self-attention has quadratic complexity:

$$ O(N^2) $$

with respect to the number of tokens.

For high-resolution images, token count becomes large.

Example:

| Latent resolution | Patch size | Tokens |
| --- | --- | --- |
| $64\times64$ | $2\times2$ | 1024 |
| $128\times128$ | $2\times2$ | 4096 |
| $256\times256$ | $2\times2$ | 16384 |

Attention cost grows rapidly.

Efficient attention methods therefore become important.
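
To make the growth concrete, the short calculation below counts the entries of the $N \times N$ attention matrix for each row of the table. It assumes the matrix is fully materialized with 2-byte (half-precision) elements, and reports the size for a single head and a single sample; the helper name is hypothetical.

```python
def attention_matrix_stats(latent_size: int, patch_size: int, bytes_per_element: int = 2):
    """Return (tokens, matrix entries, MiB) for one head and one sample."""
    n = (latent_size // patch_size) ** 2
    entries = n * n
    return n, entries, entries * bytes_per_element / 2**20

for size in (64, 128, 256):
    n, entries, mib = attention_matrix_stats(size, patch_size=2)
    print(f"{size}x{size}: N={n}, N^2={entries}, {mib:.0f} MiB")

# 64x64: N=1024, N^2=1048576, 2 MiB
# 128x128: N=4096, N^2=16777216, 32 MiB
# 256x256: N=16384, N^2=268435456, 512 MiB
```

Doubling the side length multiplies the token count by 4 and the attention matrix by 16.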

Efficient Attention Methods

Modern diffusion transformers use attention optimizations such as:

| Method | Purpose |
| --- | --- |
| FlashAttention | Faster memory-efficient attention |
| Windowed attention | Restrict local attention |
| Sparse attention | Reduce pairwise interactions |
| Linear attention | Approximate quadratic attention |
| Multi-query attention | Reduce memory usage |
| Token merging | Reduce sequence length |

These techniques make large diffusion transformers practical at high resolution.
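
In PyTorch 2.0 and later, one low-effort way to benefit from fused kernels is `torch.nn.functional.scaled_dot_product_attention`, which computes exact attention and picks a backend automatically; whether a FlashAttention-style kernel is actually used depends on the hardware, dtype, and PyTorch build. The sketch below only checks that the fused call agrees with the explicit formula on small hypothetical shapes.

```python
import math
import torch
import torch.nn.functional as F

# [batch, heads, tokens, head_dim]; sizes are illustrative only.
q = torch.randn(2, 4, 256, 32)
k = torch.randn(2, 4, 256, 32)
v = torch.randn(2, 4, 256, 32)

# Explicit formula: materializes the full [tokens, tokens] score matrix.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
out_naive = scores.softmax(dim=-1) @ v

# Fused call: same result, without exposing the score matrix to the caller.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_fused, atol=1e-4))
```

The approximate methods in the table (windowed, sparse, linear attention, token merging) change the computation itself and are not covered by this call.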

Latent Space and Transformers

Most diffusion transformers operate in latent space rather than pixel space.

This reduces token count dramatically.

Suppose a latent tensor has shape:

[B, 4, 64, 64]

Using $2\times2$ patches:

$$ N = (64/2)^2 = 1024. $$

Without latent compression, operating directly on $512\times512$ pixel images with the same $2\times2$ patches would require $(512/2)^2 = 65{,}536$ tokens, 64 times as many.

Latent diffusion and transformers therefore complement each other:

| Technique | Benefit |
| --- | --- |
| Latent diffusion | Smaller spatial representation |
| Transformers | Flexible global modeling |

Together they enable scalable generative systems.

Diffusion Transformers for Video

Transformers naturally extend to video because videos can also be represented as token sequences.

A latent video tensor:

[B, C, F, H, W]

can be patchified into spatiotemporal tokens.

The transformer then models:

| Dependency type | Example |
| --- | --- |
| Spatial | Relationships within a frame |
| Temporal | Motion across frames |
| Cross-modal | Text-video conditioning |

Video diffusion transformers often use factorized attention:

| Attention type | Scope |
| --- | --- |
| Spatial attention | Within frame |
| Temporal attention | Across frames |
| Cross-attention | Prompt conditioning |

This improves efficiency relative to full attention across all video tokens.
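
Factorized attention is mostly a matter of reshaping, as the sketch below suggests. It assumes tokens are already arranged as `[B, F, N, D]` (frames by tokens per frame), and the function name, layer sizes, and residual placement are hypothetical; normalization, MLPs, and cross-attention are omitted.

```python
import torch
import torch.nn as nn

def factorized_attention(x, spatial_attn, temporal_attn):
    """Spatial attention within frames, then temporal attention across frames."""
    B, F, N, D = x.shape

    # Spatial: fold frames into the batch so each frame attends only to itself.
    xs = x.reshape(B * F, N, D)
    xs = xs + spatial_attn(xs, xs, xs, need_weights=False)[0]
    x = xs.reshape(B, F, N, D)

    # Temporal: fold spatial positions into the batch so each position
    # attends only to the same position in other frames.
    xt = x.permute(0, 2, 1, 3).reshape(B * N, F, D)
    xt = xt + temporal_attn(xt, xt, xt, need_weights=False)[0]
    return xt.reshape(B, N, F, D).permute(0, 2, 1, 3)

B, F, N, D = 2, 8, 256, 64  # hypothetical sizes
spatial_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

video_tokens = torch.randn(B, F, N, D)
out = factorized_attention(video_tokens, spatial_attn, temporal_attn)
print(out.shape)  # torch.Size([2, 8, 256, 64])

# Pairwise interactions per sample: full attention vs. factorized attention.
full_pairs = (F * N) ** 2
factorized_pairs = F * N**2 + N * F**2
print(full_pairs, factorized_pairs)
```

For these sizes the factorized scheme evaluates roughly an eighth as many token pairs as full spatiotemporal attention, at the price of only indirect interaction between tokens that differ in both frame and position.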

Training Diffusion Transformers

Training resembles standard diffusion training.

Given latent:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, $$

the transformer predicts noise:

$$ \epsilon_\theta(z_t,t,c). $$

PyTorch example:

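import torch

# dit, latents (the noisy z_t), timesteps, text_embeddings, and noise
# are assumed to be provided by the surrounding training loop.
pred_noise = dit(
    latents,
    timesteps,
    text_embeddings
)

# Mean squared error between the predicted and the true noise.
loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)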

The objective is unchanged from U-Net diffusion systems.

Advantages of Diffusion Transformers

Diffusion transformers provide several benefits.

| Advantage | Explanation |
| --- | --- |
| Global context modeling | Attention connects distant regions |
| Strong scaling behavior | Similar to large language models |
| Unified multimodal architecture | Text and image tokens integrate naturally |
| Flexible conditioning | Multiple modalities become token streams |
| Better compositionality | Improved concept interaction |

Transformers also simplify architectural unification across text, image, audio, and video generation.

Limitations of Diffusion Transformers

Transformers also have weaknesses.

| Limitation | Cause |
| --- | --- |
| High memory usage | Quadratic attention |
| Large compute cost | Long token sequences |
| Training instability | Very deep transformer optimization |
| Slow inference | Many denoising steps plus attention cost |
| Data hunger | Large transformers require large datasets |

Efficient training and inference remain active research areas.

Relationship to Foundation Models

Diffusion transformers move generative modeling closer to foundation-model architectures.

A single transformer architecture can potentially process:

| Modality | Token type |
| --- | --- |
| Text | Wordpiece tokens |
| Images | Patch tokens |
| Video | Spatiotemporal tokens |
| Audio | Spectrogram tokens |
| 3D scenes | Spatial tokens |

This motivates unified multimodal generative systems.

Modern research increasingly explores:

| Direction | Goal |
| --- | --- |
| Unified token spaces | Shared multimodal representations |
| Joint training | Multiple modalities together |
| World models | Predictive generative simulation |
| Large multimodal transformers | General-purpose generative systems |

Diffusion transformers fit naturally into this trend.

PyTorch Patch Embedding Example

A simple patch embedding layer:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(
        self,
        in_channels=4,
        patch_size=2,
        embed_dim=768,
    ):
        super().__init__()

        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x):
        """
        x: [B, C, H, W]
        """

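        # Non-overlapping P x P convolution: [B, C, H, W] -> [B, D, H/P, W/P]
        x = self.proj(x)

        # Flatten the patch grid: [B, D, H/P, W/P] -> [B, D, N]
        x = x.flatten(2)

        # Move tokens to the sequence axis: [B, D, N] -> [B, N, D]
        x = x.transpose(1, 2)

        return x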

Usage:

latents = torch.randn(8, 4, 64, 64)

patch_embed = PatchEmbed()

tokens = patch_embed(latents)

print(tokens.shape)
# torch.Size([8, 1024, 768])

The latent image becomes a sequence of transformer tokens.

Summary

Diffusion transformers replace convolutional denoising networks with transformer architectures operating on latent token sequences.

Latent tensors are patchified into tokens, processed using self-attention, conditioned on timestep and prompt embeddings, and projected back into latent space for denoising.

The diffusion objective remains unchanged:

$$ \mathcal{L} = \mathbb{E} \left[ \| \epsilon - \epsilon_\theta(z_t,t,c) \|_2^2 \right]. $$

Transformers provide strong global modeling, flexible conditioning, and favorable scaling behavior. Combined with latent diffusion, they form a scalable architecture for modern multimodal generative systems.