Diffusion Transformers

Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships.

However, transformers became increasingly attractive because they scale effectively with model size, support flexible conditioning, and capture long-range dependencies more naturally than convolutional architectures.

Diffusion Transformers, often abbreviated DiTs, replace or augment convolutional U-Nets with transformer-based architectures. Instead of treating images as grids processed by convolutions, DiTs treat latent representations as token sequences processed by self-attention.

This transition mirrors the broader shift toward transformer architectures in computer vision and language modeling.

From U-Nets to Transformers

A standard diffusion U-Net processes tensors such as:

[B, C, H, W]

using convolutional layers, residual blocks, and attention modules.

A diffusion transformer instead converts latent tensors into tokens:

$$ z \in \mathbb{R}^{B \times N \times D} $$

where:

Symbol	Meaning
$B$	Batch size
$N$	Number of tokens
$D$	Embedding dimension

The transformer then processes these tokens using self-attention and feedforward layers.

This changes the denoising problem from spatial convolution to sequence modeling.

Patch Tokenization

Diffusion transformers usually operate on latent patches rather than individual pixels.

Suppose a latent tensor has shape:

[B, C, H, W]

For example:

[B, 4, 64, 64]

The tensor is divided into patches.

If the patch size is:

$$ P \times P, $$

then the number of patches becomes:

$$ N = \frac{H}{P} \cdot \frac{W}{P}. $$

Each patch is flattened and projected into an embedding vector.

For example:

Latent size	Patch size	Number of tokens
$64\times64$	$2\times2$	1024
$64\times64$	$4\times4$	256
$32\times32$	$2\times2$	256

Patch embeddings transform spatial tensors into transformer token sequences.
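The token counts in the table follow directly from the formula for $N$. A minimal sketch, where `num_patch_tokens` is a hypothetical helper name introduced here for illustration and the latent sides are assumed to be divisible by the patch size:

```python
def num_patch_tokens(height, width, patch_size):
    """Return N = (H / P) * (W / P) for a latent of spatial size H x W."""
    if height % patch_size or width % patch_size:
        raise ValueError("latent height and width must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Reproduce the rows of the table above.
for h, w, p in [(64, 64, 2), (64, 64, 4), (32, 32, 2)]:
    print(f"{h}x{w} latent, {p}x{p} patches -> {num_patch_tokens(h, w, p)} tokens")
```

Halving the patch size quadruples the token count, which is why patch size is one of the main levers on compute.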

Transformer Denoising Objective

The diffusion objective remains unchanged.

Given a noisy latent:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, $$

the transformer predicts:

$$ \epsilon_\theta(z_t,t,c). $$

The loss is still:

$$ \mathcal{L} = \mathbb{E} \left[ \| \epsilon - \epsilon_\theta(z_t,t,c) \|_2^2 \right]. $$

The main difference lies in the network architecture, not in the diffusion mathematics.
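The forward noising step and the loss can be sketched in a few lines of PyTorch. This is an illustrative sketch under stated assumptions, not a reference implementation: the linear beta schedule, the 1000-step horizon, the helper name `add_noise`, and the small convolution standing in for $\epsilon_\theta$ are all assumptions made here so that the snippet runs on its own.

```python
import torch
import torch.nn.functional as F


def add_noise(z0, noise, alpha_bar_t):
    """Form z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    # Reshape [B] -> [B, 1, 1, 1] so each per-sample scalar broadcasts over C, H, W.
    a = alpha_bar_t.view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise


# Assumed linear beta schedule with 1000 steps (illustrative values only).
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

z0 = torch.randn(8, 4, 64, 64)        # clean latents
t = torch.randint(0, 1000, (8,))      # one timestep per sample
noise = torch.randn_like(z0)          # epsilon
z_t = add_noise(z0, noise, alpha_bars[t])

# Stand-in for epsilon_theta: any network mapping z_t to a tensor of the same shape.
denoiser = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
loss = F.mse_loss(denoiser(z_t), noise)
```

Note that `mse_loss` averages over all elements, which matches the squared-norm objective up to a constant factor.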

Self-Attention in Diffusion

Transformers process tokens using self-attention.

The attention mechanism computes:

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt{d}} \right)V. $$

genui{"math_block_widget_always_prefetch_v2":{"content":"\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V"}}

For diffusion transformers:

Tensor	Meaning
$Q$	Query embeddings
$K$	Key embeddings
$V$	Value embeddings

Every latent token can attend to every other token.

This provides:

Benefit	Explanation
Global receptive field	Long-range dependencies modeled directly
Flexible conditioning	Easy integration of text tokens
Better scaling	Transformer scaling laws often favorable
Unified architecture	Similarity to language models

Unlike convolutions, attention does not rely on local neighborhoods.
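The attention formula translates almost line for line into code. The sketch below assumes a single head and separate linear projections for $Q$, $K$, and $V$; the function name and the tensor sizes are illustrative choices, and practical models use several heads.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V for tensors of shape [B, N, d]."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # [B, N, N]
    weights = scores.softmax(dim=-1)                  # each row sums to 1
    return weights @ v, weights


tokens = torch.randn(2, 16, 32)                       # [B, N, D]
# Assumed single-head projections producing Q, K, and V from the same tokens.
w_q, w_k, w_v = (torch.nn.Linear(32, 32) for _ in range(3))
out, weights = scaled_dot_product_attention(w_q(tokens), w_k(tokens), w_v(tokens))
```

The `weights` tensor has one row per query token and one column per key token, which makes the all-to-all connectivity explicit.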

Positional Embeddings

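Transformers require positional information because self-attention alone is permutation equivariant: reordering the input tokens simply reorders the outputs, so the layer has no notion of where a token sits.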

Diffusion transformers add positional embeddings:

$$ z'_i = z_i + p_i. $$

The positional vector $p_i$ encodes spatial location.

Common methods include:

Method	Description
Learned embeddings	Trainable position vectors
Sinusoidal embeddings	Fixed Fourier-like encoding
Rotary embeddings	Relative rotational encoding
2D positional embeddings	Separate height and width structure

Without positional embeddings, the model would not know where image patches belong spatially.
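As one possible illustration, a fixed 2D sinusoidal embedding can be built by encoding the row and column index of each patch separately and concatenating the two halves. The helper names `sincos_1d` and `sincos_2d`, the base of 10000, and the sizes used are assumptions for this sketch; learned or rotary embeddings would be used differently.

```python
import torch


def sincos_1d(positions, dim):
    """Sinusoidal encoding of 1D positions -> [len(positions), dim]; dim must be even."""
    freqs = 1.0 / (10000.0 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions.float()[:, None] * freqs[None, :]        # [L, dim / 2]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)      # [L, dim]


def sincos_2d(grid_h, grid_w, dim):
    """Concatenate row and column encodings -> [grid_h * grid_w, dim]."""
    rows, cols = torch.meshgrid(
        torch.arange(grid_h), torch.arange(grid_w), indexing="ij"
    )
    half = dim // 2
    return torch.cat(
        [sincos_1d(rows.flatten(), half), sincos_1d(cols.flatten(), half)], dim=-1
    )


tokens = torch.randn(8, 1024, 768)      # [B, N, D] from a 32 x 32 patch grid
pos = sincos_2d(32, 32, 768)            # [N, D], row-major patch order
tokens_with_pos = tokens + pos[None]    # z'_i = z_i + p_i, broadcast over the batch
```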

Timestep Conditioning

The denoising network must know the diffusion timestep.

A timestep embedding:

$$ e_t = \mathrm{Embed}(t) $$

is injected into transformer blocks.

Sinusoidal timestep embeddings are common:

$$ \mathrm{PE}(t,2i) = \sin \left( \frac{t}{10000^{2i/d}} \right), $$

$$ \mathrm{PE}(t,2i+1) = \cos \left( \frac{t}{10000^{2i/d}} \right). $$

These embeddings are passed through learned projection layers before conditioning the transformer.
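The two formulas above can be implemented directly by filling even indices with sines and odd indices with cosines. In this sketch the function name `timestep_embedding`, the encoding width of 256, and the two-layer projection `time_mlp` are illustrative assumptions, not values taken from a specific model.

```python
import torch
import torch.nn as nn


def timestep_embedding(t, dim):
    """PE(t, 2i) = sin(t / 10000^(2i/d)), PE(t, 2i+1) = cos(t / 10000^(2i/d))."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    angles = t.float()[:, None] / (10000.0 ** (2 * i / dim))[None, :]   # [B, dim / 2]
    emb = torch.zeros(t.shape[0], dim)
    emb[:, 0::2] = angles.sin()   # even indices
    emb[:, 1::2] = angles.cos()   # odd indices
    return emb


# Assumed projection: a small MLP mapping the fixed encoding to the model width.
time_mlp = nn.Sequential(nn.Linear(256, 768), nn.SiLU(), nn.Linear(768, 768))

t = torch.tensor([0, 10, 999])
e_t = time_mlp(timestep_embedding(t, 256))    # [3, 768]
```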

Adaptive Layer Normalization

Many diffusion transformers use adaptive layer normalization, often called AdaLN.

Instead of fixed normalization parameters, timestep and conditioning embeddings modulate activations.

Standard layer normalization is:

$$ \mathrm{LN}(x) = \gamma \frac{x-\mu}{\sigma} + \beta. $$

Adaptive layer normalization replaces $\gamma$ and $\beta$ with conditioning-dependent parameters:

$$ \gamma = \gamma(c,t), \qquad \beta = \beta(c,t). $$

This allows prompt and timestep information to influence the transformer at every layer.

AdaLN became important because it provides stable conditioning without requiring heavy cross-attention everywhere.
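A minimal sketch of the idea, assuming the timestep and conditioning embeddings have already been combined into a single vector per sample. The class name `AdaLayerNorm` and the single linear layer that predicts $\gamma$ and $\beta$ are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AdaLayerNorm(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        # No learned affine parameters: gamma and beta come from the conditioning.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        """x: [B, N, D] tokens, cond: [B, cond_dim] conditioning vector."""
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Unsqueeze so the per-sample parameters broadcast over the N tokens.
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


ada_ln = AdaLayerNorm(dim=768, cond_dim=768)
x = torch.randn(8, 1024, 768)
cond = torch.randn(8, 768)
y = ada_ln(x, cond)            # [8, 1024, 768]
```

Some implementations apply the predicted scale as `1 + scale`, so that a zero output from the conditioning layer leaves the normalized activations unchanged; the sketch uses the plain form to match the equation above.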

Diffusion Transformer Block

A diffusion transformer block usually contains:

Layer normalization
Self-attention
Residual connection
Feedforward network
Conditioning modulation

A simplified structure:

$$ x \rightarrow \mathrm{Attention} \rightarrow \mathrm{MLP} \rightarrow x'. $$

Conditioning enters through:

Conditioning source	Mechanism
Timestep	AdaLN or embedding injection
Text prompt	Cross-attention
Class labels	Embedding modulation
Image conditioning	Additional tokens
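The components listed above can be assembled into a compact block. The following is a sketch under stated assumptions rather than a faithful copy of any published architecture: the class name `DiTBlockSketch`, the pre-norm layout, the GELU MLP, and the single linear `modulation` layer are illustrative choices, and cross-attention is omitted.

```python
import torch
import torch.nn as nn


class DiTBlockSketch(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # One (gamma, beta) pair per sub-layer, predicted from the conditioning vector.
        self.modulation = nn.Linear(dim, 4 * dim)

    def forward(self, x, cond):
        """x: [B, N, D] tokens, cond: [B, D] conditioning embedding."""
        g1, b1, g2, b2 = self.modulation(cond).unsqueeze(1).chunk(4, dim=-1)
        h = g1 * self.norm1(x) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        h = g2 * self.norm2(x) + b2
        return x + self.mlp(h)                               # feedforward + residual


block = DiTBlockSketch(dim=64, num_heads=4)
x = torch.randn(2, 256, 64)
cond = torch.randn(2, 64)
out = block(x, cond)           # [2, 256, 64]
```

Because input and output shapes match, blocks of this kind can be stacked to any depth.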

Cross-Attention Conditioning

Text-to-image diffusion transformers often use cross-attention.

Text embeddings:

$$ c \in \mathbb{R}^{B\times T\times D} $$

interact with image latent tokens.

The transformer computes:

Tensor	Source
Queries	Image tokens
Keys	Text tokens
Values	Text tokens

This lets image generation depend on language semantics.

Compared with convolutional U-Nets, transformers integrate multimodal conditioning naturally because all modalities become token sequences.
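The query, key, and value assignment in the table can be illustrated with PyTorch's built-in `nn.MultiheadAttention`. The sizes below (width 64, 4 heads, 256 image tokens, 77 text tokens) are arbitrary values chosen for the sketch, and the text embeddings are assumed to have already been projected to the same width as the image tokens.

```python
import torch
import torch.nn as nn

dim = 64
cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

image_tokens = torch.randn(2, 256, dim)   # [B, N, D] latent patch tokens
text_tokens = torch.randn(2, 77, dim)     # [B, T, D] text embeddings (T = 77 assumed)

# Queries come from the image; keys and values come from the text.
attended, weights = cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)

updated_tokens = image_tokens + attended  # residual update of the image tokens
```

The output keeps the image sequence length $N$, while the attention weights have shape `[B, N, T]`: each image token distributes its attention over the text tokens.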

DiT Architecture

A canonical diffusion transformer architecture typically includes:

Stage	Purpose
Patch embedding	Convert latent patches into tokens
Positional embedding	Encode spatial structure
Transformer blocks	Perform denoising computation
Conditioning layers	Inject timestep and text information
Output projection	Convert tokens back to latent patches

The overall process:

$$ z_t \rightarrow \text{tokens} \rightarrow \text{transformer} \rightarrow \hat{\epsilon} $$

The transformer predicts latent noise, which is reshaped back into latent spatial form.
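The final reshaping step can be sketched as follows, assuming each output token is linearly projected to the $P \times P \times C$ values of its own patch. The helper name `unpatchify`, the within-token ordering (patch row, patch column, channel), and the sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn


def unpatchify(tokens, channels, grid_h, grid_w, patch_size):
    """[B, N, P*P*C] -> [B, C, H, W], with N = grid_h * grid_w."""
    b = tokens.shape[0]
    p = patch_size
    x = tokens.reshape(b, grid_h, grid_w, p, p, channels)
    x = x.permute(0, 5, 1, 3, 2, 4)               # [B, C, grid_h, P, grid_w, P]
    return x.reshape(b, channels, grid_h * p, grid_w * p)


# Output projection: each token predicts the P*P*C noise values of its own patch.
out_proj = nn.Linear(768, 2 * 2 * 4)

hidden = torch.randn(8, 1024, 768)                # transformer output tokens
pred_noise = unpatchify(out_proj(hidden), channels=4, grid_h=32, grid_w=32, patch_size=2)
```

The result has the same shape as the input latent, so it can be compared directly with the sampled noise.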

Scaling Properties

Transformers often scale better than convolutional U-Nets as model size increases.

Empirically:

Increase	Effect
More parameters	Better sample quality
More data	Stronger generalization
More compute	Improved prompt adherence
Larger context	Better compositionality

This resembles scaling behavior in language models.

Large diffusion transformers may learn richer semantic structure and stronger multimodal alignment than smaller convolutional models.

Computational Complexity

Transformers also introduce challenges.

Self-attention has quadratic complexity:

$$ O(N^2) $$

with respect to the number of tokens.

For high-resolution images, token count becomes large.

Example:

Latent resolution	Patch size	Tokens
$64\times64$	$2\times2$	1024
$128\times128$	$2\times2$	4096
$256\times256$	$2\times2$	16384

Attention cost grows rapidly.

Efficient attention methods therefore become important.
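To make the growth concrete, the number of pairwise attention scores per head and per layer can be tabulated for the latent resolutions above. The helper name `attention_matrix_entries` is hypothetical, and the count ignores the embedding dimension and the number of heads.

```python
def attention_matrix_entries(latent_side, patch_size=2):
    """Return (N, N * N) for a square latent: token count and attention-score count."""
    n = (latent_side // patch_size) ** 2
    return n, n * n


for side in (64, 128, 256):
    n, pairs = attention_matrix_entries(side)
    print(f"{side}x{side}: N = {n:>6}, N^2 = {pairs:>12,}")
```

Doubling the latent side length quadruples $N$ and multiplies the number of attention scores by 16.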

Efficient Attention Methods

Modern diffusion transformers use attention optimizations such as:

Method	Purpose
FlashAttention	Faster memory-efficient attention
Windowed attention	Restrict local attention
Sparse attention	Reduce pairwise interactions
Linear attention	Approximate quadratic attention
Multi-query attention	Reduce memory usage
Token merging	Reduce sequence length

These techniques make large diffusion transformers practical at high resolution.

Latent Space and Transformers

Most diffusion transformers operate in latent space rather than pixel space.

This reduces token count dramatically.

Suppose a latent tensor has shape:

[B, 4, 64, 64]

Using $2\times2$ patches:

$$ N = (64/2)^2 = 1024. $$

Without latent compression, the cost is far higher: with the same $2\times2$ patches, operating directly on a $512\times512$ image would require $(512/2)^2 = 65536$ tokens, 64 times as many.

Latent diffusion and transformers therefore complement each other:

Technique	Benefit
Latent diffusion	Smaller spatial representation
Transformers	Flexible global modeling

Together they enable scalable generative systems.

Diffusion Transformers for Video

Transformers naturally extend to video because videos can also be represented as token sequences.

A latent video tensor:

[B, C, F, H, W]

can be patchified into spatiotemporal tokens.

The transformer then models:

Dependency type	Example
Spatial	Relationships within a frame
Temporal	Motion across frames
Cross-modal	Text-video conditioning

Video diffusion transformers often use factorized attention:

Attention type	Scope
Spatial attention	Within frame
Temporal attention	Across frames
Cross-attention	Prompt conditioning

This improves efficiency relative to full attention across all video tokens.
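Factorized attention is mostly a matter of reshaping: frames are folded into the batch for spatial attention, and spatial positions are folded into the batch for temporal attention. The sketch below assumes the video has already been patchified into a `[B, F, S, D]` token tensor; the sizes and the use of `nn.MultiheadAttention` are illustrative assumptions, and cross-attention is omitted.

```python
import torch
import torch.nn as nn

batch, frames, spatial, width = 2, 8, 64, 32
tokens = torch.randn(batch, frames, spatial, width)   # [B, F, S, D]

spatial_attn = nn.MultiheadAttention(width, num_heads=4, batch_first=True)
temporal_attn = nn.MultiheadAttention(width, num_heads=4, batch_first=True)

# Spatial attention: fold frames into the batch so each frame attends within itself.
x = tokens.reshape(batch * frames, spatial, width)
x = x + spatial_attn(x, x, x, need_weights=False)[0]
x = x.reshape(batch, frames, spatial, width)

# Temporal attention: fold spatial positions into the batch so each location
# attends across frames.
y = x.permute(0, 2, 1, 3).reshape(batch * spatial, frames, width)
y = y + temporal_attn(y, y, y, need_weights=False)[0]
out = y.reshape(batch, spatial, frames, width).permute(0, 2, 1, 3)   # [B, F, S, D]
```

With these sizes, full attention over all $8 \cdot 64 = 512$ tokens would compute $512^2 = 262144$ scores per sample, while the factorized version computes $8 \cdot 64^2 + 64 \cdot 8^2 = 36864$.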

Training Diffusion Transformers

Training resembles standard diffusion training.

Given a noisy latent:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, $$

the transformer predicts noise:

$$ \epsilon_\theta(z_t,t,c). $$

PyTorch example:

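import torch

# `dit`, `noisy_latents`, `timesteps`, `text_embeddings`, and `noise` are
# assumed to come from the surrounding training loop.
pred_noise = dit(
    noisy_latents,     # z_t, shape [B, C, H, W]
    timesteps,         # t, shape [B]
    text_embeddings,   # c, shape [B, T, D]
)

# Mean squared error between the predicted and the sampled noise.
loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise,
)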

The objective is unchanged from U-Net diffusion systems.

Advantages of Diffusion Transformers

Diffusion transformers provide several benefits.

Advantage	Explanation
Global context modeling	Attention connects distant regions
Strong scaling behavior	Similar to large language models
Unified multimodal architecture	Text and image tokens integrate naturally
Flexible conditioning	Multiple modalities become token streams
Better compositionality	Improved concept interaction

Transformers also simplify architectural unification across text, image, audio, and video generation.

Limitations of Diffusion Transformers

Transformers also have weaknesses.

Limitation	Cause
High memory usage	Quadratic attention
Large compute cost	Long token sequences
Training instability	Very deep transformer optimization
Slow inference	Many denoising steps plus attention cost
Data hunger	Large transformers require large datasets

Efficient training and inference remain active research areas.

Relationship to Foundation Models

Diffusion transformers move generative modeling closer to foundation-model architectures.

A single transformer architecture can potentially process:

Modality	Token type
Text	Subword tokens
Images	Patch tokens
Video	Spatiotemporal tokens
Audio	Spectrogram tokens
3D scenes	Spatial tokens

This motivates unified multimodal generative systems.

Modern research increasingly explores:

Direction	Goal
Unified token spaces	Shared multimodal representations
Joint training	Multiple modalities together
World models	Predictive generative simulation
Large multimodal transformers	General-purpose generative systems

Diffusion transformers fit naturally into this trend.

PyTorch Patch Embedding Example

A simple patch embedding layer:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(
        self,
        in_channels=4,
        patch_size=2,
        embed_dim=768,
    ):
        super().__init__()

        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x):
        """
        x: [B, C, H, W]
        """

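        # Non-overlapping P x P convolution: one output position per patch.
        x = self.proj(x)          # [B, D, H/P, W/P]

        # Flatten the patch grid and move the embedding axis last.
        x = x.flatten(2)          # [B, D, N]
        x = x.transpose(1, 2)     # [B, N, D]

        return x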

Usage:

latents = torch.randn(8, 4, 64, 64)

patch_embed = PatchEmbed()

tokens = patch_embed(latents)

print(tokens.shape)
# torch.Size([8, 1024, 768])

The latent image becomes a sequence of transformer tokens.

Summary

Diffusion transformers replace convolutional denoising networks with transformer architectures operating on latent token sequences.

Latent tensors are patchified into tokens, processed using self-attention, conditioned on timestep and prompt embeddings, and projected back into latent space for denoising.

The diffusion objective remains unchanged:

$$ \mathcal{L} = \mathbb{E} \left[ \| \epsilon - \epsilon_\theta(z_t,t,c) \|_2^2 \right]. $$

Transformers provide strong global modeling, flexible conditioning, and favorable scaling behavior. Combined with latent diffusion, they form a scalable architecture for modern multimodal generative systems.