Video Diffusion Systems

Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time.

A video sample can be represented as a tensor:

$$ x_0 \in \mathbb{R}^{B \times C \times F \times H \times W} $$

where $B$ is batch size, $C$ is channels, $F$ is the number of frames, $H$ is height, and $W$ is width.

For example:

video = torch.randn(2, 3, 16, 256, 256)

This represents a batch of 2 videos, each with 3 color channels, 16 frames, and spatial resolution $256 \times 256$.
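
Decoded video often arrives as 8-bit frames with channels last, so it has to be rearranged into the layout above. The following is a minimal sketch of that conversion; the helper name `to_model_layout`, the `[B, F, H, W, C]` input layout, and the scaling to $[-1, 1]$ are illustrative assumptions, not something the text prescribes.

```python
import torch

def to_model_layout(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert [B, F, H, W, C] uint8 frames to [B, C, F, H, W] floats in [-1, 1]."""
    x = frames_uint8.permute(0, 4, 1, 2, 3).float()  # move channels in front of frames
    return x / 127.5 - 1.0                            # assumed scaling convention

raw = torch.randint(0, 256, (2, 16, 32, 32, 3), dtype=torch.uint8)
video = to_model_layout(raw)
print(video.shape)  # torch.Size([2, 3, 16, 32, 32])
```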

From Image Generation to Video Generation

Image diffusion models learn a distribution over images:

$$ p_\theta(x) $$

Video diffusion models learn a distribution over frame sequences:

$$ p_\theta(x_{1:F}) $$

where $x_{1:F}$ denotes all frames in the video.

The added difficulty is temporal coherence. A good video model must satisfy both spatial and temporal constraints.

| Requirement | Meaning |
| --- | --- |
| Spatial quality | Each frame should look realistic |
| Temporal coherence | Objects should remain consistent across frames |
| Motion realism | Movement should follow plausible dynamics |
| Long-range consistency | Scene identity should persist over time |
| Prompt alignment | Video should match the text or conditioning input |

A model that generates good individual frames may still fail as a video model if objects flicker, identities change, or motion appears unstable.

Forward Diffusion for Video

The forward diffusion process is the same as in image diffusion, but it is applied to video tensors:

$$ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon $$

where

$$ \epsilon \sim \mathcal{N}(0,I) $$

has the same shape as the video.

In PyTorch:

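import torch

def q_sample_video(x0, t, alpha_bars):
    # x0: [B, C, F, H, W]; t: [B] integer timesteps; alpha_bars: [T]
    noise = torch.randn_like(x0)

    # Pick one alpha_bar per sample and reshape it to broadcast over C, F, H, W
    alpha_bar_t = alpha_bars[t].view(-1, 1, 1, 1, 1)

    xt = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * noise

    return xt, noise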

If x0 has shape [B, C, F, H, W], then xt and noise have the same shape.

The mathematics is unchanged. The challenge lies in the denoising network, which must model correlations across both space and time.
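
To make the forward process concrete, here is a small sketch that builds $\bar{\alpha}_t$ from a noise schedule and noises two clips at different timesteps. The linear beta schedule with endpoints `1e-4` and `0.02` and the helper name `add_noise` are assumptions chosen for illustration; the text does not specify a schedule.

```python
import torch

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)   # assumed linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative product gives alpha_bar_t

def add_noise(x0, t, alpha_bars):
    """Noise a video batch x0 [B, C, F, H, W] at per-sample timesteps t [B]."""
    noise = torch.randn_like(x0)
    a = alpha_bars[t].view(-1, 1, 1, 1, 1)      # broadcast over C, F, H, W
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise

x0 = torch.randn(2, 3, 4, 8, 8)
t = torch.tensor([10, 990])                     # one early step, one late step
xt, noise = add_noise(x0, t, alpha_bars)
print(alpha_bars[t])  # first value is close to 1, second is close to 0
```

At the early step the sample is still dominated by the clean video; at the late step it is almost pure noise.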

Latent Video Diffusion

Pixel-space video diffusion is extremely expensive. A short video may contain dozens of high-resolution frames.

For example, a 16-frame RGB video at $512 \times 512$ resolution contains:

$$ 16 \times 3 \times 512 \times 512 = 12{,}582{,}912 $$

values per sample.

Latent video diffusion reduces this cost by encoding frames into latent representations.

The encoder $\mathcal{E}$ of an image autoencoder maps each frame into a latent:

$$ z_0 = \mathcal{E}(x_0) $$

For video, the latent tensor may have shape:

[B, C_z, F, H_z, W_z]

For example:

latents = torch.randn(2, 4, 16, 64, 64)
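
Because an image autoencoder operates on single frames, one way to apply it to a clip is to fold the frame axis into the batch axis, encode, and unfold again. The sketch below shows only that reshaping logic. The `toy_encoder` is a single strided convolution standing in for a trained encoder, and `encode_video` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained image encoder: 8x spatial downsampling, 4 latent channels.
toy_encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)

def encode_video(video, encoder):
    """Encode [B, C, F, H, W] frame by frame into [B, C_z, F, H_z, W_z]."""
    b, c, f, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)  # [B*F, C, H, W]
    z = encoder(frames)                                            # [B*F, C_z, H_z, W_z]
    return z.reshape(b, f, *z.shape[1:]).permute(0, 2, 1, 3, 4)    # [B, C_z, F, H_z, W_z]

video = torch.randn(2, 3, 16, 64, 64)
with torch.no_grad():
    latents = encode_video(video, toy_encoder)
print(latents.shape)  # torch.Size([2, 4, 16, 8, 8])
```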

Diffusion then operates on latent videos:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon. $$

After denoising, a decoder converts latent frames back to pixel frames.
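
The saving can be checked with simple arithmetic. This sketch compares the pixel-space example with the latent shape shown above, assuming that the $64 \times 64$ latent corresponds to 8x spatial downsampling of $512 \times 512$ frames (the text does not state the factor explicitly).

```python
from math import prod

pixel_shape = (3, 16, 512, 512)  # C, F, H, W from the pixel-space example
latent_shape = (4, 16, 64, 64)   # C_z, F, H_z, W_z, assuming 8x spatial downsampling

pixel_values = prod(pixel_shape)
latent_values = prod(latent_shape)

print(pixel_values)                   # 12582912
print(latent_values)                  # 262144
print(pixel_values // latent_values)  # 48
```

Under these assumptions the diffusion model processes 48 times fewer values per sample.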

Temporal Modeling

A video diffusion model needs temporal structure. The model must understand how information changes across frames.

Common temporal modeling methods include:

| Method | Description |
| --- | --- |
| 3D convolutions | Apply convolution over time, height, and width |
| Temporal attention | Let frames attend to other frames |
| Factorized attention | Separate spatial attention and temporal attention |
| Recurrent state | Carry information between frames |
| Motion modules | Add temporal layers to an image diffusion backbone |
| Transformer blocks | Model spatiotemporal token sequences |

The simplest extension is a 3D U-Net. It replaces 2D convolution layers with 3D convolution layers, so feature maps gain a frame axis:

$$ [B, C, H, W] \rightarrow [B, C, F, H, W]. $$

However, full 3D computation is costly. Many modern systems instead start from a strong image diffusion model and add temporal modules.
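
A quick way to see the added cost is to compare a 2D and a 3D convolution with the same channel widths. The sketch below uses standard PyTorch layers; the channel count and tensor sizes are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv3d = nn.Conv3d(64, 64, kernel_size=3, padding=1)

image_features = torch.randn(1, 64, 32, 32)     # [B, C, H, W]
video_features = torch.randn(1, 64, 8, 32, 32)  # [B, C, F, H, W]

print(conv2d(image_features).shape)  # torch.Size([1, 64, 32, 32])
print(conv3d(video_features).shape)  # torch.Size([1, 64, 8, 32, 32])

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(count_params(conv2d))  # 36928
print(count_params(conv3d))  # 110656
```

The 3x3x3 kernel has three times the weights of the 3x3 kernel, and it is applied at every frame position as well as every spatial position.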

Factorized Space-Time Attention

Full attention over video tokens is expensive.

Suppose a latent video has $F$ frames, each with $H \times W$ spatial positions:

$$ F \times H \times W $$

The number of tokens is:

$$ N = FHW. $$

Full self-attention has cost:

$$ O(N^2). $$

For video, this becomes expensive quickly.

Factorized attention reduces cost by separating spatial and temporal attention.

Spatial attention attends within each frame:

$$ O(F(HW)^2) $$

Temporal attention attends across frames at each spatial location:

$$ O(HW F^2) $$

Together, the two passes cost $O(FHW(HW + F))$, which is usually far cheaper than full attention:

$$ O((FHW)^2). $$
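
Plugging in numbers makes the gap concrete. The sketch below counts query-key pairs only (it ignores the feature dimension and constant factors), using an illustrative latent size of 16 frames at $64 \times 64$.

```python
F, H, W = 16, 64, 64      # frames and latent spatial size (illustrative values)
hw = H * W

full = (F * hw) ** 2      # every token attends to every token
spatial = F * hw ** 2     # attention within each frame
temporal = hw * F ** 2    # attention across frames at each spatial location
factorized = spatial + temporal

print(full)               # 4294967296
print(factorized)         # 269484032
print(full / factorized)  # about 15.9
```

The ratio equals $FHW / (HW + F)$, so for these sizes the saving is close to a factor of $F$.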

The model can first learn spatial relationships within each frame, then learn how features move over time.
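
In code, the factorization is mostly a matter of reshaping so that the attention layer sees the right axis as the sequence. The following is a minimal sketch, assuming residual connections and omitting the normalization and feed-forward layers a real block would include. The class name `FactorizedAttention` is hypothetical.

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Spatial attention within frames, then temporal attention across frames."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, c, f, h, w = x.shape
        # Spatial pass: each frame is a sequence of H*W tokens.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * f, h * w, c)
        s = s + self.spatial(s, s, s, need_weights=False)[0]
        # Temporal pass: each spatial location is a sequence of F tokens.
        t = s.reshape(b, f, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, f, c)
        t = t + self.temporal(t, t, t, need_weights=False)[0]
        # Restore the [B, C, F, H, W] layout.
        return t.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)

x = torch.randn(2, 32, 4, 8, 8)
out = FactorizedAttention(dim=32)(x)
print(out.shape)  # torch.Size([2, 32, 4, 8, 8])
```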

Text-to-Video Conditioning

Text-to-video generation conditions the reverse process on a prompt:

$$ p_\theta(z_{t-1}\mid z_t, c) $$

where

$$ c = \mathrm{TextEncoder}(y). $$

The prompt may specify:

| Prompt element | Video effect |
| --- | --- |
| Subject | What appears |
| Action | What moves |
| Scene | Where it happens |
| Camera motion | How viewpoint changes |
| Style | Visual appearance |
| Duration hints | Event structure |

Examples:

"A panda surfing on a wave, cinematic lighting"
"A drone shot flying over a futuristic city at sunset"

Text conditioning is usually injected using cross-attention, as in text-to-image systems.
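
As a rough sketch of cross-attention conditioning, video tokens can act as queries while the text embeddings supply keys and values. The dimensions below are arbitrary, and the single `nn.MultiheadAttention` layer is a simplified stand-in for the cross-attention blocks inside a real denoiser.

```python
import torch
import torch.nn as nn

dim, text_dim = 32, 48
cross_attn = nn.MultiheadAttention(
    dim, num_heads=4, kdim=text_dim, vdim=text_dim, batch_first=True
)

z = torch.randn(2, dim, 4, 8, 8)  # latent video features [B, C, F, H, W]
c = torch.randn(2, 7, text_dim)   # text embeddings [B, T_text, D]

b, ch, f, h, w = z.shape
tokens = z.flatten(2).transpose(1, 2)                       # [B, F*H*W, C], the queries
attended, _ = cross_attn(tokens, c, c, need_weights=False)  # keys and values from text
out = (tokens + attended).transpose(1, 2).reshape(b, ch, f, h, w)
print(out.shape)  # torch.Size([2, 32, 4, 8, 8])
```

Every spatiotemporal token reads from the same prompt embedding, which is why a single prompt can influence all frames.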

Image-to-Video Generation

Image-to-video models start from a still image and generate motion.

Given an image $x_\text{ref}$, the model learns:

$$ p_\theta(x_{1:F}\mid x_\text{ref}) $$

The first frame, appearance, or identity should remain consistent with the reference image.

Common conditioning methods include:

| Conditioning method | Purpose |
| --- | --- |
| Reference image embedding | Preserve identity and style |
| First-frame conditioning | Anchor the generated video |
| Depth or pose control | Guide motion geometry |
| Optical flow hints | Guide frame-to-frame movement |
| Camera trajectory | Control viewpoint changes |

Image-to-video is often easier than pure text-to-video because the model receives concrete visual structure at the start.
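
One possible way to implement first-frame conditioning is to concatenate the reference latent to the noisy latent video along the channel axis, together with a mask that marks the anchored frame. This is an illustrative sketch of that idea, not a description of any particular system. The helper name `add_first_frame_condition` and the mask channel are assumptions.

```python
import torch

def add_first_frame_condition(z_t, z_ref):
    """Concatenate a reference latent to every frame of a noisy latent video.

    z_t:   [B, C_z, F, H_z, W_z] noisy latent video
    z_ref: [B, C_z, H_z, W_z]    latent of the reference image
    Returns [B, 2 * C_z + 1, F, H_z, W_z].
    """
    b, c, f, h, w = z_t.shape
    ref = z_ref.unsqueeze(2).expand(-1, -1, f, -1, -1)  # repeat over frames
    mask = torch.zeros(b, 1, f, h, w, dtype=z_t.dtype, device=z_t.device)
    mask[:, :, 0] = 1.0                                  # flag the anchored frame
    return torch.cat([z_t, ref, mask], dim=1)

z_t = torch.randn(2, 4, 16, 8, 8)
z_ref = torch.randn(2, 4, 8, 8)
model_input = add_first_frame_condition(z_t, z_ref)
print(model_input.shape)  # torch.Size([2, 9, 16, 8, 8])
```

The denoiser's first layer would then need to accept the widened channel count.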

Motion Consistency

Motion consistency is the central problem in video generation.

A model must preserve:

| Consistency type | Example |
| --- | --- |
| Object identity | Same person or object across frames |
| Geometry | Stable shape and viewpoint |
| Texture | Clothing, fur, material consistency |
| Lighting | Stable illumination |
| Background | Scene remains coherent |
| Camera motion | Smooth viewpoint movement |

Without temporal modeling, an image diffusion model applied independently to each frame produces flicker. Each frame may look plausible, but the sequence fails as video.

Temporal layers reduce flicker by sharing information across frames.
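
Flicker can be made measurable with a crude proxy: the mean absolute change between consecutive frames. The sketch below uses synthetic tensors rather than generated video, and `temporal_difference` is a hypothetical helper, not a standard metric.

```python
import torch

def temporal_difference(video):
    """Mean absolute change between consecutive frames of [B, C, F, H, W]."""
    return (video[:, :, 1:] - video[:, :, :-1]).abs().mean()

torch.manual_seed(0)
base = torch.randn(1, 3, 1, 16, 16)

# A slowly varying clip: one shared frame plus a small per-frame perturbation.
smooth = base + 0.05 * torch.randn(1, 3, 8, 16, 16)
# Independent frames, as if each were sampled with no shared information.
independent = torch.randn(1, 3, 8, 16, 16)

print(temporal_difference(smooth) < temporal_difference(independent))  # tensor(True)
```

A real evaluation would account for legitimate motion, for example by comparing frames after optical-flow warping, since large frame differences are not always flicker.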

Training Data for Video Diffusion

Video diffusion models require large datasets of video clips.

Training data usually includes:

| Data | Use |
| --- | --- |
| Video frames | Visual supervision |
| Captions | Text conditioning |
| Timestamps | Temporal order |
| Motion metadata | Optional control |
| Audio | Optional multimodal conditioning |

Video data is harder to curate than image data because it has more failure modes:

| Issue | Effect |
| --- | --- |
| Low resolution | Weak visual detail |
| Compression artifacts | Learned artifacts |
| Watermarks | Undesired generations |
| Poor captions | Weak prompt alignment |
| Scene cuts | Broken temporal continuity |
| Camera shake | Noisy motion patterns |

Good video training data should contain coherent clips, accurate captions, and diverse motion.

Frame Rate and Duration

Video models must choose a frame rate and duration.

A model may generate:

16 frames at 8 fps

which gives 2 seconds of video.

Or:

64 frames at 24 fps

which gives about 2.67 seconds of video.

Higher frame rate improves smoothness but increases compute. Longer duration improves usefulness but makes long-range consistency harder.

| Design choice | Tradeoff |
| --- | --- |
| More frames | Better duration, higher compute |
| Higher resolution | Better quality, higher memory |
| Higher fps | Smoother motion, harder modeling |
| Longer clips | Better storytelling, harder consistency |
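
The duration figures above follow directly from dividing the frame count by the frame rate, as this small sketch shows. The function name is an illustrative choice.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Duration of a clip in seconds."""
    return num_frames / fps

print(clip_duration_seconds(16, 8))             # 2.0
print(round(clip_duration_seconds(64, 24), 2))  # 2.67
```

Note that the second configuration needs four times as many frames for only about a third more playback time.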

Sliding Window Generation

Long video generation often uses sliding windows.

Instead of generating all frames at once, the model generates a short clip, then conditions the next clip on previous frames.

Example:

$$ x_{1:16} \rightarrow x_{9:24} \rightarrow x_{17:32} $$

The overlapping frames help maintain continuity.

However, errors can accumulate. If the generated video drifts, later windows may become inconsistent with earlier ones.
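
The window layout in the example can be generated with a short helper. This sketch only computes frame index ranges; it assumes a fixed window length and stride, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(total_frames, window, stride):
    """Return 1-indexed inclusive (start, end) frame ranges."""
    windows = []
    start = 1
    while start + window - 1 <= total_frames:
        windows.append((start, start + window - 1))
        start += stride
    return windows

print(sliding_windows(total_frames=32, window=16, stride=8))
# [(1, 16), (9, 24), (17, 32)]
```

With a window of 16 and a stride of 8, consecutive windows share 8 frames, and those shared frames are what the next clip is conditioned on.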

Multi-Stage Video Generation

Many systems use multiple stages.

A typical pipeline:

  1. Generate low-resolution video latents
  2. Upscale spatial resolution
  3. Interpolate or refine frames
  4. Apply temporal super-resolution
  5. Decode to final video

This separates global motion from fine detail.

| Stage | Goal |
| --- | --- |
| Base generation | Scene and motion |
| Spatial upsampling | Higher resolution |
| Temporal upsampling | More frames |
| Refinement | Remove artifacts |

Multi-stage systems are easier to scale because each stage solves a narrower problem.
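
It can help to track how the video shape changes from stage to stage. The sketch below does only that bookkeeping. All scale factors and the starting shape are illustrative assumptions, not values from a specific system, and the refinement stage is omitted because it leaves the shape unchanged.

```python
# Each stage maps a (frames, height, width) shape to a new one.
stages = [
    ("base generation (latent)", lambda f, h, w: (f, h, w)),
    ("spatial upsampling x2", lambda f, h, w: (f, h * 2, w * 2)),
    ("temporal upsampling x4", lambda f, h, w: (f * 4, h, w)),
    ("decode latents x8", lambda f, h, w: (f, h * 8, w * 8)),
]

shape = (16, 32, 32)
for name, step in stages:
    shape = step(*shape)
    print(f"{name:26s} -> {shape}")
# final shape: (64, 512, 512)
```

Only the last stage works at full pixel resolution, which is where most of the cost would otherwise be concentrated.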

Video Diffusion Loss

The basic noise prediction loss remains:

$$ \mathcal{L} = \mathbb{E} \left[ \left\| \epsilon-\epsilon_\theta(z_t,t,c) \right\|_2^2 \right]. $$

For video tensors, the squared error is averaged over the batch as well as over channels, frames, height, and width.

In PyTorch:

pred_noise = model(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)
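
The snippet above relies on the default `reduction="mean"` of `mse_loss`, which averages over every element of the tensor. The following sketch checks that against a manual mean on random tensors, and also shows how a per-sample loss could be kept if per-timestep weighting is needed (that use is an assumption, not something the text requires).

```python
import torch
import torch.nn.functional as F

pred_noise = torch.randn(2, 4, 8, 16, 16)
noise = torch.randn(2, 4, 8, 16, 16)

loss = F.mse_loss(pred_noise, noise)         # default reduction="mean"
manual = ((pred_noise - noise) ** 2).mean()  # mean over B, C, F, H, W
per_sample = F.mse_loss(pred_noise, noise, reduction="none").mean(dim=(1, 2, 3, 4))

print(torch.allclose(loss, manual))  # True
print(per_sample.shape)              # torch.Size([2])
```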

Additional losses may encourage temporal smoothness:

| Loss | Purpose |
| --- | --- |
| Optical flow consistency | Preserve motion |
| Perceptual loss | Improve visual quality |
| Temporal adversarial loss | Reduce flicker |
| Frame interpolation loss | Improve smooth transitions |

Many modern systems rely primarily on diffusion loss plus strong temporal architecture rather than explicit handcrafted temporal losses.

Efficient Training Techniques

Video diffusion is memory-intensive. Common efficiency techniques include:

| Technique | Purpose |
| --- | --- |
| Latent diffusion | Reduce spatial size |
| Mixed precision | Lower memory use |
| Gradient checkpointing | Trade compute for memory |
| Factorized attention | Reduce attention cost |
| Frame subsampling | Reduce temporal length |
| Distributed training | Scale across GPUs |
| Low-rank adaptation | Fine-tune cheaply |
| Model distillation | Reduce inference steps |

Training often uses small clips first, then increases resolution or duration during later stages.
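
Two of these techniques are easy to show in isolation. The sketch below applies frame subsampling by strided slicing and gradient checkpointing via `torch.utils.checkpoint`. The small convolutional `block` is a stand-in for a real denoiser block, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(
    nn.Conv3d(4, 16, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv3d(16, 4, kernel_size=3, padding=1),
)

z = torch.randn(1, 4, 8, 16, 16, requires_grad=True)

# Frame subsampling: keep every second frame to halve the temporal length.
subsampled = z[:, :, ::2]
print(subsampled.shape)  # torch.Size([1, 4, 4, 16, 16])

# Gradient checkpointing: activations inside `block` are recomputed during
# the backward pass instead of being stored after the forward pass.
out = checkpoint(block, z, use_reentrant=False)
out.mean().backward()
print(z.grad.shape)  # torch.Size([1, 4, 8, 16, 16])
```

Checkpointing leaves the result unchanged; it only changes when intermediate activations are computed.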

Common Failure Modes

Video diffusion has characteristic failures.

| Failure mode | Description |
| --- | --- |
| Flicker | Frame-to-frame appearance changes |
| Identity drift | Subject changes over time |
| Geometry collapse | Shapes deform implausibly |
| Motion blur | Weak temporal detail |
| Frozen motion | Image changes too little |
| Prompt drift | Video stops following prompt |
| Scene cuts | Abrupt unintended transitions |
| Texture swimming | Surface textures move incorrectly |

These failures reflect the difficulty of modeling coherent 4D structure: time plus 3D appearance.

PyTorch Shape Example

A minimal video diffusion training step has the same structure as image diffusion.

import torch
import torch.nn.functional as F

def video_diffusion_loss(model, video, text_embeddings, schedule):
    """
    video: [B, C, F, H, W]
    text_embeddings: [B, T_text, D]
    schedule: object exposing num_steps and q_sample(x0, t, noise)
    """
    batch_size = video.shape[0]
    device = video.device

    # Sample one timestep per video in the batch
    t = torch.randint(0, schedule.num_steps, (batch_size,), device=device)

    # Noise the clean video to timestep t
    noise = torch.randn_like(video)
    x_t = schedule.q_sample(video, t, noise)

    # Predict the noise, conditioned on the text embeddings
    pred_noise = model(x_t, t, encoder_hidden_states=text_embeddings)

    return F.mse_loss(pred_noise, noise)

For latent video diffusion, replace video with a latent tensor:

latents.shape
# torch.Size([B, 4, F, H_z, W_z])

The training loop remains the same.
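
To see one optimization step run end to end, the loss can be paired with toy components. Everything in this sketch is a placeholder: `ToySchedule` and `ToyDenoiser` are hypothetical classes written only to match the interface used above, the linear beta schedule is an assumption, and the model ignores the timestep and text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySchedule:
    """Hypothetical schedule exposing `num_steps` and `q_sample`."""
    def __init__(self, num_steps=1000):
        self.num_steps = num_steps
        betas = torch.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
        self.alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    def q_sample(self, x0, t, noise):
        a = self.alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1, 1)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

class ToyDenoiser(nn.Module):
    """Stand-in model: ignores t and the text, but keeps the call signature."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t, t, encoder_hidden_states=None):
        return self.net(x_t)

model, schedule = ToyDenoiser(), ToySchedule()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

latents = torch.randn(2, 4, 8, 16, 16)   # [B, C_z, F, H_z, W_z]
text_embeddings = torch.randn(2, 7, 32)  # [B, T_text, D]

# One training step: sample timesteps, add noise, predict it, update weights.
t = torch.randint(0, schedule.num_steps, (latents.shape[0],))
noise = torch.randn_like(latents)
z_t = schedule.q_sample(latents, t, noise)
loss = F.mse_loss(model(z_t, t, encoder_hidden_states=text_embeddings), noise)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item() > 0)  # True
```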

Relationship to World Models

Video generation is related to world modeling. A video model must learn how scenes evolve.

However, text-to-video diffusion models are usually generative simulators rather than explicit physical simulators. They can learn common motion patterns, but they may violate physical consistency.

For example, a model may generate plausible waves, walking, or camera pans, but still fail at:

| Physical property | Possible failure |
| --- | --- |
| Object permanence | Objects disappear |
| Conservation | Objects change mass or shape |
| Contact dynamics | Hands pass through objects |
| Causality | Effects precede causes |
| Long-horizon planning | Actions lose coherence |

This makes video diffusion useful for synthesis, editing, and design, but limited as a precise simulator.

Summary

Video diffusion extends diffusion models from images to frame sequences. The forward process adds Gaussian noise to video tensors or latent video tensors. The reverse model learns to denoise while preserving both spatial quality and temporal coherence.

The main architectural challenge is temporal modeling. Systems use 3D convolutions, temporal attention, factorized attention, motion modules, and multi-stage generation pipelines to produce coherent motion.

Video diffusion is computationally expensive because it models space and time together. Latent representations, efficient attention, short clips, and staged generation make the problem more tractable.