Video Diffusion Systems

Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time.

A video sample can be represented as a tensor:

$$ x_0 \in \mathbb{R}^{B \times C \times F \times H \times W} $$

where $B$ is batch size, $C$ is channels, $F$ is the number of frames, $H$ is height, and $W$ is width.

For example:

video = torch.randn(2, 3, 16, 256, 256)

This represents a batch of 2 videos, each with 3 color channels, 16 frames, and spatial resolution $256 \times 256$.

From Image Generation to Video Generation

Image diffusion models learn a distribution over images:

$$ p_\theta(x) $$

Video diffusion models learn a distribution over frame sequences:

$$ p_\theta(x_{1:F}) $$

where $x_{1:F}$ denotes all frames in the video.

The added difficulty is temporal coherence. A good video model must satisfy both spatial and temporal constraints.

Requirement	Meaning
Spatial quality	Each frame should look realistic
Temporal coherence	Objects should remain consistent across frames
Motion realism	Movement should follow plausible dynamics
Long-range consistency	Scene identity should persist over time
Prompt alignment	Video should match the text or conditioning input

A model that generates good individual frames may still fail as a video model if objects flicker, identities change, or motion appears unstable.

Forward Diffusion for Video

The forward diffusion process is the same as in image diffusion, but it is applied to video tensors:

$$ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon $$

where

$$ \epsilon \sim \mathcal{N}(0,I) $$

has the same shape as the video.

In PyTorch:

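import torch

def q_sample_video(x0, t, alpha_bars):
    # x0: [B, C, F, H, W]; t: [B] integer timesteps; alpha_bars: [T]
    noise = torch.randn_like(x0)

    # Pick one alpha_bar per sample and reshape it to broadcast over C, F, H, W.
    alpha_bar_t = alpha_bars[t].view(-1, 1, 1, 1, 1)

    xt = (
        torch.sqrt(alpha_bar_t) * x0
        + torch.sqrt(1.0 - alpha_bar_t) * noise
    )

    return xt, noise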

If x0 has shape [B, C, F, H, W], then xt and noise have the same shape.
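To make the role of $\bar{\alpha}_t$ concrete, the sketch below builds a cumulative schedule in plain Python and prints the signal and noise scales at a few timesteps. The linear $\beta$ range (1e-4 to 0.02) and the 1000-step length are assumptions borrowed from common image-diffusion setups, not values given in this text, and `make_alpha_bars` is a hypothetical helper name.

```python
import math

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

alpha_bars = make_alpha_bars()

for t in (0, 250, 500, 999):
    signal_scale = math.sqrt(alpha_bars[t])       # multiplies x0
    noise_scale = math.sqrt(1.0 - alpha_bars[t])  # multiplies epsilon
    print(f"t={t:4d}  signal={signal_scale:.3f}  noise={noise_scale:.3f}")
```

Early timesteps keep the video almost intact, while the last timesteps are close to pure noise. The squared signal and noise scales always sum to one.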

The mathematics is unchanged. The challenge lies in the denoising network, which must model correlations across both space and time.

Latent Video Diffusion

Pixel-space video diffusion is extremely expensive. A short video may contain dozens of high-resolution frames.

For example, a 16-frame RGB video at $512 \times 512$ resolution contains:

$$ 16 \times 3 \times 512 \times 512 = 12{,}582{,}912 $$

values per sample.

Latent video diffusion reduces this cost by encoding frames into latent representations.

An image autoencoder maps each frame into a latent:

$$ z_0 = \mathcal{E}(x_0) $$

For video, the latent tensor may have shape:

[B, C_z, F, H_z, W_z]

For example:

latents = torch.randn(2, 4, 16, 64, 64)

Diffusion then operates on latent videos:

$$ z_t = \sqrt{\bar{\alpha}_t}z_0 + \sqrt{1-\bar{\alpha}_t}\epsilon. $$

After denoising, a decoder converts latent frames back to pixel frames.
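One way to apply an image encoder to a video is to fold the frame axis into the batch axis, encode every frame, and unfold the result. The sketch below illustrates only the reshaping. The `toy_encoder` is a stand-in (a single strided convolution with 8x spatial downsampling and 4 latent channels), not a trained autoencoder, and `encode_video` is a hypothetical helper.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained image encoder (assumption): one strided
# convolution that downsamples 8x spatially and outputs 4 latent channels.
toy_encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)

def encode_video(video, encoder):
    """Encode each frame independently. video: [B, C, F, H, W]."""
    b, c, f, h, w = video.shape
    # Fold frames into the batch axis so a 2D encoder can process them.
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
    z = encoder(frames)                                   # [B*F, C_z, H_z, W_z]
    # Unfold back to a video-shaped latent tensor.
    z = z.reshape(b, f, *z.shape[1:]).permute(0, 2, 1, 3, 4)
    return z                                              # [B, C_z, F, H_z, W_z]

video = torch.randn(1, 3, 16, 256, 256)
with torch.no_grad():
    latents = encode_video(video, toy_encoder)

print(latents.shape)                      # torch.Size([1, 4, 16, 32, 32])
print(video.numel() // latents.numel())   # 48
```

With these assumed settings the latent video holds 48 times fewer values than the pixel video, which is where most of the savings come from.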

Temporal Modeling

A video diffusion model needs temporal structure. The model must understand how information changes across frames.

Common temporal modeling methods include:

Method	Description
3D convolutions	Apply convolution over time, height, and width
Temporal attention	Let frames attend to other frames
Factorized attention	Separate spatial attention and temporal attention
Recurrent state	Carry information between frames
Motion modules	Add temporal layers to an image diffusion backbone
Transformer blocks	Model spatiotemporal token sequences

The simplest extension is a 3D U-Net, which replaces 2D convolution layers with 3D convolution layers so that activations gain a frame axis:

$$ [B, C, H, W] \rightarrow [B, C, F, H, W]. $$

However, full 3D computation is costly. Many modern systems instead start from a strong image diffusion model and add temporal modules.
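The difference between the two layer types can be seen directly from tensor shapes and weight counts. The channel sizes and kernel size below are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
conv3d = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=3, padding=1)

image_latent = torch.randn(2, 4, 32, 32)        # [B, C, H, W]
video_latent = torch.randn(2, 4, 16, 32, 32)    # [B, C, F, H, W]

print(conv2d(image_latent).shape)   # torch.Size([2, 8, 32, 32])
print(conv3d(video_latent).shape)   # torch.Size([2, 8, 16, 32, 32])

# A 3x3x3 kernel has three times the weights of a 3x3 kernel.
print(conv2d.weight.numel(), conv3d.weight.numel())   # 288 864
```

The 3D layer mixes information across neighboring frames, but every layer pays for the extra kernel depth and for the larger activation tensor.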

Factorized Space-Time Attention

Full attention over video tokens is expensive.

Suppose a latent video has shape:

$$ F \times H \times W $$

With one token per latent position, the number of tokens is:

$$ N = FHW. $$

Full self-attention has cost:

$$ O(N^2). $$

For video, this becomes expensive quickly.

Factorized attention reduces cost by separating spatial and temporal attention.

Spatial attention attends within each frame:

$$ O(F(HW)^2) $$

Temporal attention attends across frames at each spatial location:

$$ O(HW F^2) $$

Their sum, $O(FHW(HW + F))$, is much smaller than the full-attention cost:

$$ O((FHW)^2). $$
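For a sense of scale, the three costs can be compared numerically. The sketch counts pairwise token interactions only and ignores the feature dimension and constant factors. The 16-frame, $32 \times 32$ latent size is an illustrative assumption.

```python
def attention_costs(frames, height, width):
    """Count pairwise token interactions, ignoring the feature dimension."""
    spatial_tokens = height * width
    full = (frames * spatial_tokens) ** 2          # every token attends to every token
    spatial = frames * spatial_tokens ** 2         # attention within each frame
    temporal = spatial_tokens * frames ** 2        # attention across frames per location
    return full, spatial, temporal

full, spatial, temporal = attention_costs(frames=16, height=32, width=32)

print(full)                          # 268435456
print(spatial + temporal)            # 17039360
print(full / (spatial + temporal))   # roughly 15.75
```

For this size, factorized attention needs about 16 times fewer interactions, and the gap widens as the number of frames grows.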

The model can first learn spatial relationships within each frame, then learn how features move over time.
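The two passes differ only in how the token tensor is reshaped before attention. The following is a minimal sketch under simplifying assumptions: attention is single-head and has no learned projections, and `factorized_attention` is a hypothetical function, not a layer from any particular library.

```python
import torch

def self_attention(tokens):
    """Plain single-head self-attention without learned projections.
    tokens: [batch, length, dim]."""
    scores = tokens @ tokens.transpose(1, 2) / tokens.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ tokens

def factorized_attention(x):
    """x: [B, F, H, W, D] token features."""
    b, f, h, w, d = x.shape
    # Spatial pass: each frame is its own sequence of H*W tokens.
    x = self_attention(x.reshape(b * f, h * w, d)).reshape(b, f, h, w, d)
    # Temporal pass: each spatial location is its own sequence of F tokens.
    x = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, f, d)
    x = self_attention(x).reshape(b, h, w, f, d).permute(0, 3, 1, 2, 4)
    return x

x = torch.randn(2, 8, 16, 16, 32)
out = factorized_attention(x)
print(out.shape)   # torch.Size([2, 8, 16, 16, 32])
```

The largest attention matrix here is $HW \times HW$ in the spatial pass, rather than $FHW \times FHW$ as in full attention.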

Text-to-Video Conditioning

Text-to-video generation conditions the reverse process on a prompt:

$$ p_\theta(z_{t-1}\mid z_t, c) $$

where

$$ c = \mathrm{TextEncoder}(y). $$

The prompt may specify:

Prompt element	Video effect
Subject	What appears
Action	What moves
Scene	Where it happens
Camera motion	How viewpoint changes
Style	Visual appearance
Duration hints	Event structure

Examples:

"A panda surfing on a wave, cinematic lighting"

"A drone shot flying over a futuristic city at sunset"

Text conditioning is usually injected using cross-attention, as in text-to-image systems.
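A minimal cross-attention layer can illustrate the mechanism: video tokens act as queries, and text tokens act as keys and values. `TextCrossAttention` is a hypothetical module written for this example, and the dimensions are arbitrary. Real systems add normalization, feed-forward layers, and timestep conditioning around it.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Video tokens query the text tokens (hypothetical minimal layer)."""

    def __init__(self, video_dim, text_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=video_dim,
            num_heads=num_heads,
            kdim=text_dim,
            vdim=text_dim,
            batch_first=True,
        )

    def forward(self, video_tokens, text_embeddings):
        # video_tokens: [B, F*H*W, video_dim]
        # text_embeddings: [B, T_text, text_dim]
        attended, _ = self.attn(
            query=video_tokens, key=text_embeddings, value=text_embeddings
        )
        return video_tokens + attended   # residual connection

b, f, h, w = 2, 8, 8, 8
video_tokens = torch.randn(b, f * h * w, 64)
text_embeddings = torch.randn(b, 12, 96)

layer = TextCrossAttention(video_dim=64, text_dim=96)
out = layer(video_tokens, text_embeddings)
print(out.shape)   # torch.Size([2, 512, 64])
```

The output keeps the shape of the video tokens, so the layer can be inserted between existing spatial and temporal blocks.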

Image-to-Video Generation

Image-to-video models start from a still image and generate motion.

Given an image $x_\text{ref}$, the model learns:

$$ p_\theta(x_{1:F}\mid x_\text{ref}) $$

The generated video should stay consistent with the reference image in its first frame, overall appearance, or subject identity.

Common conditioning methods include:

Conditioning method	Purpose
Reference image embedding	Preserve identity and style
First-frame conditioning	Anchor the generated video
Depth or pose control	Guide motion geometry
Optical flow hints	Guide frame-to-frame movement
Camera trajectory	Control viewpoint changes

Image-to-video is often easier than pure text-to-video because the model receives concrete visual structure at the start.
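One possible way to implement first-frame conditioning, shown here as an assumption rather than a description of any specific system, is to concatenate the reference latent and a binary mask to the noisy latents along the channel axis. `add_first_frame_condition` is a hypothetical helper, and the denoiser's first layer would need to accept the extra channels.

```python
import torch

def add_first_frame_condition(noisy_latents, ref_latent):
    """
    noisy_latents: [B, C, F, H, W]  noisy video latents z_t
    ref_latent:    [B, C, H, W]     clean latent of the reference image
    returns:       [B, 2*C + 1, F, H, W]
    """
    b, c, f, h, w = noisy_latents.shape
    # Place the reference latent at frame 0 and zeros elsewhere.
    cond = torch.zeros_like(noisy_latents)
    cond[:, :, 0] = ref_latent
    # Binary mask marking which frames carry conditioning information.
    mask = torch.zeros(
        b, 1, f, h, w, dtype=noisy_latents.dtype, device=noisy_latents.device
    )
    mask[:, :, 0] = 1.0
    return torch.cat([noisy_latents, cond, mask], dim=1)

z_t = torch.randn(2, 4, 16, 32, 32)
ref = torch.randn(2, 4, 32, 32)
model_input = add_first_frame_condition(z_t, ref)
print(model_input.shape)   # torch.Size([2, 9, 16, 32, 32])
```

The mask lets the network distinguish a real conditioning frame from the zero padding in the remaining frames.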

Motion Consistency

Motion consistency is the central problem in video generation.

A model must preserve:

Consistency type	Example
Object identity	Same person or object across frames
Geometry	Stable shape and viewpoint
Texture	Clothing, fur, material consistency
Lighting	Stable illumination
Background	Scene remains coherent
Camera motion	Smooth viewpoint movement

Without temporal modeling, an image diffusion model applied independently to each frame produces flicker. Each frame may look plausible, but the sequence fails as video.

Temporal layers reduce flicker by sharing information across frames.
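A crude way to see flicker numerically is to measure the mean absolute change between consecutive frames. This is only a rough proxy, since genuine motion also raises the value, and `mean_frame_difference` is a hypothetical helper. The example compares a static clip with a clip whose frames are sampled independently, which mimics per-frame generation.

```python
import torch

def mean_frame_difference(video):
    """Mean absolute change between consecutive frames. video: [B, C, F, H, W]."""
    diffs = video[:, :, 1:] - video[:, :, :-1]
    return diffs.abs().mean()

torch.manual_seed(0)
base_frame = torch.rand(1, 3, 1, 32, 32)

# A static clip: the same frame repeated 16 times.
static_clip = base_frame.repeat(1, 1, 16, 1, 1)
# Independently sampled frames, mimicking per-frame generation.
independent_clip = torch.rand(1, 3, 16, 32, 32)

print(mean_frame_difference(static_clip))        # tensor(0.)
print(mean_frame_difference(independent_clip))   # about 1/3 for uniform noise
```

A coherent video with moderate motion should fall between these two extremes.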

Training Data for Video Diffusion

Video diffusion models require large datasets of video clips.

Training data usually includes:

Data	Use
Video frames	Visual supervision
Captions	Text conditioning
Timestamps	Temporal order
Motion metadata	Optional control
Audio	Optional multimodal conditioning

Video data is harder to curate than image data because it has more failure modes:

Issue	Effect
Low resolution	Weak visual detail
Compression artifacts	Learned artifacts
Watermarks	Undesired generations
Poor captions	Weak prompt alignment
Scene cuts	Broken temporal continuity
Camera shake	Noisy motion patterns

Good video training data should contain coherent clips, accurate captions, and diverse motion.

Frame Rate and Duration

Video models must choose a frame rate and duration.

A model may generate:

16 frames at 8 fps

which gives 2 seconds of video.

Or:

64 frames at 24 fps

which gives about 2.67 seconds of video.

For a fixed duration, a higher frame rate improves smoothness but requires more frames and therefore more compute. A longer duration makes the clip more useful but makes long-range consistency harder.

Design choice	Tradeoff
More frames	Longer or smoother clips, higher compute
Higher resolution	Better quality, higher memory
Higher fps	Smoother motion, harder modeling
Longer clips	Better storytelling, harder consistency
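The relationship between frame count, frame rate, and duration is simple arithmetic. The helper names below are hypothetical.

```python
def clip_duration_seconds(num_frames, fps):
    """Duration of a clip with the given frame count and frame rate."""
    return num_frames / fps

def frames_needed(duration_seconds, fps):
    """Frame count required to cover a duration at a given frame rate."""
    return round(duration_seconds * fps)

print(clip_duration_seconds(16, 8))              # 2.0
print(round(clip_duration_seconds(64, 24), 2))   # 2.67
print(frames_needed(2.0, 24))                    # 48
```

Keeping the 2-second duration but moving from 8 fps to 24 fps triples the frame count from 16 to 48, which is why frame rate and duration have to be budgeted together.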

Sliding Window Generation

Long video generation often uses sliding windows.

Instead of generating all frames at once, the model generates a short clip, then conditions the next clip on previous frames.

Example:

$$ x_{1:16} \rightarrow x_{9:24} \rightarrow x_{17:32} $$

The overlapping frames help maintain continuity.

However, errors can accumulate. If the generated video drifts, later windows may become inconsistent with earlier ones.
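The window layout in the example above can be computed with a short helper. The function name and its defaults (16-frame windows with 8 overlapping frames) are illustrative assumptions chosen to reproduce that example.

```python
def sliding_windows(total_frames, window=16, overlap=8):
    """Return 1-indexed inclusive (start, end) frame ranges."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    stride = window - overlap
    windows = []
    start = 1
    # Only emit windows that fit entirely inside the video.
    while start + window - 1 <= total_frames:
        windows.append((start, start + window - 1))
        start += stride
    return windows

print(sliding_windows(32))   # [(1, 16), (9, 24), (17, 32)]
```

Each window after the first reuses the last 8 frames of the previous window as conditioning and generates 8 new frames. If the total length does not align with the stride, the trailing frames are left uncovered and need separate handling.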

Multi-Stage Video Generation

Many systems use multiple stages.

A typical pipeline:

Generate low-resolution video latents
Upscale spatial resolution
Interpolate or refine frames
Apply temporal super-resolution
Decode to final video

This separates global motion from fine detail.

Stage	Goal
Base generation	Scene and motion
Spatial upsampling	Higher resolution
Temporal upsampling	More frames
Refinement	Remove artifacts

Multi-stage systems are easier to scale because each stage solves a narrower problem.
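The tensor shapes that flow between stages can be traced with plain interpolation. In the sketch below, `F.interpolate` is only a stand-in for the learned spatial and temporal upsamplers, and the sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

base = torch.randn(1, 4, 8, 16, 16)   # [B, C_z, F, H_z, W_z] from the base stage

# Spatial upsampling: keep the frame count, double height and width.
spatial_up = F.interpolate(
    base, scale_factor=(1.0, 2.0, 2.0), mode="trilinear", align_corners=False
)

# Temporal upsampling: double the frame count, keep height and width.
temporal_up = F.interpolate(
    spatial_up, scale_factor=(2.0, 1.0, 1.0), mode="trilinear", align_corners=False
)

print(base.shape)          # torch.Size([1, 4, 8, 16, 16])
print(spatial_up.shape)    # torch.Size([1, 4, 8, 32, 32])
print(temporal_up.shape)   # torch.Size([1, 4, 16, 32, 32])
```

The base stage works on the smallest tensor, so the expensive decisions about scene and motion are made where compute is cheapest.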

Video Diffusion Loss

The basic noise prediction loss remains:

$$ \mathcal{L} = \mathbb{E} \left[ \left\| \epsilon-\epsilon_\theta(z_t,t,c) \right\|_2^2 \right]. $$

For video tensors, the default mean reduction averages this loss over the batch, channels, frames, height, and width.

In PyTorch:

pred_noise = model(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)
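The effect of the reduction can be checked with random tensors standing in for the model output and the target noise. The shapes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred_noise = torch.randn(2, 4, 8, 16, 16)   # stand-in for the model output
noise = torch.randn(2, 4, 8, 16, 16)

# Default reduction: one scalar averaged over every element.
loss = F.mse_loss(pred_noise, noise)

# Per-sample view: average over channels, frames, height, and width only.
per_sample = F.mse_loss(pred_noise, noise, reduction="none").mean(dim=(1, 2, 3, 4))

print(per_sample.shape)                          # torch.Size([2])
print(torch.allclose(loss, per_sample.mean()))   # True
```

The per-sample form is useful when losses need to be weighted by timestep before averaging over the batch.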

Additional losses may encourage temporal smoothness:

Loss	Purpose
Optical flow consistency	Preserve motion
Perceptual loss	Improve visual quality
Temporal adversarial loss	Reduce flicker
Frame interpolation loss	Improve smooth transitions

Many modern systems rely primarily on diffusion loss plus strong temporal architecture rather than explicit handcrafted temporal losses.

Efficient Training Techniques

Video diffusion is memory-intensive. Common efficiency techniques include:

Technique	Purpose
Latent diffusion	Reduce spatial size
Mixed precision	Lower memory use
Gradient checkpointing	Trade compute for memory
Factorized attention	Reduce attention cost
Frame subsampling	Reduce temporal length
Distributed training	Scale across GPUs
Low-rank adaptation	Fine-tune cheaply
Model distillation	Reduce inference steps

Training often uses small clips first, then increases resolution or duration during later stages.
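As one example from the table, gradient checkpointing can be applied to a block with `torch.utils.checkpoint`. The small convolutional block below is a placeholder for a real denoiser block.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(
    nn.Conv3d(4, 16, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv3d(16, 4, kernel_size=3, padding=1),
)

x = torch.randn(1, 4, 8, 16, 16, requires_grad=True)

# Standard forward: intermediate activations are kept for the backward pass.
out_plain = block(x)

# Checkpointed forward: activations inside `block` are recomputed during
# backward instead of being stored.
out_ckpt = checkpoint(block, x, use_reentrant=False)

print(torch.allclose(out_plain, out_ckpt))   # True
out_ckpt.mean().backward()
print(x.grad.shape)                          # torch.Size([1, 4, 8, 16, 16])
```

The outputs and gradients are unchanged. Only the memory and compute profile differs, which matters most for long clips with large activation tensors.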

Common Failure Modes

Video diffusion has characteristic failures.

Failure mode	Description
Flicker	Frame-to-frame appearance changes
Identity drift	Subject changes over time
Geometry collapse	Shapes deform implausibly
Motion blur	Weak temporal detail
Frozen motion	Image changes too little
Prompt drift	Video stops following prompt
Scene cuts	Abrupt unintended transitions
Texture swimming	Surface textures move incorrectly

These failures reflect the difficulty of modeling coherent 4D structure: a 3D scene evolving over time.

PyTorch Shape Example

A minimal video diffusion training step has the same structure as image diffusion.

import torch

def video_diffusion_loss(model, video, text_embeddings, schedule):
    """
    video: [B, C, F, H, W]
    text_embeddings: [B, T_text, D]
    schedule: object providing num_steps and q_sample(x0, t, noise)
    """
    batch_size = video.shape[0]
    device = video.device

    t = torch.randint(
        0,
        schedule.num_steps,
        (batch_size,),
        device=device,
    )

    noise = torch.randn_like(video)
    x_t = schedule.q_sample(video, t, noise)

    pred_noise = model(
        x_t,
        t,
        encoder_hidden_states=text_embeddings,
    )

    return torch.nn.functional.mse_loss(pred_noise, noise)

For latent video diffusion, replace video with a latent tensor:

latents.shape
# torch.Size([B, 4, F, H_z, W_z])

The training loop remains the same.
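To show one optimizer step end to end, the sketch below uses toy stand-ins. `ToySchedule` and `ToyDenoiser` are hypothetical classes written for this example, the linear beta schedule is an assumption, and the denoiser ignores the timestep and text inputs for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySchedule:
    """Hypothetical schedule object exposing num_steps and q_sample."""

    def __init__(self, num_steps=1000):
        self.num_steps = num_steps
        betas = torch.linspace(1e-4, 0.02, num_steps)   # assumed linear schedule
        self.alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    def q_sample(self, x0, t, noise):
        # One alpha_bar per sample, broadcast over C, F, H, W.
        a = self.alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1, 1)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

class ToyDenoiser(nn.Module):
    """Single 3D convolution; ignores timestep and text for brevity."""

    def __init__(self, channels=4):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t, t, encoder_hidden_states=None):
        return self.conv(x_t)

schedule = ToySchedule()
model = ToyDenoiser()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

latents = torch.randn(2, 4, 8, 16, 16)      # [B, C_z, F, H_z, W_z]
text_embeddings = torch.randn(2, 12, 32)    # [B, T_text, D]

# One training step on latent video.
t = torch.randint(0, schedule.num_steps, (latents.shape[0],))
noise = torch.randn_like(latents)
z_t = schedule.q_sample(latents, t, noise)
pred_noise = model(z_t, t, encoder_hidden_states=text_embeddings)
loss = F.mse_loss(pred_noise, noise)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

A real denoiser would embed `t`, attend to `text_embeddings`, and include temporal layers, but the surrounding step would keep this structure.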

Relationship to World Models

Video generation is related to world modeling. A video model must learn how scenes evolve.

However, text-to-video diffusion models are usually generative simulators rather than explicit physical simulators. They can learn common motion patterns, but they may violate physical consistency.

For example, a model may generate plausible waves, walking, or camera pans, but still fail at:

Physical property	Possible failure
Object permanence	Objects disappear
Conservation	Objects change mass or shape
Contact dynamics	Hands pass through objects
Causality	Effects precede causes
Long-horizon planning	Actions lose coherence

This makes video diffusion useful for synthesis, editing, and design, but limited as a precise simulator.

Summary

Video diffusion extends diffusion models from images to frame sequences. The forward process adds Gaussian noise to video tensors or latent video tensors. The reverse model learns to denoise while preserving both spatial quality and temporal coherence.

The main architectural challenge is temporal modeling. Systems use 3D convolutions, temporal attention, factorized attention, motion modules, and multi-stage generation pipelines to produce coherent motion.

Video diffusion is computationally expensive because it models space and time together. Latent representations, efficient attention, short clips, and staged generation make the problem more tractable.