Latent Diffusion Models

High-resolution image generation by diffusing in learned latent spaces

Latent Diffusion Models (LDMs) make high-resolution image generation practical by running the diffusion process in a compressed latent space rather than pixel space. This is the architecture behind Stable Diffusion.

The Problem with Pixel-Space Diffusion

Diffusing a 512×512×3 image requires processing 786,432 dimensions per step. For thousands of denoising steps, this is extremely expensive.

The Solution: Compress First

LDMs use a three-step pipeline:

  1. Encode: Compress images to a smaller latent space
  2. Diffuse: Run diffusion in latent space
  3. Decode: Reconstruct to pixel space
$$z = \mathcal{E}(x), \qquad x' = \mathcal{D}(\hat{z})$$

where $\mathcal{E}$ is the encoder, $\mathcal{D}$ is the decoder, and $\hat{z}$ is the denoised latent.
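The encode-diffuse-decode flow can be sketched in NumPy. The `encode`/`decode` functions here are toy stand-ins (average pooling and nearest-neighbour upsampling) for the learned VAE, used only to show the shapes moving through the pipeline:

```python
import numpy as np

def encode(x, f=8):
    """Toy stand-in for the VAE encoder E: average-pool by factor f.
    (The real E is a learned convolutional network.)"""
    h, w, c = x.shape
    return x.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def decode(z, f=8):
    """Toy stand-in for the VAE decoder D: nearest-neighbour upsample."""
    return z.repeat(f, axis=0).repeat(f, axis=1)

x = np.random.rand(512, 512, 3)   # pixel-space image
z = encode(x)                     # z = E(x): the latent
# ... the diffusion model would denoise a noisy z here ...
x_hat = decode(z)                 # x' = D(z_hat)
print(z.shape, x_hat.shape)       # (64, 64, 3) (512, 512, 3)
```

The diffusion model only ever sees the 64×64 latent, never the 512×512 pixels.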

The Autoencoder

Typically a VQ-VAE or KL-VAE trained separately:

$$\mathcal{L}_{AE} = \|x - \mathcal{D}(\mathcal{E}(x))\|^2 + \lambda \cdot \text{Reg}(\mathcal{E}(x))$$

The latent space is typically downsampled 8× or 4× per spatial dimension (64× or 16× fewer spatial positions).
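A minimal sketch of this loss, assuming a toy regulariser that simply pulls latents toward a standard normal (a real KL-VAE computes the KL divergence from predicted mean/log-variance instead):

```python
import numpy as np

def ae_loss(x, x_rec, z, lam=1e-6):
    """L_AE = ||x - D(E(x))||^2 + lambda * Reg(E(x)).
    Reg here is a crude prior-matching penalty on latent samples z,
    not the exact KL term a real KL-VAE would use."""
    rec = np.mean((x - x_rec) ** 2)   # reconstruction term
    reg = np.mean(z ** 2)             # stand-in regulariser on the latent
    return rec + lam * reg

x = np.random.rand(8, 8, 3)               # a tiny "image"
z = np.random.randn(2, 2, 4)              # its latent (made up here)
loss_perfect = ae_loss(x, x.copy(), z)    # only the tiny reg term remains
loss_bad = ae_loss(x, np.zeros_like(x), z)
```

The small `lam` reflects the paper's design choice: regularise only lightly so the latent stays information-rich for the diffusion stage.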

Diffusion in Latent Space

Standard diffusion objective, but on latents:

$$\mathcal{L}_{LDM} = \mathbb{E}_{z, \epsilon, t}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|^2\right]$$

where $c$ is the conditioning (text, class, etc.).
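One Monte-Carlo sample of this objective can be sketched as follows, assuming a DDPM-style linear beta schedule and a zero placeholder for the U-Net $\epsilon_\theta$ (both are illustrative assumptions, not Stable Diffusion's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_theta(z_t, t, c):
    """Placeholder denoiser; the real one is a conditional U-Net."""
    return np.zeros_like(z_t)

def ldm_loss(z, c, alpha_bar):
    """One sample of L_LDM: noise a latent to timestep t, then score
    the noise prediction with a mean-squared error."""
    t = rng.integers(len(alpha_bar))            # sample a timestep
    eps = rng.standard_normal(z.shape)          # sample Gaussian noise
    z_t = np.sqrt(alpha_bar[t]) * z + np.sqrt(1 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(z_t, t, c)) ** 2)

# DDPM-style cumulative noise schedule (assumed, for illustration)
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))
z = rng.standard_normal((64, 64, 4))            # a latent, not pixels
loss = ldm_loss(z, c=None, alpha_bar=alpha_bar)
print(round(loss, 3))
```

Note that `z` has latent shape 64×64×4: structurally this is exactly pixel-space DDPM training, just 48× cheaper per step.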

Latent Diffusion Pipeline

[Diagram: 512×512 image (786,432 dims) → VAE encoder → 64×64 latent (16,384 dims) → latent diffusion → VAE decoder → generated image. Pixel-space diffusion processes 786,432 dims at ~24GB of GPU memory; latent diffusion processes 16,384 dims at ~8GB (48× fewer dimensions).]

Conditioning via Cross-Attention

Text conditioning is injected through cross-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$

where:

  • $Q$ comes from the latent (image features)
  • $K, V$ come from the text encoder (CLIP or T5)
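A single-head sketch of this mechanism in NumPy. The projection matrices are random here (in the real model they are learned), and the 320-dim latent width and 77×768 text shape are assumptions matching typical Stable Diffusion shapes:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, text, d=64, seed=0):
    """Q from image latents, K/V from text tokens.
    Random projections stand in for learned ones."""
    rng = np.random.default_rng(seed)
    W_q = rng.standard_normal((latent.shape[-1], d))
    W_k = rng.standard_normal((text.shape[-1], d))
    W_v = rng.standard_normal((text.shape[-1], d))
    Q, K, V = latent @ W_q, text @ W_k, text @ W_v
    A = softmax(Q @ K.T / np.sqrt(d))   # (n_latent, n_text) attention weights
    return A @ V                        # text-informed image features

latent = np.random.rand(4096, 320)   # 64*64 latent positions (assumed width)
text = np.random.rand(77, 768)       # 77 CLIP tokens, 768-dim
out = cross_attention(latent, text)
print(out.shape)                     # (4096, 64)
```

Because $K$ and $V$ come from the text while $Q$ comes from the image, every latent position can pull in information from every text token.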

Architecture Overview

Text → CLIP → cross-attention conditioning
                     ↓
Noise → U-Net (in latent space) → Denoised Latent → Decoder → Image

Efficiency Gains

| Metric      | Pixel Diffusion | Latent Diffusion |
|-------------|-----------------|------------------|
| Dimensions  | 512×512×3       | 64×64×4          |
| Total dims  | 786,432         | 16,384           |
| Compression | 1×              | 48×              |
| GPU memory  | ~24GB           | ~8GB             |
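The dimension counts above are plain arithmetic:

```python
# Total dimensions for pixel-space vs latent-space diffusion.
pixel_dims = 512 * 512 * 3    # RGB image
latent_dims = 64 * 64 * 4     # Stable Diffusion latent
print(pixel_dims, latent_dims, pixel_dims // latent_dims)  # 786432 16384 48
```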

Stable Diffusion Specifics

  • VAE: KL-regularized autoencoder
  • U-Net: ~860M parameters
  • Text encoder: CLIP ViT-L/14 (frozen)
  • Latent size: 64×64×4 for 512×512 images

Training Pipeline

  1. Stage 1: Train VAE on images (reconstruction + regularization)
  2. Stage 2: Train U-Net on latents with frozen VAE
  3. Optional: Fine-tune on specific domains

Key Insight

By separating perceptual compression (autoencoder) from semantic generation (diffusion), LDMs achieve:

  • High-quality generation
  • Computational efficiency
  • Flexible conditioning