Diffusion Models

Generative models that learn to denoise, enabling high-quality image and video synthesis

Diffusion Models generate data by learning to reverse a gradual noising process. Starting from pure noise, they iteratively denoise to produce remarkably high-quality images, videos, and audio.

Core Idea

The process has two phases:

Forward process (fixed): Gradually add Gaussian noise to data over T steps:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)

Reverse process (learned): Denoise step by step:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

The Closed-Form Forward Process

We can jump directly to any timestep t:

q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)

where \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s).

This means: x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, where \epsilon \sim \mathcal{N}(0, I).
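The one-step jump is easy to check numerically. The sketch below uses NumPy with an assumed linear β schedule (the 1e-4 to 0.02 range follows the common DDPM choice; the schedule itself is an illustrative assumption, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule (an assumption; DDPM used 1e-4 to 0.02).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in a single step using the closed form."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((32, 32))   # stand-in "image"
eps = rng.standard_normal(x0.shape)
xt = q_sample(x0, t=500, eps=eps)    # a heavily noised version of x0
```

Because \bar{\alpha}_t shrinks toward 0 as t grows, x_t interpolates from the clean image toward pure Gaussian noise.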

Training Objective

Instead of predicting \mu_\theta, predict the noise \epsilon:

\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]

Sample t uniformly, noise the image via the closed-form forward process, and train the network to predict that noise. Remarkably simple, yet effective.
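A single draw of this objective can be sketched as follows; `eps_theta` is a stand-in for the real denoising network (in practice a U-Net), and the β schedule is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(xt, t):
    """Stand-in for the denoising network (a real model is a U-Net)."""
    return np.zeros_like(xt)            # placeholder prediction

def training_loss(x0):
    """One draw of L_simple = ||eps - eps_theta(x_t, t)||^2."""
    t = rng.integers(0, T)                                  # sample t uniformly
    eps = rng.standard_normal(x0.shape)                     # sample noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(xt, t)) ** 2)

loss = training_loss(rng.standard_normal((32, 32)))
```

In a real training loop this loss would be backpropagated through the network; here the placeholder predictor just returns zeros, so the loss is roughly the mean squared norm of the sampled noise.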

Sampling (Inference)

Start with x_T \sim \mathcal{N}(0, I), then iterate for t = T, \ldots, 1:

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right) + \sigma_t z

where \alpha_t = 1 - \beta_t, z \sim \mathcal{N}(0, I), and \sigma_t controls stochasticity (the noise term is dropped at the final step).
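The full loop can be sketched as below, again with a stand-in network and an assumed schedule; \sigma_t^2 = \beta_t is one common choice for the stochasticity term:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                   # few steps, for illustration only
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_theta(xt, t):
    return np.zeros_like(xt)             # stand-in for the trained network

def ddpm_sample(shape):
    x = rng.standard_normal(shape)       # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at last step
        sigma = np.sqrt(betas[t])        # common choice: sigma_t^2 = beta_t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_theta(x, t)) \
            / np.sqrt(alphas[t]) + sigma * z
    return x

sample = ddpm_sample((32, 32))
```

With a trained ε-predictor in place of the zero stand-in, this loop walks noise back to a data sample; with the stand-in it simply produces structured noise, which is enough to verify shapes and numerics.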

Classifier-Free Guidance

To improve sample quality, interpolate between conditional and unconditional predictions:

\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))

where w > 1 strengthens conditioning (e.g., on text prompts).
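The guidance rule itself is a one-liner. The sketch below assumes the conditional and unconditional ε predictions have already been computed (here they are dummy arrays); w = 7.5 is a typical guidance scale in text-to-image systems:

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Dummy predictions standing in for two forward passes of the same network,
# one with the conditioning c and one with it dropped (the "null" condition).
eps_u = np.zeros(4)
eps_c = np.ones(4)

assert np.allclose(guided_eps(eps_u, eps_c, 1.0), eps_c)  # w = 1: purely conditional
out = guided_eps(eps_u, eps_c, 7.5)                       # w > 1: stronger conditioning
```

In practice both predictions come from a single model trained with the conditioning randomly dropped, so no separate classifier is needed.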

Interactive Visualization

Watch the diffusion process in action: forward noising and reverse denoising.

[Interactive demo: a 20-step reverse process, denoising step by step from pure noise to an image.]

Training: Predict the noise ε at each step.
Inference: Start from pure noise, iteratively denoise.

Architecture: U-Net

The denoising network is typically a U-Net with:

  • Time embedding: Sinusoidal encoding of t
  • Attention layers: Self-attention and cross-attention (for conditioning)
  • Skip connections: Preserve spatial information
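The time embedding is typically the Transformer-style sinusoidal encoding; a minimal sketch (exact frequency conventions vary between implementations):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of timestep t (Transformer-style; a common choice)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to ~1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(t=500, dim=128)   # fed to the U-Net, usually via an MLP
```

The embedding is broadcast into every residual block so the network knows how noisy its input is at each step.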

Key Models

| Model | Innovation |
| --- | --- |
| DDPM | Foundational formulation |
| DDIM | Deterministic sampling, fewer steps |
| Stable Diffusion | Latent space diffusion |
| DALL-E 2 | Diffusion prior over CLIP latents |
| Imagen | Large language model conditioning |

Why Diffusion Works

Unlike GANs (adversarial, unstable) or VAEs (blurry), diffusion models:

  1. Have stable training (simple MSE loss)
  2. Produce high-fidelity samples
  3. Enable controllable generation
  4. Scale effectively with compute