Generative models that learn to denoise, enabling high-quality image and video synthesis
Diffusion Models generate data by learning to reverse a gradual noising process. Starting from pure noise, they iteratively denoise to produce remarkably high-quality images, videos, and audio.
## Core Idea
The process has two phases:
**Forward process** (fixed): Gradually add Gaussian noise to the data over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$

**Reverse process** (learned): Denoise step by step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
## The Closed-Form Forward Process
We can jump directly to any timestep $t$:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ and $\alpha_s = 1 - \beta_s$.

This means: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \mathbf{I})$.
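The closed-form jump above can be sketched in a few lines of numpy. The linear beta schedule (from $10^{-4}$ to $0.02$ over 1000 steps) matches the original DDPM setup; the function names here are illustrative, not from any particular library:

```python
import numpy as np

def noise_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns per-step betas and cumulative alpha-bar."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)  # bar{alpha}_t = prod_s (1 - beta_s)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to x_t: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps
```

Note that $\bar{\alpha}_T$ is driven close to zero, so $x_T$ is nearly pure Gaussian noise regardless of $x_0$.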
## Training Objective
Instead of predicting $x_0$ directly, the network $\varepsilon_\theta$ predicts the noise $\varepsilon$:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\!\left[\left\lVert \varepsilon - \varepsilon_\theta(x_t, t) \right\rVert^2\right]$$
Sample uniformly, noise the image, predict the noise. Remarkably simple yet effective.
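A single training step really is that short. The sketch below assumes `eps_model` is any callable standing in for the denoising network, and reuses the closed-form jump from the previous section:

```python
import numpy as np

def training_loss(eps_model, x0, alpha_bar, rng):
    """One DDPM training step: pick a random t, noise x0, regress the true noise."""
    t = rng.integers(len(alpha_bar))          # t ~ Uniform{0, ..., T-1}
    eps = rng.standard_normal(x0.shape)       # the target noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)  # simple MSE objective
```

In practice `eps_model` would be a neural network and this loss would be minimized by gradient descent over many batches.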
## Sampling (Inference)
Start with $x_T \sim \mathcal{N}(0, \mathbf{I})$, then iterate for $t = T, \dots, 1$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right) + \sigma_t z$$

where $z \sim \mathcal{N}(0, \mathbf{I})$ and $\sigma_t$ controls stochasticity.
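The full sampling loop, as a numpy sketch with the common DDPM choice $\sigma_t = \sqrt{\beta_t}$ (an assumption here; DDIM-style samplers set $\sigma_t = 0$). Again, `eps_model` is a placeholder for the trained network:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start from pure noise x_T and denoise down to x_0."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        mean = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps) \
               / np.sqrt(alphas[t])
        z = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on the final step
        x = mean + np.sqrt(betas[t]) * z                  # sigma_t = sqrt(beta_t)
    return x
```

Dropping the noise term at $t = 1$ is standard: the final step should return the mean estimate, not another noisy sample.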
## Classifier-Free Guidance
To improve sample quality, interpolate between conditional and unconditional noise predictions:

$$\tilde{\varepsilon}_\theta(x_t, c) = \varepsilon_\theta(x_t, \varnothing) + w\left(\varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \varnothing)\right)$$

where $w > 1$ strengthens the conditioning $c$ (e.g., on text prompts).
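The guidance formula is a one-liner. In this sketch, passing `None` as the condition stands in for the null token $\varnothing$ the model was trained with (an illustrative convention, not a fixed API):

```python
import numpy as np

def cfg_eps(eps_model, x_t, t, cond, w):
    """Guided noise estimate: push the unconditional prediction toward the conditional one."""
    eps_uncond = eps_model(x_t, t, None)   # None stands in for the null condition
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At $w = 0$ this recovers the unconditional model, at $w = 1$ the plain conditional model, and $w > 1$ extrapolates past it, trading diversity for prompt adherence.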
## Architecture: U-Net
The denoising network is typically a U-Net with:
- Time embedding: Sinusoidal encoding of the timestep $t$
- Attention layers: Self-attention and cross-attention (for conditioning)
- Skip connections: Preserve spatial information
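The time embedding in the list above follows the same recipe as Transformer positional encodings. A minimal sketch (the function name and 10000 base frequency are conventional choices, not mandated by any specific model):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal encoding of timestep t, mapped to a dim-sized vector."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequency ladder
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])        # shape: (dim,)
```

This vector is typically passed through a small MLP and injected into each U-Net block so the network knows how noisy its input is.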
## Key Models
| Model | Innovation |
|---|---|
| DDPM | Foundational formulation |
| DDIM | Deterministic sampling, fewer steps |
| Stable Diffusion | Latent space diffusion |
| DALL-E 2 | Diffusion prior over CLIP image embeddings |
| Imagen | Large language model conditioning |
## Why Diffusion Works

Unlike GANs (adversarial training, prone to instability and mode collapse) or VAEs (prone to blurry samples), diffusion models:
- Have stable training (simple MSE loss)
- Produce high-fidelity samples
- Enable controllable generation
- Scale effectively with compute