Multi-Scale Context Aggregation by Dilated Convolutions

Expanding receptive fields exponentially without losing resolution or adding parameters

Dilated convolutions (also called atrous convolutions) expand the receptive field without increasing the number of parameters or reducing spatial resolution; stacked with increasing dilation rates, the receptive field grows exponentially with depth. This paper introduced them for dense prediction tasks like semantic segmentation.

The Problem

Standard convolutions face a tradeoff:

  • Pooling increases receptive field but loses spatial resolution
  • Larger kernels increase receptive field but add parameters quadratically

For dense prediction (assigning a label to every pixel), we need both large context and high resolution.

Dilated Convolution

A dilated convolution with dilation factor $d$ samples the input at intervals of $d$:

$$(F *_d k)(p) = \sum_{s + dt = p} F(s) \cdot k(t)$$

A 3×3 kernel with dilation $d$ has a receptive field of $(2d+1) \times (2d+1)$ while using only 9 parameters.
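To make this concrete, here is a minimal sketch in PyTorch (an assumption for illustration; the paper predates it, but nn.Conv2d exposes a dilation argument directly). The parameter count stays at 9 for every dilation rate, while padding of $d$ keeps the output the same size as the input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)  # (batch, channels, height, width)

for d in (1, 2, 4, 8):
    # padding=d preserves spatial size for a 3x3 kernel with dilation d
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=d, padding=d, bias=False)
    y = conv(x)
    rf = 2 * d + 1  # receptive field of a single 3x3 kernel with dilation d
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"dilation={d}: output {tuple(y.shape[2:])}, "
          f"receptive field {rf}x{rf}, parameters {n_params}")
```

For comparison, a standard convolution covering the same 31×31 field as the dilation-8 kernel would need 961 weights instead of 9.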

Exponential Expansion

By stacking layers with dilations $1, 2, 4, 8, \ldots$, the receptive field grows exponentially:

Layer   Dilation   Receptive Field
1       1          3×3
2       2          7×7
3       4          15×15
4       8          31×31

After $L$ layers, this gives a $(2^{L+1} - 1) \times (2^{L+1} - 1)$ receptive field.
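The closed form is easy to verify: with stride 1, each 3×3 layer adds $2d$ pixels to the receptive field, so dilations $1, 2, 4, \ldots, 2^{L-1}$ sum to $2^{L+1} - 1$. A quick check in plain Python:

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += d * (kernel_size - 1)  # each layer adds d*(k-1) pixels
    return rf

for L in range(1, 5):
    dilations = [2**i for i in range(L)]  # 1, 2, 4, ..., 2^(L-1)
    rf = receptive_field(dilations)
    print(f"L={L}: {rf}x{rf}")  # 3x3, 7x7, 15x15, 31x31 = 2^(L+1) - 1
```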

Interactive Demo

Visualize how different dilation rates expand the receptive field:

[Interactive widget: a 3×3 kernel's receptive field grows with the dilation rate while the parameter count stays fixed at 9.]

Context Module

The paper proposes a multi-scale context aggregation module:

$$\text{Context}(x) = \sum_{i=1}^{n} \text{DilatedConv}_{d_i}(x)$$

Multiple parallel dilated convolutions capture information at different scales, which are then combined for the final prediction.
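Here is a minimal PyTorch sketch of that equation, a direct translation of the formula above rather than the paper's exact architecture (the dilation rates, channel width, and bare summation are illustrative choices):

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """Parallel dilated branches summed, per the Context(x) formula above."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        )

    def forward(self, x):
        # Each branch sees context at a different scale; the sum combines them.
        return sum(branch(x) for branch in self.branches)

feats = torch.randn(1, 64, 32, 32)
print(ContextModule(64)(feats).shape)  # torch.Size([1, 64, 32, 32])
```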

Key Properties

  • No resolution loss: Unlike pooling, dilated convolutions maintain spatial dimensions (see the sketch after this list)
  • Parameter efficient: Same number of weights as standard convolution
  • Flexible: Dilation rate can be adjusted per layer or learned
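These properties can be checked directly with a short stack (a sketch; the layer count, channel width, and rates are arbitrary choices): four dilated layers with per-layer rates 1, 2, 4, 8 leave the feature map at full resolution, with the same weight count as four standard 3×3 layers.

```python
import torch
import torch.nn as nn

# Four stacked dilated layers with per-layer rates 1, 2, 4, 8.
net = nn.Sequential(*(
    nn.Sequential(nn.Conv2d(16, 16, 3, dilation=d, padding=d), nn.ReLU())
    for d in (1, 2, 4, 8)
))
x = torch.randn(1, 16, 128, 128)
print(net(x).shape)  # torch.Size([1, 16, 128, 128]) -- no resolution loss
print(sum(p.numel() for p in net.parameters()))  # same as 4 standard 3x3 layers
```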

Applications

Dilated convolutions became foundational for:

  • Semantic segmentation (DeepLab, PSPNet)
  • Audio generation (WaveNet)
  • Time series (TCN, Temporal Convolutional Networks; a causal sketch follows below)
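In the 1D setting, WaveNet and TCNs combine dilation with causal (left-only) padding so each output depends only on past inputs. A minimal sketch of that idea, without the gating and residual connections the real architectures add:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1D dilated convolution with left-only padding (causal)."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = dilation * (kernel_size - 1)
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))  # pad the past, never the future
        return self.conv(x)

layer = CausalDilatedConv1d(8, dilation=4)
print(layer(torch.randn(1, 8, 100)).shape)  # torch.Size([1, 8, 100])
```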

Key Paper

Yu, F. and Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. ICLR 2016. arXiv:1511.07122
