Deep residual learning with skip connections that enabled training of 152+ layer networks

ResNet (Residual Network) introduced skip connections that revolutionized deep learning by enabling the training of networks with over 150 layers. It won the 2015 ImageNet (ILSVRC) classification challenge with a 3.57% top-5 error rate, surpassing the commonly cited human-level benchmark of roughly 5%.

The Degradation Problem

Before ResNet, adding more layers to a network eventually degraded performance:

$$\text{Error}_{56\text{-layer}} > \text{Error}_{20\text{-layer}}$$

This wasn’t overfitting—training error also increased. Deeper networks were fundamentally harder to optimize.

Residual Learning

Instead of learning a desired mapping $H(x)$ directly, ResNet learns the residual:

$$F(x) = H(x) - x$$

The output becomes:

$$y = F(x) + x$$

The key insight: if the identity mapping is optimal, it’s easier to push $F(x) \rightarrow 0$ than to learn $H(x) = x$.
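
As a concrete illustration, here is a minimal PyTorch sketch of the $y = F(x) + x$ formulation; the `ResidualWrapper` class and the layer sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    """Turns any mapping F into a residual block computing y = F(x) + x."""
    def __init__(self, residual_fn: nn.Module):
        super().__init__()
        self.residual_fn = residual_fn  # F(x), the residual to be learned

    def forward(self, x):
        return self.residual_fn(x) + x  # the skip connection adds the input back

# F here is an arbitrary small two-layer MLP (sizes chosen only for the demo)
f = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
block = ResidualWrapper(f)

x = torch.randn(8, 64)
y = block(x)  # same shape as x

# If F outputs zero, the block reduces exactly to the identity mapping
nn.init.zeros_(f[2].weight)
nn.init.zeros_(f[2].bias)
print(torch.allclose(block(x), x))  # True
```

Starting each block near the identity in this way (zero-initializing the last layer of the residual branch) is also a common initialization trick for very deep residual networks.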

Skip Connections

The identity shortcut $x$ bypasses the layers and is added to the output:

$$y = \mathcal{F}(x, \{W_i\}) + x$$

These shortcuts:

  • Add no extra parameters (see the check after this list)
  • Enable gradient flow through hundreds of layers
  • Allow each block to refine features rather than transform them completely
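
To make the first point concrete, the following PyTorch snippet (channel sizes are illustrative) counts the parameters of an identity shortcut versus the 1×1 projection shortcut that ResNet uses only when the feature-map shape changes:

```python
import torch.nn as nn

# Residual branch F(x) of a basic block: two 3x3 convolutions with batch norm
residual_branch = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
)

# Identity shortcut: nothing to learn
identity_shortcut = nn.Identity()

# Projection shortcut, needed only when the feature-map shape changes: a strided 1x1 conv
projection_shortcut = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(n_params(residual_branch))      # 73984 learnable weights in F(x)
print(n_params(identity_shortcut))    # 0 -> the skip connection is free
print(n_params(projection_shortcut))  # 8192 (64 x 128 1x1 kernels)
```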

Block Architectures

  • Basic Block (ResNet-18/34): two 3×3 convolutions
  • Bottleneck Block (ResNet-50/101/152): 1×1 → 3×3 → 1×1 convolutions for efficiency; the 1×1 layers shrink and then restore the channel count so the 3×3 convolution operates on fewer channels (sketched below)
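
A simplified PyTorch sketch of a bottleneck block, covering only the identity-shortcut case (the strided, projection-shortcut variant and the exact layer layout of torchvision's implementation are omitted):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand, plus the shortcut."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction  # e.g. 256 channels are squeezed to 64 inside the block
        self.residual = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),        # 1x1: reduce channels
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),  # 3x3 on fewer channels
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),        # 1x1: restore channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + x)  # F(x) + x, then the final ReLU

x = torch.randn(1, 256, 56, 56)  # a typical early-stage feature map size
print(Bottleneck(256)(x).shape)  # torch.Size([1, 256, 56, 56])
```

ResNet-50 stacks 3, 4, 6, and 3 of these bottleneck blocks in its four residual stages.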

Interactive Demo

Explore residual blocks and toggle skip connections to see their effect:

[Interactive demo: a ResNet-50 (50 layers, bottleneck blocks) with residual stages 2–5 containing 3, 4, 6, and 3 blocks, each computing F(x) + x through 1×1 → 3×3 → 1×1 convolutions. Toggling the skip connections contrasts the two regimes: without them, gradients vanish and training 56+ layer networks degrades performance; with the identity shortcuts acting as gradient highways, 152+ layer networks train successfully.]

Why It Works

Skip connections create gradient highways:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(1 + \frac{\partial F}{\partial x}\right)$$

The “1” term ensures gradients flow directly backward, preventing vanishing gradients even in very deep networks.
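
The effect is easy to observe numerically. The toy PyTorch experiment below (fully connected tanh layers and an arbitrary depth of 50, chosen purely for illustration rather than taken from the paper) compares the gradient that reaches the input with and without shortcuts:

```python
import torch
import torch.nn as nn

def input_grad_norm(layers, use_skip: bool) -> float:
    """Push a batch through the stack and measure the gradient arriving at the input."""
    x = torch.randn(16, 64, requires_grad=True)
    h = x
    for layer in layers:
        h = layer(h) + h if use_skip else layer(h)  # y = F(x) + x  vs.  y = F(x)
    h.sum().backward()
    return x.grad.norm().item()

torch.manual_seed(0)
layers = [nn.Sequential(nn.Linear(64, 64), nn.Tanh()) for _ in range(50)]

print("plain stack:   ", input_grad_norm(layers, use_skip=False))  # tiny: the gradient has all but vanished
print("residual stack:", input_grad_norm(layers, use_skip=True))   # orders of magnitude larger
```

Removing the shortcuts from the same stack collapses the input gradient, which is the optimization difficulty behind the degradation problem described above.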

Impact

ResNet’s influence extends far beyond image classification:

  • Foundation for most modern vision architectures
  • Inspired the residual connections used around attention and feed-forward sublayers in Transformers
  • Enabled training of networks with 1000+ layers

Key Paper

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
