Identity Mappings in Deep Residual Networks

Pre-activation ResNet design that enables training of 1000+ layer networks

Identity Mappings in Deep Residual Networks is the follow-up paper to ResNet that analyzes skip connections more deeply and proposes an improved design—the pre-activation ResNet—which enables training networks with over 1000 layers.

The Problem with Original ResNet

In the original ResNet, the residual block applies ReLU after the addition:

h_{l+1} = \text{ReLU}(h_l + \mathcal{F}(h_l, W_l))

This means the shortcut path is not a pure identity—it passes through a nonlinearity, which can impede gradient flow in very deep networks.
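
As a concrete illustration, here is a minimal PyTorch sketch of the original post-activation basic block. The class and variable names (e.g. PostActBlock) are mine, and details such as downsampling shortcuts are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostActBlock(nn.Module):
    """Original ResNet basic block: ReLU is applied *after* the addition."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # Conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))        # Conv -> BN
        return F.relu(x + out)                 # add, then ReLU: the shortcut is not a pure identity
```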

Pure Identity Shortcuts

The key insight: for optimal gradient propagation, the shortcut should be a clean identity mapping:

h_{l+1} = h_l + \mathcal{F}(h_l, W_l)

By recursively applying this, any deep layer can directly access any shallow layer:

h_L = h_l + \sum_{i=l}^{L-1} \mathcal{F}(h_i, W_i)
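
To see where the sum comes from, unroll the recurrence for two steps (the general formula follows by induction):

h_{l+2} = h_{l+1} + \mathcal{F}(h_{l+1}, W_{l+1}) = h_l + \mathcal{F}(h_l, W_l) + \mathcal{F}(h_{l+1}, W_{l+1})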

Pre-activation Design

Moving BN and ReLU before the convolution (pre-activation) achieves pure identity shortcuts:

y = x + \text{Conv}(\text{ReLU}(\text{BN}(x)))

This subtle rearrangement has profound effects on gradient flow and enables training of 1001-layer networks.
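
For comparison, here is a minimal PyTorch sketch of the pre-activation block (again a simplified illustration with my own naming; a full network also needs projection shortcuts when the channel count or resolution changes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation block: BN and ReLU come *before* each convolution,
    so the shortcut from input to output is a pure identity."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))    # BN -> ReLU -> Conv
        out = self.conv2(F.relu(self.bn2(out)))  # BN -> ReLU -> Conv
        return x + out                           # pure identity addition, no ReLU afterwards
```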

Post-activation vs Pre-activation

The two designs differ only in where BN and ReLU sit relative to the convolutions and the addition:

Original (post-activation): x → Conv → BN → ReLU → Conv → BN → add → ReLU → out. The ReLU after the addition blocks the clean gradient path through the shortcut.

Pre-activation: x → BN → ReLU → Conv → BN → ReLU → Conv → add → out. The identity path stays clean, so gradients flow through it unchanged.

In equation form, the original block computes h_{l+1} = ReLU(h_l + F(h_l)), forcing the gradient through a ReLU after the addition, whereas the pre-activation block computes h_{l+1} = h_l + F(h_l) with BN and ReLU folded inside F, so the gradient flows directly via the identity. With this change, 1001-layer networks become trainable, reaching 4.92% error on CIFAR-10 (see the results below).

Why Pre-activation Works

The gradient of the loss with respect to any layer l follows from the unrolled form above:

\frac{\partial \mathcal{L}}{\partial h_l} = \frac{\partial \mathcal{L}}{\partial h_L} \left(1 + \frac{\partial}{\partial h_l}\sum_{i=l}^{L-1}\mathcal{F}_i\right)

The “1” term provides a direct gradient highway from h_L to h_l, unimpeded by nonlinearities.
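
A tiny autograd sketch makes the “+1” term concrete. The 1-D residual_branch below is a toy stand-in for F, not the paper's architecture:

```python
import torch

# Toy demonstration of the gradient highway: with an identity shortcut,
# dL/dh_l = dL/dh_L * (1 + dF/dh_l), so the gradient at h_l always keeps
# the direct term coming straight from h_L.
h_l = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

def residual_branch(x):
    # Stand-in for F(x, W), deliberately chosen so its local gradient is tiny.
    return 0.01 * torch.tanh(x)

h_L = h_l + residual_branch(h_l)   # identity shortcut: h_L = h_l + F(h_l)
loss = h_L.sum()                   # dL/dh_L = 1 for every element
loss.backward()

print(h_l.grad)  # each entry is 1 + dF/dh_l, i.e. slightly above 1 -- the "+1" highway
```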

Results

Architecture            CIFAR-10 error    CIFAR-100 error
ResNet-110 (original)   6.61%             -
ResNet-110 (pre-act)    6.37%             -
ResNet-1001 (pre-act)   4.92%             22.71%

Pre-activation enables training networks roughly 10× deeper (110 → 1001 layers) with better accuracy.

Key Paper

He, K., Zhang, X., Ren, S., Sun, J. Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027
