Pre-activation ResNet design that enables training of 1000+ layer networks
Identity Mappings in Deep Residual Networks is the follow-up paper to ResNet that analyzes skip connections more deeply and proposes an improved design—the pre-activation ResNet—which enables training networks with over 1000 layers.
The Problem with the Original ResNet
In the original ResNet, the residual block applies ReLU after the addition:

$$x_{l+1} = \text{ReLU}\big(x_l + \mathcal{F}(x_l, \mathcal{W}_l)\big)$$
This means the shortcut path is not a pure identity—it passes through a nonlinearity, which can impede gradient flow in very deep networks.
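As a concrete reference, here is a minimal PyTorch sketch of such a post-activation block (the channel sizes and layer details are illustrative, not the paper's exact CIFAR configuration):

```python
import torch
import torch.nn as nn

class PostActBlock(nn.Module):
    """Original ResNet block: conv-BN-ReLU, conv-BN, add the shortcut, then ReLU."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The ReLU is applied *after* the addition, so the shortcut
        # path is not a pure identity mapping.
        return self.relu(x + out)
```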
Pure Identity Shortcuts
The key insight: for optimal gradient propagation, the signal should travel over a clean identity path, with no transformation on the shortcut and nothing applied after the addition:

$$x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$$
By recursively applying this, any deep layer $L$ can directly access any shallow layer $l$:

$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$$
Pre-activation Design
Moving BN and ReLU before the convolution (pre-activation) achieves pure identity shortcuts: each residual branch becomes BN → ReLU → conv → BN → ReLU → conv, and its output is added straight onto the untouched input.
This subtle rearrangement has profound effects on gradient flow and enables training of 1001-layer networks.
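A matching PyTorch sketch of the pre-activation block (again illustrative rather than the exact published configuration); note that nothing follows the addition, so the shortcut stays a pure identity:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation block: BN-ReLU-conv, BN-ReLU-conv, then a bare addition."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        # Nothing is applied after the addition: x_{l+1} = x_l + F(x_l).
        return x + out
```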
Interactive Demo
Compare the original post-activation design with the improved pre-activation version:
[Interactive demo: Identity Mappings in ResNets]
Why Pre-activation Works
Differentiating the unrolled form above gives the gradient of the loss $\mathcal{E}$ with respect to any layer $x_l$:

$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1}\mathcal{F}(x_i, \mathcal{W}_i)\right)$$

The “1” term provides a direct gradient highway from $x_L$ to $x_l$, unimpeded by nonlinearities.
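To make that highway visible, the sketch below reuses the two illustrative block classes from above, zeroes their convolution weights so that $\mathcal{F}(x_l) = 0$, and checks what gradient reaches the input of a 100-block stack: with pre-activation the stack is an exact identity (gradient of 1 everywhere), while the post-activation stack's trailing ReLUs zero the gradient wherever the input is negative.

```python
import torch
import torch.nn as nn

def zero_residual_(block: nn.Module) -> nn.Module:
    """Zero all conv weights so the residual branch computes F(x) = 0."""
    for m in block.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.zeros_(m.weight)
    return block

depth, channels = 100, 16
post = nn.Sequential(*[zero_residual_(PostActBlock(channels)) for _ in range(depth)]).eval()
pre = nn.Sequential(*[zero_residual_(PreActBlock(channels)) for _ in range(depth)]).eval()

x = torch.randn(1, channels, 8, 8)
x_post = x.clone().requires_grad_(True)
x_pre = x.clone().requires_grad_(True)

post(x_post).sum().backward()
pre(x_pre).sum().backward()

# Pre-activation: the stacked identity shortcuts pass the gradient through
# untouched, so the gradient is exactly 1 for every input element.
print((x_pre.grad == 1).all().item())           # True
# Post-activation: the after-addition ReLUs zero the gradient wherever the
# input is negative, even though the residual branches contribute nothing.
print((x_post.grad[x < 0] == 0).all().item())   # True
```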
Results
| Architecture | CIFAR-10 Error | CIFAR-100 Error |
|---|---|---|
| ResNet-110 (original) | 6.61% | - |
| ResNet-110 (pre-act) | 6.37% | - |
| ResNet-1001 (pre-act) | 4.92% | 22.71% |
Pre-activation enables training of networks 10× deeper with better performance.
Key Paper
- Identity Mappings in Deep Residual Networks — He, Zhang, Ren, Sun (2016)
https://arxiv.org/abs/1603.05027