Normalizing across features for sequence models and Transformers
Layer Normalization (LayerNorm) normalizes across the feature dimension rather than the batch dimension. This makes it ideal for Transformers and RNNs where batch statistics are problematic.
The Key Difference
BatchNorm: Normalize across batch for each feature
LayerNorm: Normalize across features for each sample
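The axis difference is easy to see on a small activation matrix. A minimal NumPy sketch (the library choice and example values are mine), with rows as samples and columns as features:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # shape (batch=2, features=3)

# BatchNorm statistics: one mean per feature (column), computed over the batch
bn_mean = x.mean(axis=0)   # shape (3,)
# LayerNorm statistics: one mean per sample (row), computed over the features
ln_mean = x.mean(axis=1)   # shape (2,)

print(bn_mean)  # [2.5 3.5 4.5]
print(ln_mean)  # [2. 5.]
```

BatchNorm's statistics change whenever the batch changes; LayerNorm's depend only on the sample itself.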
The Algorithm
For an input $x \in \mathbb{R}^D$:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma_i \, \hat{x}_i + \beta_i, \qquad \mu = \frac{1}{D}\sum_{i=1}^{D} x_i, \qquad \sigma^2 = \frac{1}{D}\sum_{i=1}^{D} (x_i - \mu)^2$$

where $\gamma, \beta \in \mathbb{R}^D$ are learnable parameters.
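The formula above translates directly into a few lines of NumPy; this is an illustrative sketch, not a production implementation:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature dimension (last axis)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(4, 8)  # (batch, features)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))

# With gamma=1, beta=0, each row now has mean ~0 and variance ~1
print(y.mean(axis=-1))
print(y.var(axis=-1))
```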
Interactive Visualization

[Interactive widget: compares BatchNorm and LayerNorm normalization patterns side by side. Under LayerNorm, each row (sample) is normalized independently, giving μ≈0 and σ²≈1 per row.]
Why LayerNorm for Transformers?
- Batch independence: Each sequence is normalized independently
- Variable sequence lengths: No batch statistics needed
- Inference consistency: Same computation at train and test time
- Autoregressive generation: Works with batch size 1
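The batch-independence claim can be checked directly: a sample's LayerNorm output is identical whether it is processed alone or inside a larger batch. A small sketch (affine parameters omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

sample = np.array([[2.0, -1.0, 0.5, 3.0]])
batch = np.vstack([sample, np.random.randn(7, 4)])  # same sample inside a batch of 8

alone = layer_norm(sample)[0]
in_batch = layer_norm(batch)[0]

# Identical: LayerNorm never looks at the other samples in the batch
print(np.allclose(alone, in_batch))  # True
```

BatchNorm would fail this test, since its statistics mix information across the batch.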
Pre-Norm vs Post-Norm
Post-Norm (original Transformer):

$$x \leftarrow \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$

Pre-Norm (modern preference):

$$x \leftarrow x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$$
Pre-Norm enables better gradient flow and often converges faster.
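The two orderings differ only in where the normalization sits relative to the residual addition. A schematic sketch, where `sublayer` and `norm` stand in for attention/FFN and LayerNorm:

```python
def post_norm_step(x, sublayer, norm):
    # Original Transformer: normalize the sum, so the residual
    # path passes through the norm on every layer
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm):
    # Modern preference: the residual path "x + ..." is untouched,
    # giving gradients an identity shortcut through the whole stack
    return x + sublayer(norm(x))
```

That untouched residual path is precisely why pre-norm improves gradient flow.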
RMSNorm: A Simpler Alternative
Remove the mean centering and normalize by the root mean square alone:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i^2 + \epsilon}} \odot \gamma$$
Used in LLaMA and many modern LLMs—simpler and often works just as well.
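A minimal NumPy sketch of RMSNorm, mirroring the LayerNorm code but dropping the mean subtraction and the β shift:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by root-mean-square of features; no mean centering."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 8)
y = rms_norm(x, gamma=np.ones(8))

# Each row has unit root-mean-square, but generally a nonzero mean
print(np.sqrt((y**2).mean(axis=-1)))
```

Skipping the mean computation saves a reduction pass and one learnable vector, which is part of its appeal at LLM scale.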
Comparison Table
| Aspect | BatchNorm | LayerNorm | RMSNorm |
|---|---|---|---|
| Normalizes over | Batch | Features | Features |
| Learnable params | 2 × D (γ, β) | 2 × D (γ, β) | D (γ) |
| Mean centering | Yes | Yes | No |
| Best for | CNNs | Transformers | LLMs |
Common Placements in Transformers
```python
# Pre-norm Transformer block
def forward(self, x):
    x = x + self.attn(self.norm1(x))  # residual around attention sublayer
    x = x + self.ffn(self.norm2(x))   # residual around feed-forward sublayer
    return x
```
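For a fully self-contained version of the same pre-norm structure, here is a NumPy sketch where single linear maps stand in for attention and the FFN (the stand-ins are mine, chosen only to make the residual/norm wiring runnable):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

class PreNormBlock:
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-ins for attention and FFN: single linear maps
        self.w_attn = rng.standard_normal((d, d)) / np.sqrt(d)
        self.w_ffn = rng.standard_normal((d, d)) / np.sqrt(d)

    def forward(self, x):
        x = x + layer_norm(x) @ self.w_attn  # residual around "attention"
        x = x + layer_norm(x) @ self.w_ffn   # residual around "FFN"
        return x

block = PreNormBlock(d=16)
out = block.forward(np.random.randn(3, 5, 16))  # (batch, seq, d)
print(out.shape)  # (3, 5, 16)
```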
Key Insight
LayerNorm’s batch independence makes it essential for:
- Autoregressive generation (batch size 1)
- Variable-length sequences
- Distributed training across sequence dimension