Normalizing each example across its features
Layer Normalization (LayerNorm) rescales the features inside a single example so they have a more stable distribution. In practice, it helps deep models train more reliably, especially Transformers.
Read Batch Normalization first if you want the easiest comparison. This page focuses on why LayerNorm became the default in Transformer-style models.
A Small Example
Suppose one token embedding has four features:
[2, 10, -4, 8]
LayerNorm computes the mean and variance of those four numbers, normalizes them, then applies learned scale and shift parameters. The important point is that the normalization happens within that one example, not across the batch.
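That computation is short enough to work through directly. A NumPy sketch (the learned scale and shift are omitted here, and ε = 1e-5 is an assumed small stability constant):

```python
import numpy as np

x = np.array([2.0, 10.0, -4.0, 8.0])  # one token embedding with four features

mu = x.mean()                          # 4.0
var = x.var()                          # 30.0  (population variance, ddof=0)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

print(x_hat)          # ≈ [-0.365,  1.095, -1.461,  0.730]
print(x_hat.mean())   # ≈ 0
print(x_hat.var())    # ≈ 1
```

After this step the four features have roughly zero mean and unit variance, regardless of what the other examples in the batch look like.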
BatchNorm vs LayerNorm
BatchNorm normalizes a feature using statistics collected across different examples in the batch.
LayerNorm normalizes across the features of one example at a time.
That difference is why LayerNorm works well when:
- batch size is small
- sequence lengths vary
- you are generating one token at a time
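The difference between the two comes down to which axis the statistics are taken over. A NumPy sketch with a made-up 3×4 batch (3 examples, 4 features):

```python
import numpy as np

x = np.array([[2.0, 10.0, -4.0, 8.0],
              [1.0, -3.0,  5.0, 0.0],
              [7.0,  2.0, -1.0, 4.0]])

# BatchNorm-style: one mean/variance per feature, across the batch (axis 0).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# LayerNorm-style: one mean/variance per example, across features (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

print(bn.mean(axis=0))  # each column ≈ 0
print(ln.mean(axis=1))  # each row ≈ 0
```

Notice that the LayerNorm version never mixes information across rows, which is exactly why it is unaffected by batch size or by the other sequences in the batch.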
The Formula
For an input x with features x₁, …, x_d, LayerNorm first computes the per-example statistics

μ = (1/d) Σᵢ xᵢ  and  σ² = (1/d) Σᵢ (xᵢ − μ)²

and then outputs

y = γ ⊙ (x − μ) / √(σ² + ε) + β

where ε is a small constant for numerical stability, and γ and β are learnable parameters that let the model choose the final scale and offset.
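The full formula, including the learnable parameters, is only a few lines of NumPy. A minimal sketch, assuming γ is initialized to ones and β to zeros (the conventional initialization, which makes the layer start out as plain normalization):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last axis (the features of each example),
    then apply a learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([2.0, 10.0, -4.0, 8.0])
gamma = np.ones(4)   # identity scale at initialization
beta = np.zeros(4)   # no shift at initialization
print(layer_norm(x, gamma, beta))
```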
Interactive Visualization
Compare BatchNorm and LayerNorm normalization patterns:
Normalization Comparison (interactive demo). LayerNorm: each row (sample) is normalized independently, so μ≈0 and σ²≈1 per row.
Why Transformers Use It
Transformers rely on LayerNorm because it is:
- Batch-independent: each sequence can be normalized on its own
- Stable at inference time: train and test use the same basic computation
- Friendly to autoregressive generation: it still works when batch size is 1
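The batch-independence claim above can be checked directly: normalizing one example on its own gives exactly the same result as normalizing it as part of a larger batch. A quick NumPy check:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

batch = np.random.randn(8, 16)   # 8 examples, 16 features
one = batch[:1]                  # "batch size 1"

# The first row normalizes identically whether or not the other 7 rows exist.
assert np.allclose(layer_norm(batch)[0], layer_norm(one)[0])
```

The same check would fail for BatchNorm, whose training-time statistics depend on the rest of the batch.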
Pre-Norm vs Post-Norm
Two common block layouts are:
Post-Norm: x ← LayerNorm(x + Sublayer(x))
Pre-Norm: x ← x + Sublayer(LayerNorm(x))
Modern large language models often prefer pre-norm because gradients tend to behave better in deep networks.
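The two layouts can be sketched as follows. The `sublayer` here is a hypothetical stand-in for attention or an MLP (just a small fixed linear map, not a real Transformer sublayer):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Hypothetical stand-in for attention or an MLP: a fixed random linear map.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((x.shape[-1], x.shape[-1])) * 0.1
    return x @ W

def post_norm_block(x):
    # Normalize AFTER the residual add (original Transformer layout).
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # Normalize BEFORE the sublayer; the residual path stays untouched.
    return x + sublayer(layer_norm(x))
```

Note the key structural difference: in pre-norm, the residual path from input to output is an unnormalized identity connection, which is one intuition for why gradients flow more easily through deep stacks of pre-norm blocks.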
RMSNorm
RMSNorm is a close relative that skips mean-centering: it simply divides each example by the root mean square of its features,

RMSNorm(x) = γ ⊙ x / RMS(x), where RMS(x) = √((1/d) Σᵢ xᵢ² + ε)
It is a simpler normalization choice used in many modern LLMs.
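A minimal RMSNorm sketch in NumPy, assuming γ initialized to ones and ε = 1e-6 (an assumed stability constant). Compared with LayerNorm, there is no mean subtraction and no bias term:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale each example by the root mean square of its features;
    no mean subtraction, no bias term."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([2.0, 10.0, -4.0, 8.0])
print(rms_norm(x, gamma=np.ones(4)))   # ≈ [0.295, 1.474, -0.590, 1.180]
```

The output has unit root mean square rather than zero mean and unit variance, which turns out to be enough normalization in practice while saving a little computation.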
Common Confusion
- LayerNorm does not normalize across the batch
- It does not remove the need for residual connections
- It helps optimization, but it is not a substitute for good architecture design
What To Remember
- LayerNorm normalizes each example across its features
- That makes it a good fit for sequence models and Transformers
- Pre-norm and RMSNorm are the most relevant follow-on ideas for modern LLMs