Layer Normalization

Normalizing each example across its features

Layer Normalization (LayerNorm) rescales the features inside a single example so they have a more stable distribution. In practice, it helps deep models train more reliably, especially Transformers.

Read Batch Normalization first if you want the easiest comparison. This page focuses on why LayerNorm became the default in Transformer-style models.

A Small Example

Suppose one token embedding has four features:

[2, 10, -4, 8]

LayerNorm computes the mean and variance of those four numbers, normalizes them, then applies learned scale and shift parameters. The important point is that the normalization happens within that one example, not across the batch.
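The arithmetic on that vector can be checked directly. This is a minimal NumPy sketch (the `1e-5` epsilon is a typical default, not something the example above specifies):

```python
import numpy as np

# The token embedding from the example above
x = np.array([2.0, 10.0, -4.0, 8.0])

mu = x.mean()                      # 4.0
var = x.var()                      # 30.0 (population variance)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

print(x_hat)                       # roughly [-0.37, 1.10, -1.46, 0.73]
```

After this step a real LayerNorm would still multiply by a learned scale and add a learned shift.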

BatchNorm vs LayerNorm

BatchNorm normalizes a feature using statistics collected across different examples in the batch.

LayerNorm normalizes across the features of one example at a time.

That difference is why LayerNorm works well when:

  • batch size is small
  • sequence lengths vary
  • you are generating one token at a time
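The two schemes differ only in which axis the statistics are taken over. A small NumPy sketch of that contrast (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))  # (batch, features)

# BatchNorm direction: per-feature statistics, taken across the batch (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# LayerNorm direction: per-example statistics, taken across features (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

print(ln.mean(axis=1))  # each row ~0 under LayerNorm
print(bn.mean(axis=0))  # each column ~0 under BatchNorm
```

Because `ln` never mixes rows, shrinking the batch to a single example leaves its output unchanged, which is exactly why the three cases above favor LayerNorm.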

The Formula

For an input $x \in \mathbb{R}^D$:

$$\mu = \frac{1}{D}\sum_{d=1}^{D} x_d, \qquad \sigma^2 = \frac{1}{D}\sum_{d=1}^{D} (x_d - \mu)^2$$

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \odot \hat{x} + \beta$$

where $\gamma$ and $\beta$ are learnable parameters that let the model choose the final scale and offset.
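The formula translates almost line by line into code. A minimal NumPy implementation (the function name and the `eps` default are this sketch's choices):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis of x, then apply a learned scale and shift.

    x:     array of shape (..., D)
    gamma: learned scale, shape (D,)
    beta:  learned shift, shape (D,)
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Normalizing over `axis=-1` is what makes this per-example: every row gets its own mean and variance, regardless of what else is in the batch.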

Interactive Visualization

Compare BatchNorm and LayerNorm normalization patterns:

Normalization Comparison

[Interactive demo: a 4 × 6 grid of sample values (rows = samples, columns = features) shown side by side before and after normalization.]
LayerNorm: Each row (sample) is normalized independently. μ≈0, σ²≈1 per row.

Why Transformers Use It

Transformers rely on LayerNorm because it is:

  1. Batch-independent: each sequence can be normalized on its own
  2. Stable at inference time: train and test use the same basic computation
  3. Friendly to autoregressive generation: it still works when batch size is 1
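Batch-independence (points 1 and 3) is easy to verify: normalizing one sequence alone gives the same result as normalizing it inside a larger batch. A small NumPy check (shapes and seed are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
batch = rng.normal(size=(8, 16))

full = layer_norm(batch)        # normalize the whole batch
single = layer_norm(batch[:1])  # normalize the first row alone

# The first row's output does not depend on the rest of the batch
assert np.allclose(full[:1], single)
```

The same check would fail for BatchNorm, whose training-time statistics depend on the other examples in the batch.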

Pre-Norm vs Post-Norm

Two common block layouts are:

Post-Norm:

$$x' = \text{LayerNorm}(x + \text{Sublayer}(x))$$

Pre-Norm:

$$x' = x + \text{Sublayer}(\text{LayerNorm}(x))$$

Modern large language models generally prefer pre-norm because it keeps an identity path through the residual stream, which tends to make gradients better behaved in very deep networks.
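The two layouts differ only in where the normalization sits relative to the residual addition. A NumPy sketch, using `tanh` as a stand-in for a real attention or MLP sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def sublayer(x):
    # Placeholder for attention or an MLP; any function of x works here.
    return np.tanh(x)

def post_norm_block(x):
    # Normalize after the residual addition
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # Normalize only the sublayer input; the residual path is untouched
    return x + sublayer(layer_norm(x))
```

Note that in `pre_norm_block` the input `x` reaches the output unmodified through the addition, which is the identity gradient path pre-norm is valued for.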

RMSNorm

RMSNorm is a close relative that skips mean-centering:

$$\hat{x} = \frac{x}{\sqrt{\frac{1}{D}\sum_{d=1}^{D} x_d^2 + \epsilon}} \cdot \gamma$$

It is a slightly cheaper normalization choice used in many modern LLMs, including the Llama family.
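Dropping the mean subtraction makes the code even shorter. A minimal NumPy sketch (function name and `eps` default are this sketch's choices):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction: divide by the root-mean-square of the features
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```

After this step each row has root-mean-square close to 1 (before the `gamma` scaling), but unlike LayerNorm its mean is generally nonzero.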

Common Confusion

  • LayerNorm does not normalize across the batch
  • It does not remove the need for residual connections
  • It helps optimization, but it is not a substitute for good architecture design

What To Remember

  • LayerNorm normalizes each example across its features
  • That makes it a good fit for sequence models and Transformers
  • Pre-norm and RMSNorm are the most relevant follow-on ideas for modern LLMs