Layer Normalization

Normalizing each example across its features

Layer Normalization (LayerNorm) rescales the features inside a single example so they have a more stable distribution. In practice, it helps deep models train more reliably, especially Transformers.

Read Batch Normalization first if you want the easiest comparison. This page focuses on why LayerNorm became the default in Transformer-style models.

A Small Example

Suppose one token embedding has four features:

[2, 10, -4, 8]

LayerNorm computes the mean and variance of those four numbers, normalizes them, then applies learned scale and shift parameters. The important point is that the normalization happens within that one example, not across the batch.
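The arithmetic on that vector can be checked directly. This is a minimal NumPy sketch (the `1e-5` epsilon is a typical default, not something the example above specifies):

```python
import numpy as np

# The token embedding from the example above
x = np.array([2.0, 10.0, -4.0, 8.0])

mu = x.mean()                      # 4.0
var = x.var()                      # 30.0 (population variance)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

print(x_hat)                       # roughly [-0.37, 1.10, -1.46, 0.73]
```

After this step a real LayerNorm would still multiply by a learned scale and add a learned shift.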

BatchNorm vs LayerNorm

BatchNorm normalizes a feature using statistics collected across different examples in the batch.

LayerNorm normalizes across the features of one example at a time.

That difference is why LayerNorm works well when:

  • batch size is small
  • sequence lengths vary
  • you are generating one token at a time
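The two schemes differ only in which axis the statistics are taken over. A small NumPy sketch of that contrast (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))  # (batch, features)

# BatchNorm direction: per-feature statistics, taken across the batch (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# LayerNorm direction: per-example statistics, taken across features (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

print(ln.mean(axis=1))  # each row ~0 under LayerNorm
print(bn.mean(axis=0))  # each column ~0 under BatchNorm
```

Because `ln` never mixes rows, shrinking the batch to a single example leaves its output unchanged, which is exactly why the three cases above favor LayerNorm.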

The Formula

For an input $x \in \mathbb{R}^D$:

$$\mu = \frac{1}{D}\sum_{d=1}^{D} x_d, \qquad \sigma^2 = \frac{1}{D}\sum_{d=1}^{D} (x_d - \mu)^2$$

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \odot \hat{x} + \beta$$

where $\gamma$ and $\beta$ are learnable parameters that let the model choose the final scale and offset.
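The formula translates almost line by line into code. A minimal NumPy implementation (the function name and the `eps` default are this sketch's choices):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis of x, then apply a learned scale and shift.

    x:     array of shape (..., D)
    gamma: learned scale, shape (D,)
    beta:  learned shift, shape (D,)
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Normalizing over `axis=-1` is what makes this per-example: every row gets its own mean and variance, regardless of what else is in the batch.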

Interactive Visualization

Compare BatchNorm and LayerNorm normalization patterns:

Normalization Comparison

[Interactive demo: a 4 × 6 grid of sample values (rows = samples, columns = features) shown side by side before and after normalization.]
LayerNorm: Each row (sample) is normalized independently. μ≈0, σ²≈1 per row.

Why Transformers Use It

Transformers rely on LayerNorm because it is:

  1. Batch-independent: each sequence can be normalized on its own
  2. Stable at inference time: train and test use the same basic computation
  3. Friendly to autoregressive generation: it still works when batch size is 1
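Batch-independence (points 1 and 3) is easy to verify: normalizing one sequence alone gives the same result as normalizing it inside a larger batch. A small NumPy check (shapes and seed are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
batch = rng.normal(size=(8, 16))

full = layer_norm(batch)        # normalize the whole batch
single = layer_norm(batch[:1])  # normalize the first row alone

# The first row's output does not depend on the rest of the batch
assert np.allclose(full[:1], single)
```

The same check would fail for BatchNorm, whose training-time statistics depend on the other examples in the batch.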

Pre-Norm vs Post-Norm

Two common block layouts are:

Post-Norm:

$$x' = \text{LayerNorm}(x + \text{Sublayer}(x))$$

Pre-Norm:

$$x' = x + \text{Sublayer}(\text{LayerNorm}(x))$$

Modern large language models generally prefer pre-norm because it keeps an identity path through the residual stream, which tends to make gradients better behaved in very deep networks.
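The two layouts differ only in where the normalization sits relative to the residual addition. A NumPy sketch, using `tanh` as a stand-in for a real attention or MLP sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def sublayer(x):
    # Placeholder for attention or an MLP; any function of x works here.
    return np.tanh(x)

def post_norm_block(x):
    # Normalize after the residual addition
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # Normalize only the sublayer input; the residual path is untouched
    return x + sublayer(layer_norm(x))
```

Note that in `pre_norm_block` the input `x` reaches the output unmodified through the addition, which is the identity gradient path pre-norm is valued for.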

RMSNorm

RMSNorm is a close relative that skips mean-centering:

$$\hat{x} = \frac{x}{\sqrt{\frac{1}{D}\sum_{d=1}^{D} x_d^2 + \epsilon}} \cdot \gamma$$

It is a slightly cheaper normalization choice used in many modern LLMs, including the Llama family.
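Dropping the mean subtraction makes the code even shorter. A minimal NumPy sketch (function name and `eps` default are this sketch's choices):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction: divide by the root-mean-square of the features
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```

After this step each row has root-mean-square close to 1 (before the `gamma` scaling), but unlike LayerNorm its mean is generally nonzero.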

Common Confusion

  • LayerNorm does not normalize across the batch
  • It does not remove the need for residual connections
  • It helps optimization, but it is not a substitute for good architecture design

What To Remember

  • LayerNorm normalizes each example across its features
  • That makes it a good fit for sequence models and Transformers
  • Pre-norm and RMSNorm are the most relevant follow-on ideas for modern LLMs