Keeping Neural Networks Simple by Minimizing the Description Length of the Weights

Hinton and van Camp's MDL approach to neural network regularization through noisy weights

In "Keeping Neural Networks Simple by Minimizing the Description Length of the Weights" (1993), Hinton and van Camp introduced a principled approach to regularization: add Gaussian noise to the weights during training, which limits the precision (and thus the description length) of the learned parameters.

The MDL Principle

The Minimum Description Length principle states that the best model minimizes:

L(\text{model}) + L(\text{data} \mid \text{model})

For neural networks:

L(\text{weights}) + L(\text{errors} \mid \text{weights})
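
To make the two-part cost concrete, here is a toy calculation (an illustration, not code from the paper): the weights are charged a fixed number of bits each, and the residual errors are charged their Shannon code length under a Gaussian error model. The function name, sizes, and error model are assumptions for the example.

```python
import numpy as np

def two_part_mdl_cost(n_weights, bits_per_weight, errors, error_std):
    """Toy two-part MDL cost in bits.

    L(weights): each weight is sent at a fixed precision.
    L(errors | weights): residuals are coded under a zero-mean Gaussian,
    costing -log2 N(e; 0, error_std^2) bits each (up to a quantization
    constant that is the same for every model being compared).
    """
    weight_bits = n_weights * bits_per_weight
    log_density = (-0.5 * (errors / error_std) ** 2
                   - 0.5 * np.log(2 * np.pi * error_std ** 2))
    error_bits = -np.sum(log_density) / np.log(2)  # nats -> bits
    return weight_bits + error_bits

# Coarser weights are cheaper to send but typically leave larger residuals.
rng = np.random.default_rng(0)
precise = two_part_mdl_cost(400, 16, rng.normal(0, 0.1, 1000), error_std=0.5)
coarse  = two_part_mdl_cost(400, 4,  rng.normal(0, 0.4, 1000), error_std=0.5)
print(f"precise weights: {precise:.0f} bits, coarse weights: {coarse:.0f} bits")
```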

The Key Insight

Instead of precise weights, transmit noisy weights:

\tilde{w} = w + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

Higher noise σ means:

  • Fewer bits to describe weights (can use coarser precision)
  • More bits to describe prediction errors

The network learns to find the optimal noise level.
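
As a minimal sketch (with assumed layer sizes and noise levels, not the paper's code), sampling fresh Gaussian noise for the weights on every forward pass looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(x, W, b, sigma):
    """Linear layer with noisy weights: w_tilde = w + eps, eps ~ N(0, sigma^2).
    sigma may be a scalar or have one entry per weight."""
    W_tilde = W + sigma * rng.normal(size=W.shape)
    return x @ W_tilde + b

x = rng.normal(size=(32, 10))        # a batch of inputs
W = 0.1 * rng.normal(size=(10, 5))   # nominal weight means
b = np.zeros(5)

y_precise = noisy_forward(x, W, b, sigma=0.01)  # weights nearly exact
y_coarse  = noisy_forward(x, W, b, sigma=0.5)   # weights described coarsely
```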

Interactive Demo

Explore the trade-off between weight precision and noise:

[Interactive widget: sliders set the noise level and weight precision; readouts show the nominal bits, effective bits, and resulting compression as Gaussian noise is added to the weights during the forward pass.]

The MDL trade-off: Total Cost = Description(Weights) + Description(Errors | Weights). Adding noise reduces weight precision (fewer bits to transmit) but increases prediction errors. The optimal noise level balances these costs.

Mathematical Framework

Weight distribution after adding noise:

q(w) = \mathcal{N}(w; \mu, \sigma^2)

The description length (in bits) for communicating weights:

L(w) \approx \log \frac{q(w)}{p(w)}, \qquad \mathbb{E}_{q}[L(w)] = D_{KL}[q \,\|\, p]

where p(w)p(w) is a prior (often N(0,1)\mathcal{N}(0, 1)).

The Bits-Back Argument

A subtle insight: because the noisy weights are drawn from a known distribution q, the receiver can recover the random bits used to pick the sample and “give them back”. The net cost of communicating the weights is therefore

\text{Effective bits} = \underbrace{\mathbb{E}_q[-\log p(w)]}_{\text{raw cost}} - \underbrace{H[q]}_{\text{bits back}} = D_{KL}[q \,\|\, p]

This is the same KL term that appears in VAEs!
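
The accounting can be checked numerically. The sketch below (an illustration, with an assumed function name) computes, for a Gaussian q and a standard normal prior p, the raw cost of coding a sample under p, the entropy that bits-back coding refunds, and their difference, which matches the closed-form KL divergence:

```python
import numpy as np

def bits_back_accounting(mu, sigma, prior_sigma=1.0):
    """q = N(mu, sigma^2), p = N(0, prior_sigma^2). All quantities in bits
    (differential entropies; the quantization constant cancels in the
    difference)."""
    ln2 = np.log(2.0)
    h_q  = 0.5 * np.log(2 * np.pi * np.e * sigma**2) / ln2          # bits back
    h_qp = (0.5 * np.log(2 * np.pi * prior_sigma**2)
            + (sigma**2 + mu**2) / (2 * prior_sigma**2)) / ln2      # raw cost
    return h_qp, h_q, h_qp - h_q                                    # net = KL

raw, back, net = bits_back_accounting(mu=0.7, sigma=0.3)
print(f"raw {raw:.2f} bits - bits back {back:.2f} = KL {net:.2f} bits")
```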

Connection to Modern Regularization

This 1993 paper anticipated:

| MDL Concept        | Modern Equivalent     |
|--------------------|-----------------------|
| Noisy weights      | Dropout, weight noise |
| Description length | KL divergence in VAEs |
| Bits-back coding   | Variational inference |
| Optimal precision  | Learned quantization  |

Training Procedure

  1. Forward pass with noisy weights: \tilde{w}_i = w_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2)
  2. Compute the expected loss: \mathbb{E}_\epsilon[\mathcal{L}(x, y; w + \epsilon)]
  3. Update both w and σ via gradient descent
  4. The network learns which weights need high precision (see the sketch below)
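
A minimal sketch of this loop for one linear layer, assuming a one-sample Monte Carlo estimate of the expected loss, the reparameterization w̃ = μ + σ·ε, and a per-weight KL penalty against a unit Gaussian prior (the paper derives analytic expected costs; the sizes, learning rate, and penalty weight here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data.
X = rng.normal(size=(256, 10))
Y = X @ rng.normal(size=(10, 1)) + 0.1 * rng.normal(size=(256, 1))

mu  = 0.01 * rng.normal(size=(10, 1))   # weight means
rho = np.full((10, 1), -3.0)            # sigma = softplus(rho) keeps noise positive
lr, beta = 0.05, 1e-3                   # step size, weight on the description-length term

softplus = lambda z: np.log1p(np.exp(z))
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    sigma = softplus(rho)
    eps = rng.normal(size=mu.shape)
    w_tilde = mu + sigma * eps                    # noisy weights (reparameterized)
    err = X @ w_tilde - Y
    grad_w = X.T @ err / len(X)                   # gradient of 0.5 * mean squared error

    # KL[N(mu, sigma^2) || N(0, 1)] per weight: -log sigma + (sigma^2 + mu^2)/2 - 1/2
    grad_mu    = grad_w + beta * mu
    grad_sigma = grad_w * eps + beta * (sigma - 1.0 / sigma)

    mu  -= lr * grad_mu
    rho -= lr * grad_sigma * sigmoid(rho)         # chain rule through softplus

# Weights that matter keep small sigma (high precision); unimportant weights
# are allowed large sigma and become cheap to describe.
```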

Why This Matters

This paper established:

  • Information-theoretic regularization: Not just heuristics
  • Learned precision: Different weights need different precision
  • Compression = generalization: Simpler models generalize better

Key Paper

Hinton, G. E. and van Camp, D. (1993). "Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights." In Proceedings of the Sixth Annual Conference on Computational Learning Theory (COLT 1993).
