Hinton's MDL approach to neural network regularization through noisy weights
*Keeping Neural Networks Simple by Minimizing the Description Length of the Weights* (Hinton & van Camp, 1993) introduced a principled approach to regularization: add Gaussian noise to the weights during training, which limits their precision and therefore the number of bits needed to describe them.
The MDL Principle
The Minimum Description Length principle states that the best model is the one that minimizes the total cost of communicating the data:

$$L(\text{model}) + L(\text{data} \mid \text{model})$$

For neural networks, this is the cost of describing the weights plus the cost of describing the prediction errors:

$$L(\text{weights}) + L(\text{errors})$$
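As a toy, made-up comparison: a small network whose weights take 800 bits to encode and whose residual errors take 600 bits beats a larger network that fits the data more tightly but has far more expensive weights:

$$\underbrace{800 + 600}_{1400\ \text{bits}} \;<\; \underbrace{5000 + 100}_{5100\ \text{bits}}$$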
The Key Insight
Instead of transmitting each weight precisely, transmit a noisy version of it:

$$\tilde{w}_i = \mu_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$$
Higher noise means:
- Fewer bits to describe weights (can use coarser precision)
- More bits to describe prediction errors
During training, the network learns the noise level for each weight that best balances these two costs.
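A minimal sketch of the mechanics in NumPy (the function name `noisy_forward` and the toy shapes are illustrative assumptions, not from the paper): each weight used in the forward pass is its mean plus Gaussian noise whose scale controls how precisely that weight must be communicated.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(x, mu, sigma):
    """Single linear layer evaluated with noisy weights.

    mu:    per-weight means (the values we would otherwise transmit exactly)
    sigma: per-weight noise levels; larger sigma = coarser precision = fewer bits
    """
    w = mu + sigma * rng.standard_normal(mu.shape)  # sample the noisy weights
    return x @ w

# toy example: batch of 5 inputs with 3 features, 2 outputs
x = rng.standard_normal((5, 3))
mu = 0.1 * rng.standard_normal((3, 2))
sigma = np.full((3, 2), 0.05)
y = noisy_forward(x, mu, sigma)
```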
Interactive Demo
Explore the trade-off between weight precision and noise:

*(Interactive demo: MDL Weight Regularization)*
Mathematical Framework
The distribution of each weight after adding noise:

$$q(w_i) = \mathcal{N}(w_i \mid \mu_i, \sigma_i^2)$$

The description length (in bits) for communicating the weights:

$$L(w) = \sum_i \mathbb{E}_{q}\!\left[-\log_2 p(w_i)\right]$$

where $p(w)$ is a prior over the weights (often $\mathcal{N}(0, \sigma_p^2)$).
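For a Gaussian posterior $q(w_i) = \mathcal{N}(\mu_i, \sigma_i^2)$ and Gaussian prior $p(w) = \mathcal{N}(0, \sigma_p^2)$, this per-weight cost has a closed form (standard Gaussian algebra, not notation taken from the paper):

$$\mathbb{E}_{q}\!\left[-\log_2 p(w_i)\right] = \frac{1}{\ln 2}\left[\tfrac{1}{2}\ln\!\big(2\pi\sigma_p^2\big) + \frac{\mu_i^2 + \sigma_i^2}{2\sigma_p^2}\right]$$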
The Bits-Back Argument
A subtle insight: the transmitted weight is a random sample from $q$, and the $H(q)$ bits of randomness used to draw it can carry other information, so the receiver can "give back" those bits. The net cost of communicating the weights is therefore

$$L(w) = \sum_i \Big( \mathbb{E}_{q}\!\left[-\log_2 p(w_i)\right] - H\big(q(w_i)\big) \Big) = \sum_i D_{\mathrm{KL}}\big(q(w_i)\,\|\,p(w_i)\big)$$
This is the same KL term that appears in VAEs!
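A small numerical check of the bits-back identity for a single Gaussian weight (the helper names and example values are assumptions for illustration): the naive coding cost minus the entropy of $q$ equals the closed-form KL divergence.

```python
import numpy as np

def cost_bits(mu, sigma, sigma_p):
    """Bits to send one weight sampled from q = N(mu, sigma^2), coded under p = N(0, sigma_p^2)."""
    nats = 0.5 * np.log(2 * np.pi * sigma_p**2) + (mu**2 + sigma**2) / (2 * sigma_p**2)
    return nats / np.log(2)

def bits_back(sigma):
    """Differential entropy of q in bits (can be negative for a narrow q) -- the refunded bits."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def kl_bits(mu, sigma, sigma_p):
    """Closed-form KL(q || p) between the two Gaussians, in bits."""
    nats = np.log(sigma_p / sigma) + (sigma**2 + mu**2) / (2 * sigma_p**2) - 0.5
    return nats / np.log(2)

mu, sigma, sigma_p = 0.3, 0.1, 1.0
net = cost_bits(mu, sigma, sigma_p) - bits_back(sigma)
print(net, kl_bits(mu, sigma, sigma_p))   # both print ~2.67 bits
```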
Connection to Modern Regularization
This 1993 paper anticipated:
| MDL Concept | Modern Equivalent |
|---|---|
| Noisy weights | Dropout, weight noise |
| Description length | KL divergence in VAEs |
| Bits-back coding | Variational inference |
| Optimal precision | Learned quantization |
Training Procedure
- Forward pass with noisy weights: $w_i = \mu_i + \sigma_i \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, 1)$
- Compute the expected loss: $\mathcal{L} = \mathbb{E}_{w \sim q}[\text{prediction error}] + \sum_i D_{\mathrm{KL}}\big(q(w_i)\,\|\,p(w_i)\big)$
- Update both $\mu_i$ and $\sigma_i$ via gradient descent (see the sketch below)
- The network learns which weights need high precision (small $\sigma_i$) and which can stay noisy
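A minimal sketch of this loop in PyTorch (the toy linear model, the Adam optimizer, the learning rate, and the per-example KL weighting are assumptions for illustration, not the paper's exact procedure):

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 8)                       # toy regression data
y = X @ torch.randn(8, 1) + 0.1 * torch.randn(256, 1)

# variational parameters: one mean and one log-std per weight
mu = torch.zeros(8, 1, requires_grad=True)
log_sigma = torch.full((8, 1), -3.0, requires_grad=True)
sigma_p = 1.0                                 # prior std, assuming p(w) = N(0, sigma_p^2)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    sigma = log_sigma.exp()
    w = mu + sigma * torch.randn_like(mu)     # reparameterized noisy weights
    data_loss = ((X @ w - y) ** 2).mean()     # one-sample estimate of the expected error
    kl = (torch.log(sigma_p / sigma)          # closed-form KL(q || p) per weight, in nats
          + (sigma**2 + mu**2) / (2 * sigma_p**2) - 0.5).sum()
    loss = data_loss + kl / len(X)            # description length amortized over the data
    opt.zero_grad()
    loss.backward()
    opt.step()

# weights that matter end up with small sigma (high precision); the rest stay noisy
print(log_sigma.exp().flatten())
```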
Why This Matters
This paper established:
- Information-theoretic regularization: Not just heuristics
- Learned precision: Different weights need different precision
- Compression = generalization: Simpler models generalize better
Key Paper
- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights — Hinton, van Camp (COLT 1993): https://www.cs.toronto.edu/~hinton/absps/colt93.pdf