Grünwald's comprehensive guide to MDL for model selection and learning
A Tutorial Introduction to the Minimum Description Length Principle by Peter Grünwald is the definitive guide to MDL—a principled approach to model selection based on data compression.
The Core Idea
The best model is the one that compresses data most:
- : Bits to describe the model itself
- : Bits to describe prediction errors
Minimize the sum, not just the error.
Occam’s Razor, Formalized
MDL provides a mathematical justification for preferring simpler models:
But simpler models may have higher errors. MDL finds the optimal trade-off.
Interactive Demo
Compare models of varying complexity:
MDL Principle
Two-Part Codes
The simplest MDL approach:
- Encode the model using bits
- Encode the data given the model using bits
- Choose model minimizing
This is “crude” MDL—refined versions exist.
Connection to Bayesian Inference
MDL relates to the Bayesian posterior:
Maximizing posterior ≈ minimizing description length.
Normalized Maximum Likelihood
For refined MDL, use the stochastic complexity:
This accounts for model flexibility more carefully.
Key Applications
| Domain | MDL Application |
|---|---|
| Model selection | Choose polynomial degree |
| Change detection | Find breakpoints in time series |
| Clustering | Determine number of clusters |
| Feature selection | Which features to include |
MDL vs. Other Criteria
| Criterion | Formula |
|---|---|
| AIC | |
| BIC | |
| MDL |
MDL is often equivalent to BIC but has stronger theoretical foundations.
Practical Insights
- More complex ≠ Better: Overfitting wastes bits on noise
- Compression = Learning: A good compressor is a good predictor
- Prior knowledge: Model encoding reflects assumptions
Why Ilya Included This
MDL connects:
- Kolmogorov complexity (theoretical)
- Statistical learning (practical)
- Neural network regularization (applied)
Understanding MDL illuminates why techniques like dropout and weight decay work.
Key Resource
- A Tutorial Introduction to the Minimum Description Length Principle — Grünwald (2005)
https://arxiv.org/abs/math/0406077