A Tutorial Introduction to the Minimum Description Length Principle

Grünwald's comprehensive guide to MDL for model selection and learning

A Tutorial Introduction to the Minimum Description Length Principle by Peter Grünwald is the definitive guide to MDL—a principled approach to model selection based on data compression.

The Core Idea

The best model is the one that compresses data most:

L_{\text{total}} = L(\text{model}) + L(\text{data} \mid \text{model})
  • L(model): Bits to describe the model itself
  • L(data | model): Bits to describe the prediction errors given the model

Minimize the sum, not just the error.
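
To make this concrete, here is a minimal sketch (not from the tutorial; the 32-bits-per-coefficient model code and the Gaussian residual code are assumptions of the example) that scores polynomial fits of different degrees by a crude two-part code and keeps the degree with the shortest total:

```python
import numpy as np

def description_length(x, y, degree, bits_per_coeff=32):
    """Crude two-part score for a polynomial fit of the given degree.

    L(model): each coefficient stored at a fixed precision (bits_per_coeff).
    L(data|model): residuals coded with a Gaussian code, i.e. the negative
    log2-likelihood at the fitted noise variance.  A real code would also
    discretize the residuals, but that adds the same constant to every
    model and so never changes which degree wins.
    """
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)          # guard against a perfect fit
    model_bits = bits_per_coeff * (degree + 1)
    data_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * sigma2)
    return model_bits + data_bits

# Sweep over degrees and keep the one that compresses the data best.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 3 * x**2 - x + 0.5 + rng.normal(scale=0.1, size=x.size)
best_degree = min(range(6), key=lambda d: description_length(x, y, d))
print("MDL-selected degree:", best_degree)        # typically 2 for this signal
```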

Occam’s Razor, Formalized

MDL provides a mathematical justification for preferring simpler models:

\text{Simple model} \Rightarrow \text{Low } L(\text{model})

But simpler models may make larger errors, increasing L(data | model). MDL finds the optimal trade-off.

Interactive Demo

Compare models of varying complexity (illustrative bit counts from the demo):

| Model         | Total description length |
|---------------|--------------------------|
| Constant      | 85 bits                  |
| Linear        | 55 bits                  |
| Quadratic     | 45 bits (optimal)        |
| Polynomial-10 | 105 bits                 |
| Lookup Table  | 200 bits                 |

For the selected quadratic model, y = ax² + bx + c, the 45 bits split into 25 bits for the model and 20 bits for the errors.

The Trade-off
Simple models need few bits but make many errors. Complex models fit perfectly but require many bits to describe. MDL finds the sweet spot.

Two-Part Codes

The simplest MDL approach:

  1. Encode the model using L_1 bits
  2. Encode the data given the model using L_2 bits
  3. Choose the model minimizing L_1 + L_2

This is “crude” MDL—refined versions exist.
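
What L_1 looks like depends on the chosen model encoding. One common device, sketched below under my own assumptions (a quantization step of 0.01, Elias gamma as the universal integer code), is to quantize each parameter and encode the resulting integer so that small, "simple" values cost fewer bits:

```python
import math

def elias_gamma_bits(n: int) -> int:
    """Length in bits of the Elias gamma codeword for a positive integer n."""
    if n < 1:
        raise ValueError("Elias gamma codes positive integers only")
    return 2 * int(math.log2(n)) + 1

def zigzag(k: int) -> int:
    """Map signed integers 0, 1, -1, 2, -2, ... to positive 1, 3, 2, 5, 4, ..."""
    return 2 * k + 1 if k >= 0 else -2 * k

def model_bits(coeffs, precision=0.01):
    """L_1: quantize each coefficient to `precision`, then code the resulting
    signed integer with a universal code, so small values are cheap."""
    return sum(elias_gamma_bits(zigzag(round(c / precision))) for c in coeffs)

print(model_bits([3.0, -1.0, 0.5]))                    # the quadratic: 47 bits
print(model_bits([3.0, -1.0, 0.5, 0.0001, -0.0002]))   # near-zero extras cost 1 bit each
```

A complete encoding would also spend bits on the number of coefficients and on the precision itself; those choices are where prior knowledge about which models count as simple enters the code.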

Connection to Bayesian Inference

MDL relates to the Bayesian posterior:

\log P(\text{model} \mid \text{data}) = \log P(\text{data} \mid \text{model}) + \log P(\text{model}) - \log P(\text{data})

Maximizing posterior ≈ minimizing description length.
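
To spell the correspondence out (a standard step, stated here in the document's own notation): negate both sides and read each -\log P as a Shannon code length.

-\log P(\text{model} \mid \text{data}) = \underbrace{-\log P(\text{data} \mid \text{model})}_{L(\text{data} \mid \text{model})} + \underbrace{-\log P(\text{model})}_{L(\text{model})} + \log P(\text{data})

Since \log P(\text{data}) is the same for every model, the model with the highest posterior is exactly the one with the shortest two-part code, with code lengths given by the prior and the likelihood.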

Normalized Maximum Likelihood

For refined MDL, score the data by its stochastic complexity, the code length under the normalized maximum likelihood (NML) distribution:

\text{SC}(x \mid \mathcal{M}) = -\log \frac{P(x \mid \hat{\theta}(x), \mathcal{M})}{\sum_{x'} P(x' \mid \hat{\theta}(x'), \mathcal{M})} = -\log P(x \mid \hat{\theta}(x), \mathcal{M}) + \text{COMP}(\mathcal{M})

where \text{COMP}(\mathcal{M}) = \log \sum_{x'} P(x' \mid \hat{\theta}(x'), \mathcal{M}) is the parametric complexity of the model class.

This accounts for model flexibility more carefully.
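
For a model class where the sum over x' is tractable, the NML quantities can be computed exactly. The sketch below (my own example, not from the text) does this for the Bernoulli model on binary sequences of length n, grouping sequences by their number of ones:

```python
import math

def bernoulli_comp(n: int) -> float:
    """Parametric complexity COMP(M) = log2 sum_{x'} P(x' | theta_hat(x'))
    for the Bernoulli model on binary sequences of length n.  All sequences
    with k ones share the maximum-likelihood parameter k/n, so the sum over
    sequences can be grouped by k."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        max_lik = (p ** k) * ((1 - p) ** (n - k))   # 0**0 == 1 handles k = 0, n
        total += math.comb(n, k) * max_lik
    return math.log2(total)

def stochastic_complexity(x) -> float:
    """-log2 P_NML(x) = -log2 P(x | theta_hat(x)) + COMP(M), in bits."""
    n, k = len(x), sum(x)
    p = k / n
    max_lik = (p ** k) * ((1 - p) ** (n - k))
    return -math.log2(max_lik) + bernoulli_comp(n)

x = [1, 0, 1, 1, 0, 1, 0, 1]
print(stochastic_complexity(x))    # refined-MDL code length of x, in bits
```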

Key Applications

| Domain            | MDL Application                  |
|-------------------|----------------------------------|
| Model selection   | Choose polynomial degree         |
| Change detection  | Find breakpoints in time series  |
| Clustering        | Determine the number of clusters |
| Feature selection | Which features to include        |
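
As one concrete instance of the change-detection row above, here is a rough sketch (assumptions mine: a piecewise-constant signal, a Gaussian residual code, 16 bits per stored segment mean, and about log2 n bits to state a breakpoint) that accepts a single breakpoint only if it pays for itself in total description length:

```python
import numpy as np

def segment_bits(seg, bits_per_param=16):
    """Two-part cost of one segment: its mean at a fixed precision plus
    Gaussian-coded residuals around that mean."""
    resid = seg - seg.mean()
    sigma2 = max(resid.var(), 1e-12)
    data_bits = 0.5 * len(seg) * np.log2(2 * np.pi * np.e * sigma2)
    return bits_per_param + data_bits

def mdl_changepoint(x, bits_per_param=16):
    """Return (breakpoint, total_bits); breakpoint is None when a single
    segment already gives the shortest description."""
    best = (None, segment_bits(x, bits_per_param))
    for t in range(2, len(x) - 1):
        # a split must also pay ~log2(n) bits to say where the break is
        cost = (np.log2(len(x))
                + segment_bits(x[:t], bits_per_param)
                + segment_bits(x[t:], bits_per_param))
        if cost < best[1]:
            best = (t, cost)
    return best

rng = np.random.default_rng(1)
signal = np.concatenate([rng.normal(0.0, 0.2, 60), rng.normal(1.5, 0.2, 40)])
print(mdl_changepoint(signal))     # breakpoint reported near index 60
```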

MDL vs. Other Criteria

| Criterion | Formula                            |
|-----------|------------------------------------|
| AIC       | -2\log L + 2k                      |
| BIC       | -2\log L + k\log n                 |
| MDL       | -\log L + \text{COMP}(\mathcal{M}) |

Asymptotically, MDL often behaves like BIC, since COMP(\mathcal{M}) grows roughly as (k/2)\log n, but MDL has stronger theoretical foundations.
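
For a side-by-side computation, a small helper like the one below can be used (the (k/2)·log2 n term is a crude stand-in for COMP(M), an assumption of this sketch rather than the refined formula):

```python
import math

def aic(log_lik: float, k: int) -> float:
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik: float, k: int, n: int) -> float:
    return -2.0 * log_lik + k * math.log(n)

def mdl_two_part(log_lik: float, k: int, n: int) -> float:
    """Two-part score in bits: -log2 L plus ~(k/2) log2 n bits to state k
    parameters at 1/sqrt(n) precision (a crude stand-in for COMP(M)).
    `log_lik` is assumed to be in nats."""
    return -log_lik / math.log(2) + 0.5 * k * math.log2(n)

# Two fits of the same data (n = 100): the extra parameter buys only a tiny
# likelihood gain, so all three criteria prefer the simpler fit here.
for name, ll, k in [("simple", -120.0, 3), ("complex", -119.5, 4)]:
    print(name, round(aic(ll, k), 1), round(bic(ll, k, 100), 1),
          round(mdl_two_part(ll, k, 100), 1))
```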

Practical Insights

  1. More complex ≠ Better: Overfitting wastes bits on noise
  2. Compression = Learning: A good compressor is a good predictor
  3. Prior knowledge: Model encoding reflects assumptions

Why Ilya Included This

MDL connects:

  • Kolmogorov complexity (theoretical)
  • Statistical learning (practical)
  • Neural network regularization (applied)

Understanding MDL illuminates why techniques like dropout and weight decay work.

Key Resource

Peter Grünwald, A Tutorial Introduction to the Minimum Description Length Principle.
