A Tutorial Introduction to the Minimum Description Length Principle

Grünwald's comprehensive guide to MDL for model selection and learning

A Tutorial Introduction to the Minimum Description Length Principle by Peter Grünwald is the definitive guide to MDL—a principled approach to model selection based on data compression.

The Core Idea

The best model is the one that compresses data most:

L_{\text{total}} = L(\text{model}) + L(\text{data} \mid \text{model})
  • L(model): Bits to describe the model itself
  • L(data | model): Bits to describe the prediction errors given the model

Minimize the sum, not just the error.
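
To make this concrete, here is a minimal sketch (not from the tutorial; the 32-bits-per-coefficient model code and the Gaussian residual code are assumptions of the example) that scores polynomial fits of different degrees by a crude two-part code and keeps the degree with the shortest total:

```python
import numpy as np

def description_length(x, y, degree, bits_per_coeff=32):
    """Crude two-part score for a polynomial fit of the given degree.

    L(model): each coefficient stored at a fixed precision (bits_per_coeff).
    L(data|model): residuals coded with a Gaussian code, i.e. the negative
    log2-likelihood at the fitted noise variance.  A real code would also
    discretize the residuals, but that adds the same constant to every
    model and so never changes which degree wins.
    """
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)          # guard against a perfect fit
    model_bits = bits_per_coeff * (degree + 1)
    data_bits = 0.5 * len(y) * np.log2(2 * np.pi * np.e * sigma2)
    return model_bits + data_bits

# Sweep over degrees and keep the one that compresses the data best.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 3 * x**2 - x + 0.5 + rng.normal(scale=0.1, size=x.size)
best_degree = min(range(6), key=lambda d: description_length(x, y, d))
print("MDL-selected degree:", best_degree)        # typically 2 for this signal
```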

Occam’s Razor, Formalized

MDL provides a mathematical justification for preferring simpler models:

\text{Simple model} \Rightarrow \text{Low } L(\text{model})

But simpler models may make larger errors, increasing L(data | model). MDL finds the optimal trade-off.

Interactive Demo

Compare models of varying complexity (illustrative bit counts from the demo):

| Model         | Total description length |
|---------------|--------------------------|
| Constant      | 85 bits                  |
| Linear        | 55 bits                  |
| Quadratic     | 45 bits (optimal)        |
| Polynomial-10 | 105 bits                 |
| Lookup Table  | 200 bits                 |

For the selected quadratic model, y = ax² + bx + c, the 45 bits split into 25 bits for the model and 20 bits for the errors.

The Trade-off
Simple models need few bits but make many errors. Complex models fit perfectly but require many bits to describe. MDL finds the sweet spot.

Two-Part Codes

The simplest MDL approach:

  1. Encode the model using L_1 bits
  2. Encode the data given the model using L_2 bits
  3. Choose the model minimizing L_1 + L_2

This is “crude” MDL—refined versions exist.
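
What L_1 looks like depends on the chosen model encoding. One common device, sketched below under my own assumptions (a quantization step of 0.01, Elias gamma as the universal integer code), is to quantize each parameter and encode the resulting integer so that small, "simple" values cost fewer bits:

```python
import math

def elias_gamma_bits(n: int) -> int:
    """Length in bits of the Elias gamma codeword for a positive integer n."""
    if n < 1:
        raise ValueError("Elias gamma codes positive integers only")
    return 2 * int(math.log2(n)) + 1

def zigzag(k: int) -> int:
    """Map signed integers 0, 1, -1, 2, -2, ... to positive 1, 3, 2, 5, 4, ..."""
    return 2 * k + 1 if k >= 0 else -2 * k

def model_bits(coeffs, precision=0.01):
    """L_1: quantize each coefficient to `precision`, then code the resulting
    signed integer with a universal code, so small values are cheap."""
    return sum(elias_gamma_bits(zigzag(round(c / precision))) for c in coeffs)

print(model_bits([3.0, -1.0, 0.5]))                    # the quadratic: 47 bits
print(model_bits([3.0, -1.0, 0.5, 0.0001, -0.0002]))   # near-zero extras cost 1 bit each
```

A complete encoding would also spend bits on the number of coefficients and on the precision itself; those choices are where prior knowledge about which models count as simple enters the code.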

Connection to Bayesian Inference

MDL relates to the Bayesian posterior:

\log P(\text{model} \mid \text{data}) = \log P(\text{data} \mid \text{model}) + \log P(\text{model}) - \log P(\text{data})

Maximizing posterior ≈ minimizing description length.
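
To spell the correspondence out (a standard step, stated here in the document's own notation): negate both sides and read each -\log P as a Shannon code length.

-\log P(\text{model} \mid \text{data}) = \underbrace{-\log P(\text{data} \mid \text{model})}_{L(\text{data} \mid \text{model})} + \underbrace{-\log P(\text{model})}_{L(\text{model})} + \log P(\text{data})

Since \log P(\text{data}) is the same for every model, the model with the highest posterior is exactly the one with the shortest two-part code, with code lengths given by the prior and the likelihood.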

Normalized Maximum Likelihood

For refined MDL, score the data by its stochastic complexity, the code length under the normalized maximum likelihood (NML) distribution:

\text{SC}(x \mid \mathcal{M}) = -\log \frac{P(x \mid \hat{\theta}(x), \mathcal{M})}{\sum_{x'} P(x' \mid \hat{\theta}(x'), \mathcal{M})} = -\log P(x \mid \hat{\theta}(x), \mathcal{M}) + \text{COMP}(\mathcal{M})

where \text{COMP}(\mathcal{M}) = \log \sum_{x'} P(x' \mid \hat{\theta}(x'), \mathcal{M}) is the parametric complexity of the model class.

This accounts for model flexibility more carefully.
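
For a model class where the sum over x' is tractable, the NML quantities can be computed exactly. The sketch below (my own example, not from the text) does this for the Bernoulli model on binary sequences of length n, grouping sequences by their number of ones:

```python
import math

def bernoulli_comp(n: int) -> float:
    """Parametric complexity COMP(M) = log2 sum_{x'} P(x' | theta_hat(x'))
    for the Bernoulli model on binary sequences of length n.  All sequences
    with k ones share the maximum-likelihood parameter k/n, so the sum over
    sequences can be grouped by k."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        max_lik = (p ** k) * ((1 - p) ** (n - k))   # 0**0 == 1 handles k = 0, n
        total += math.comb(n, k) * max_lik
    return math.log2(total)

def stochastic_complexity(x) -> float:
    """-log2 P_NML(x) = -log2 P(x | theta_hat(x)) + COMP(M), in bits."""
    n, k = len(x), sum(x)
    p = k / n
    max_lik = (p ** k) * ((1 - p) ** (n - k))
    return -math.log2(max_lik) + bernoulli_comp(n)

x = [1, 0, 1, 1, 0, 1, 0, 1]
print(stochastic_complexity(x))    # refined-MDL code length of x, in bits
```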

Key Applications

| Domain            | MDL Application                  |
|-------------------|----------------------------------|
| Model selection   | Choose polynomial degree         |
| Change detection  | Find breakpoints in time series  |
| Clustering        | Determine the number of clusters |
| Feature selection | Which features to include        |
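
As one concrete instance of the change-detection row above, here is a rough sketch (assumptions mine: a piecewise-constant signal, a Gaussian residual code, 16 bits per stored segment mean, and about log2 n bits to state a breakpoint) that accepts a single breakpoint only if it pays for itself in total description length:

```python
import numpy as np

def segment_bits(seg, bits_per_param=16):
    """Two-part cost of one segment: its mean at a fixed precision plus
    Gaussian-coded residuals around that mean."""
    resid = seg - seg.mean()
    sigma2 = max(resid.var(), 1e-12)
    data_bits = 0.5 * len(seg) * np.log2(2 * np.pi * np.e * sigma2)
    return bits_per_param + data_bits

def mdl_changepoint(x, bits_per_param=16):
    """Return (breakpoint, total_bits); breakpoint is None when a single
    segment already gives the shortest description."""
    best = (None, segment_bits(x, bits_per_param))
    for t in range(2, len(x) - 1):
        # a split must also pay ~log2(n) bits to say where the break is
        cost = (np.log2(len(x))
                + segment_bits(x[:t], bits_per_param)
                + segment_bits(x[t:], bits_per_param))
        if cost < best[1]:
            best = (t, cost)
    return best

rng = np.random.default_rng(1)
signal = np.concatenate([rng.normal(0.0, 0.2, 60), rng.normal(1.5, 0.2, 40)])
print(mdl_changepoint(signal))     # breakpoint reported near index 60
```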

MDL vs. Other Criteria

| Criterion | Formula                            |
|-----------|------------------------------------|
| AIC       | -2\log L + 2k                      |
| BIC       | -2\log L + k\log n                 |
| MDL       | -\log L + \text{COMP}(\mathcal{M}) |

Asymptotically, MDL often behaves like BIC, since COMP(\mathcal{M}) grows roughly as (k/2)\log n, but MDL has stronger theoretical foundations.
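
For a side-by-side computation, a small helper like the one below can be used (the (k/2)·log2 n term is a crude stand-in for COMP(M), an assumption of this sketch rather than the refined formula):

```python
import math

def aic(log_lik: float, k: int) -> float:
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik: float, k: int, n: int) -> float:
    return -2.0 * log_lik + k * math.log(n)

def mdl_two_part(log_lik: float, k: int, n: int) -> float:
    """Two-part score in bits: -log2 L plus ~(k/2) log2 n bits to state k
    parameters at 1/sqrt(n) precision (a crude stand-in for COMP(M)).
    `log_lik` is assumed to be in nats."""
    return -log_lik / math.log(2) + 0.5 * k * math.log2(n)

# Two fits of the same data (n = 100): the extra parameter buys only a tiny
# likelihood gain, so all three criteria prefer the simpler fit here.
for name, ll, k in [("simple", -120.0, 3), ("complex", -119.5, 4)]:
    print(name, round(aic(ll, k), 1), round(bic(ll, k, 100), 1),
          round(mdl_two_part(ll, k, 100), 1))
```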

Practical Insights

  1. More complex ≠ Better: Overfitting wastes bits on noise
  2. Compression = Learning: A good compressor is a good predictor
  3. Prior knowledge: Model encoding reflects assumptions

Why Ilya Included This

MDL connects:

  • Kolmogorov complexity (theoretical)
  • Statistical learning (practical)
  • Neural network regularization (applied)

Understanding MDL illuminates why techniques like dropout and weight decay work.

Key Resource

Peter Grünwald, A Tutorial Introduction to the Minimum Description Length Principle.
