Gradient Boosted Decision Trees

Sequential tree ensembles optimized via gradient descent

Gradient Boosted Decision Trees (GBDT) are an ensemble method that builds decision trees sequentially, where each new tree is trained to correct the errors of the current ensemble.

Core Idea: Gradient Descent in Function Space

GBDT performs gradient descent over functions rather than parameters. We iteratively improve a model F(x) by adding a new function (tree) that points in the direction of steepest loss reduction.

At iteration m:

F_m(x) = F_{m-1}(x) + \eta \, h_m(x)
  • h_m(x) is a regression tree
  • η is the learning rate
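
Unrolling this recursion from the initial model F_0(x) (typically a constant, such as the mean of y for squared error), the ensemble after M rounds is an additive sum of trees scaled by the learning rate:

F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)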

Loss and Gradients

For squared error loss:

L(y, F(x)) = \frac{1}{2}(y - F(x))^2

The negative gradient (residual) is:

-\frac{\partial L}{\partial F(x)} = y - F(x)

Each tree is trained to predict these residuals.
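
To make the loop concrete, here is a minimal from-scratch sketch for squared-error loss, using scikit-learn's DecisionTreeRegressor as the base learner. The function names, hyperparameter values, and constant initial prediction are illustrative choices, not taken from any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT for squared-error loss: each tree is fit to the current residuals."""
    f0 = float(np.mean(y))                 # F_0(x): constant initial prediction
    pred = np.full(len(y), f0)             # current ensemble prediction F_{m-1}(x)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred               # negative gradient of 1/2 (y - F(x))^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # h_m(x) approximates the residuals
        pred = pred + learning_rate * tree.predict(X)   # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees

def predict_gbdt(model, X, learning_rate=0.1):
    f0, trees = model
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```

Each added tree nudges the prediction toward the targets, and the learning rate controls how large each nudge is.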

Interactive Visualization

The demo below shows:

  • Data points and ensemble prediction curve
  • Residuals as vertical lines
  • Individual trees contributing to the final model
  • Loss decreasing as trees are added

In the demo, gray lines show the residuals and the curve is the ensemble prediction.

Key Hyperparameters

  • learning_rate (η) – shrinkage applied to each tree's contribution; smaller values usually generalize better but require more trees
  • n_estimators – number of trees in the ensemble
  • max_depth – controls tree complexity and interaction order
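
The same knobs appear directly in scikit-learn's GradientBoostingRegressor; the values below are a minimal illustrative sketch, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic regression problem just to exercise the estimator.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    learning_rate=0.1,   # eta: shrinks each tree's contribution
    n_estimators=200,    # number of boosting rounds (trees)
    max_depth=3,         # limits per-tree complexity and interaction order
)
model.fit(X, y)
print(model.predict(X[:5]))
```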

GBDT vs Random Forest

Aspect          GBDT                           Random Forest
Training        Sequential                     Parallel
Bias–Variance   Low bias                       Low variance
Optimization    Gradient-based                 Bagging
Overfitting     Controlled via learning rate   Controlled via averaging

GBDT implementations are widely available in:

  • xgboost
  • lightgbm
  • catboost
  • sklearn.ensemble.GradientBoosting*
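
For example, a roughly equivalent setup with xgboost's scikit-learn-style wrapper looks like the sketch below; the dataset and parameter values are illustrative only.

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X, y)
preds = model.predict(X[:5])
```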

When to Use GBDT

  • Tabular data with mixed feature types
  • Medium-sized datasets
  • When strong performance and interpretability matter