Adaptive learning rates with momentum for deep learning
Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2014, is one of the most widely used optimizers in deep learning. It combines momentum with per-parameter adaptive learning rates, requiring minimal tuning while working well across diverse problems.
The Problem with Vanilla SGD
Stochastic Gradient Descent faces challenges:
- Same learning rate for all parameters (bad for sparse features)
- Struggles with saddle points and ravines
- Requires careful learning rate scheduling
Adam’s Solution
Adam maintains two moving averages for each parameter:
- First moment (momentum): m = β₁m + (1-β₁)∇L
- Second moment (adaptive LR): v = β₂v + (1-β₂)∇L²

Then updates weights using:

θ = θ - α · m̂ / (√v̂ + ε)

Bias Correction
Because m and v are initialized at zero, both estimates are biased toward zero early in training. Adam corrects this:

m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)

This correction is crucial for proper early training.
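Plugging t = 1 into the corrections shows the effect concretely: without them, a constant gradient of 1.0 would look 10x smaller in the first moment (a minimal numeric sketch):

```python
# Illustrative numbers: first step (t = 1) with a constant gradient of 1.0.
beta1, beta2 = 0.9, 0.999
g = 1.0
m = beta1 * 0.0 + (1 - beta1) * g      # m starts at 0, so m = 0.1
v = beta2 * 0.0 + (1 - beta2) * g**2   # v starts at 0, so v = 0.001
t = 1
m_hat = m / (1 - beta1**t)             # rescales 0.1 back to 1.0, the true gradient
v_hat = v / (1 - beta2**t)             # rescales 0.001 back to 1.0
```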
The Complete Algorithm
```python
import numpy as np

def adam(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. params, m, v are lists of NumPy arrays, updated in place."""
    t += 1
    for i, (p, g) in enumerate(zip(params, grads)):
        # Update biased first and second moment estimates
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g**2
        # Bias-correct the moments
        m_hat = m[i] / (1 - beta1**t)
        v_hat = v[i] / (1 - beta2**t)
        # Update parameters in place so the caller sees the change
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return t
```
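As a quick sanity check, the same update rule drives a toy 1-D quadratic toward its minimum (a standalone sketch; the function, starting point, and step count are illustrative):

```python
import math

# Adam minimizing f(theta) = theta**2, starting from theta = 5.
theta, m, v = 5.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = 2.0 * theta                        # gradient of theta**2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

After a few hundred steps theta sits close to the minimum at 0, despite the fixed learning rate.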
Default Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| α (lr) | 0.001 | Step size |
| β₁ | 0.9 | First moment decay |
| β₂ | 0.999 | Second moment decay |
| ε | 1e-8 | Numerical stability |
These defaults work remarkably well across many problems.
Why Adam Works
Momentum (m)
- Accumulates gradient direction over time
- Smooths out noisy gradients
- Helps escape shallow local minima
Adaptive Learning Rate (v)
- Parameters with large gradients get smaller steps
- Parameters with small gradients get larger steps
- No manual learning rate scheduling needed
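This scale invariance is easy to check numerically: for a constant gradient, m̂ ≈ g and v̂ ≈ g² in steady state, so the step collapses to roughly lr · sign(g) (a sketch; the 100,000x gradient gap is an arbitrary illustration):

```python
import numpy as np

# Hypothetical steady state: m_hat ≈ g and v_hat ≈ g**2 for constant gradients.
g = np.array([100.0, 0.001])            # two parameters, 100,000x gradient gap
lr, eps = 0.001, 1e-8
step = lr * g / (np.sqrt(g**2) + eps)   # effective Adam step per parameter
# Both parameters take (almost) the same-sized step of ~lr.
```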
Adam Variants
| Variant | Improvement |
|---|---|
| AdamW | Decoupled weight decay (better generalization) |
| AMSGrad | Non-increasing step sizes (convergence fix) |
| RAdam | Rectified Adam (variance correction) |
| AdaFactor | Memory-efficient for large models |
| LAMB | Layer-wise adaptive for large batch training |
| Lion | Simplified, sign-based updates |
AdamW: The Modern Default
Standard Adam’s L2 regularization is coupled with the adaptive gradient scaling, so parameters with large historical gradients are regularized less than intended. AdamW decouples weight decay from the gradient step (λ is the weight decay coefficient):

θ = θ - α · m̂ / (√v̂ + ε) - α · λ · θ
AdamW is now the default for transformers and large language models.
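The decoupling amounts to applying the decay term directly to the raw weights, outside the adaptive step (a hypothetical single-step sketch; `adamw_update` and its defaults are illustrative):

```python
import numpy as np

def adamw_update(p, m_hat, v_hat, lr=0.001, eps=1e-8, lambda_wd=0.01):
    """One AdamW parameter update with decoupled weight decay (illustrative)."""
    grad_step = lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step, no decay inside
    decay = lr * lambda_wd * p                       # decay applied to raw weights
    return p - grad_step - decay
```

With zero gradient moments, the update reduces to pure weight decay: a weight of 1.0 shrinks by lr · λ.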
Comparison with Other Optimizers
| Optimizer | Momentum | Adaptive LR | Memory | Use Case |
|---|---|---|---|---|
| SGD | Optional | No | 1x | Well-tuned vision models |
| RMSprop | No | Yes | 2x | RNNs |
| Adam | Yes | Yes | 3x | General purpose |
| AdamW | Yes | Yes | 3x | Transformers, LLMs |
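The memory column follows from the per-parameter state each optimizer stores; a back-of-the-envelope check (float32, illustrative model size):

```python
# Rough optimizer memory footprint in float32 (illustrative 1M-parameter model).
n_params = 1_000_000
bytes_per_param = 4
sgd_state = n_params * bytes_per_param        # weights only: 1x
adam_state = 3 * n_params * bytes_per_param   # weights + m + v: 3x
```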
When NOT to Use Adam
- ImageNet training: SGD with momentum often generalizes better
- Memory-constrained settings: Adam needs 3x the memory of plain SGD
- Small datasets: Can overfit more than SGD
Learning Rate Scheduling
Even with Adam, learning rate schedules help:
- Warmup: Start low, increase gradually (critical for transformers)
- Cosine decay: Smooth decrease to zero
- Step decay: Discrete reductions at milestones
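Warmup and cosine decay are commonly combined into one schedule (a sketch; the step counts and peak LR here are illustrative, not from the text):

```python
import math

def lr_schedule(step, warmup_steps=1000, total_steps=10000, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to zero (a common transformer recipe)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The learning rate rises linearly to its peak at `warmup_steps`, then decays smoothly to zero at `total_steps`.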
Historical Impact
Adam’s impact:
- Became the default optimizer for most deep learning
- Enabled training without extensive hyperparameter search
- Made research more accessible (less tuning expertise needed)
- Foundation for modern optimizer development
Key Papers
- Adam: A Method for Stochastic Optimization – Kingma & Ba, 2014. https://arxiv.org/abs/1412.6980
- Decoupled Weight Decay Regularization (AdamW) – Loshchilov & Hutter, 2017. https://arxiv.org/abs/1711.05101
- On the Variance of the Adaptive Learning Rate and Beyond (RAdam) – Liu et al., 2019. https://arxiv.org/abs/1908.03265
- Symbolic Discovery of Optimization Algorithms (Lion) – Chen et al., 2023. https://arxiv.org/abs/2302.06675