Adam Optimizer

Adaptive learning rates with momentum for deep learning

Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2014, is the most widely used optimizer in deep learning. It combines the benefits of momentum and adaptive learning rates, requiring minimal tuning while working well across diverse problems.

The Problem with Vanilla SGD

Vanilla stochastic gradient descent (SGD) faces several challenges:

  • Same learning rate for all parameters (bad for sparse features)
  • Struggles with saddle points and ravines
  • Requires careful learning rate scheduling
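To make the first problem concrete, here is a minimal sketch of a vanilla SGD step (the function name and the numbers are illustrative, not from any particular library):

```python
def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD: one shared learning rate for every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]

# A rarely-updated (sparse) feature and a frequently-updated one
# receive steps scaled only by their current gradient, with no
# memory of gradient history.
result = sgd_step([1.0, 1.0], [0.001, 10.0])  # tiny vs. huge gradient
print(result)
```

The parameter with the tiny gradient barely moves, while the one with the huge gradient takes a large step; Adam's per-parameter scaling addresses exactly this.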

Adam’s Solution

Adam maintains two moving averages for each parameter:

  1. First moment (momentum): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
  2. Second moment (adaptive LR): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Then updates weights using:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Bias Correction

Early in training, $m_t$ and $v_t$ are biased toward zero. Adam corrects this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

This correction is crucial for proper early training.
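A quick numeric check of why this matters, using the default decay rates and an arbitrary first gradient of 1.0: after one step the raw moments are far too small, but dividing by $1 - \beta^t$ restores the correct scale.

```python
beta1, beta2 = 0.9, 0.999
g = 1.0  # the first gradient ever seen (illustrative value)

# Raw moments after one step start near zero...
m1 = (1 - beta1) * g      # 0.1
v1 = (1 - beta2) * g**2   # 0.001

# ...but dividing by (1 - beta^t) with t = 1 recovers the true scale.
m1_hat = m1 / (1 - beta1**1)  # back to 1.0
v1_hat = v1 / (1 - beta2**1)  # back to 1.0
print(m1_hat, v1_hat)
```

Without the correction, the first update would use the ratio 0.1 / √0.001 ≈ 3.16 instead of 1.0, taking an inflated early step.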

Interactive Demo

Compare Adam with other optimizers on a loss landscape:

[Interactive widget: step-by-step comparison of SGD, Momentum, and Adam descending a loss landscape]

The Complete Algorithm

from math import sqrt

def adam(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step over lists of scalars; updates params, m, v in place."""
    t += 1
    for i, g in enumerate(grads):
        # Update biased moment estimates (indexed so the state persists)
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g**2

        # Bias correction
        m_hat = m[i] / (1 - beta1**t)
        v_hat = v[i] / (1 - beta2**t)

        # Update parameters
        params[i] -= lr * m_hat / (sqrt(v_hat) + eps)

    return t
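As a sanity check, the same update rule can be run inline on a single parameter to minimize a toy objective f(x) = (x − 3)². The learning rate and step count here are arbitrary choices for the demo:

```python
from math import sqrt

# Minimize f(x) = (x - 3)^2 with a scalar Adam loop.
x, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = 2 * (x - 3)                    # gradient of (x - 3)^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x -= lr * m_hat / (sqrt(v_hat) + eps)

print(x)  # converges close to the minimum at 3.0
```

Note the early behavior: because m̂/√v̂ is roughly ±1 when gradients are consistent, the initial steps are about lr in size regardless of how large the gradient is.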

Default Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| $\alpha$ (lr) | 0.001 | Step size |
| $\beta_1$ | 0.9 | First moment decay |
| $\beta_2$ | 0.999 | Second moment decay |
| $\epsilon$ | 1e-8 | Numerical stability |

These defaults work remarkably well across many problems.

Why Adam Works

Momentum ($\beta_1$)

  • Accumulates gradient direction over time
  • Smooths out noisy gradients
  • Helps escape shallow local minima
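The smoothing effect is easy to see numerically. In this sketch (noise level and seed are arbitrary), a constant true gradient of 1.0 is observed through heavy Gaussian noise, standing in for minibatch noise, and the first moment averages it out:

```python
import random
from statistics import stdev

random.seed(0)
beta1, m = 0.9, 0.0
raw, smoothed = [], []

# True gradient is 1.0, observed with heavy noise; the first moment
# is an exponential moving average over roughly 1/(1-beta1) = 10 steps.
for _ in range(500):
    g = 1.0 + random.gauss(0, 5)
    m = beta1 * m + (1 - beta1) * g
    raw.append(g)
    smoothed.append(m)

print(stdev(raw[100:]), stdev(smoothed[100:]))
# the smoothed estimate is far less noisy than the raw gradients
```

With β₁ = 0.9 the variance of the moving average is reduced by roughly (1 − β₁)/(1 + β₁) ≈ 1/19 relative to the raw gradient noise.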

Adaptive Learning Rate ($\beta_2$)

  • Parameters with large gradients get smaller steps
  • Parameters with small gradients get larger steps
  • No manual learning rate scheduling needed
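The net effect: for a parameter whose gradient is roughly constant, the bias-corrected ratio m̂/√v̂ approaches sign(g), so the step size settles near ±α no matter how large or small the gradient is. A small sketch (helper name is illustrative):

```python
from math import sqrt

def adam_step_size(g, steps=100, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Effective per-step Adam update for a constant gradient g."""
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**steps)   # for constant g this equals g exactly
    v_hat = v / (1 - beta2**steps)   # ...and this equals g^2
    return lr * m_hat / (sqrt(v_hat) + eps)

# Whether the gradient is tiny or huge, the step settles near lr:
print(adam_step_size(0.001), adam_step_size(1000.0))
```

This scale invariance is why Adam handles parameters with wildly different gradient magnitudes without per-parameter tuning.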

Adam Variants

| Variant | Improvement |
|---|---|
| AdamW | Decoupled weight decay (better generalization) |
| AMSGrad | Non-increasing step sizes (convergence fix) |
| RAdam | Rectified Adam (variance correction) |
| AdaFactor | Memory-efficient for large models |
| LAMB | Layer-wise adaptive for large batch training |
| Lion | Simplified, sign-based updates |

AdamW: The Modern Default

In standard Adam, L2 regularization is added to the gradient and then rescaled by the adaptive term, so parameters with a large gradient history are regularized less than intended. AdamW decouples weight decay from the gradient update:

$$\theta_{t+1} = \theta_t - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

AdamW is now the default for transformers and large language models.
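A minimal scalar sketch of the decoupled update above (the function name and the decay value λ = 0.01 are illustrative):

```python
from math import sqrt

def adamw_step(p, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step for a scalar parameter: weight decay is applied
    directly to the parameter, NOT added to the gradient (which Adam
    would then rescale by 1/sqrt(v_hat))."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    p = p - lr * (m_hat / (sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```

The design point is the `wd * p` term sitting outside the adaptive ratio: every weight decays at the same rate λ, independent of its gradient history.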

Comparison with Other Optimizers

| Optimizer | Momentum | Adaptive LR | Memory | Use Case |
|---|---|---|---|---|
| SGD | Optional | No | 1x | Well-tuned vision models |
| RMSprop | No | Yes | 2x | RNNs |
| Adam | Yes | Yes | 3x | General purpose |
| AdamW | Yes | Yes | 3x | Transformers, LLMs |

When NOT to Use Adam

  1. ImageNet training: SGD with momentum often generalizes better
  2. Memory-constrained: Adam needs 3x the memory of plain SGD (weights plus two moment buffers)
  3. Small datasets: Can overfit more than SGD

Learning Rate Scheduling

Even with Adam, learning rate schedules help:

  • Warmup: Start low, increase gradually (critical for transformers)
  • Cosine decay: Smooth decrease to zero
  • Step decay: Discrete reductions at milestones
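The first two bullets are often combined into one schedule. A sketch of linear warmup followed by cosine decay (the constants are illustrative, not a recommendation):

```python
import math

def lr_schedule(step, total_steps, base_lr=0.001, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_schedule(0, 1000))    # warming up: small
print(lr_schedule(100, 1000))  # warmup done: full base_lr
print(lr_schedule(999, 1000))  # near the end: close to zero
```

The schedule multiplies Adam's base α; the adaptive per-parameter scaling still happens on top of it.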

Historical Impact

Adam’s impact:

  • Became the default optimizer for most deep learning
  • Enabled training without extensive hyperparameter search
  • Made research more accessible (less tuning expertise needed)
  • Foundation for modern optimizer development

Key Papers

  • Kingma & Ba (2014), “Adam: A Method for Stochastic Optimization”
  • Loshchilov & Hutter (2019), “Decoupled Weight Decay Regularization” (AdamW)
  • Reddi et al. (2018), “On the Convergence of Adam and Beyond” (AMSGrad)
