Adaptive learning rates with momentum for deep learning
Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2014, is one of the most widely used optimizers in deep learning. It combines momentum with per-parameter adaptive learning rates, requiring minimal tuning while working well across diverse problems.
The Problem with Vanilla SGD
Stochastic Gradient Descent faces challenges:
- Same learning rate for all parameters (bad for sparse features)
- Struggles with saddle points and ravines
- Requires careful learning rate scheduling
Adam’s Solution
Adam maintains two moving averages for each parameter:
- First moment (momentum): m = β₁m + (1-β₁)∇L
- Second moment (adaptive LR): v = β₂v + (1-β₂)∇L²

Then updates weights using:

θ = θ - α · m̂ / (√v̂ + ε)

Bias Correction
Because m and v are initialized at zero, both estimates are biased toward zero early in training. Adam corrects this:

m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)

This correction is crucial for proper early training.
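Plugging t = 1 into the corrections shows the effect concretely: without them, a constant gradient of 1.0 would look 10x smaller in the first moment (a minimal numeric sketch):

```python
# Illustrative numbers: first step (t = 1) with a constant gradient of 1.0.
beta1, beta2 = 0.9, 0.999
g = 1.0
m = beta1 * 0.0 + (1 - beta1) * g      # m starts at 0, so m = 0.1
v = beta2 * 0.0 + (1 - beta2) * g**2   # v starts at 0, so v = 0.001
t = 1
m_hat = m / (1 - beta1**t)             # rescales 0.1 back to 1.0, the true gradient
v_hat = v / (1 - beta2**t)             # rescales 0.001 back to 1.0
```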
The Complete Algorithm
```python
import numpy as np

def adam(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. params, m, v are lists of NumPy arrays, updated in place."""
    t += 1
    for i, (p, g) in enumerate(zip(params, grads)):
        # Update biased first and second moment estimates
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g**2
        # Bias-correct the moments
        m_hat = m[i] / (1 - beta1**t)
        v_hat = v[i] / (1 - beta2**t)
        # Update parameters in place so the caller sees the change
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return t
```
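As a quick sanity check, the same update rule drives a toy 1-D quadratic toward its minimum (a standalone sketch; the function, starting point, and step count are illustrative):

```python
import math

# Adam minimizing f(theta) = theta**2, starting from theta = 5.
theta, m, v = 5.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = 2.0 * theta                        # gradient of theta**2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

After a few hundred steps theta sits close to the minimum at 0, despite the fixed learning rate.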
Default Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| α (lr) | 0.001 | Step size |
| β₁ | 0.9 | First moment decay |
| β₂ | 0.999 | Second moment decay |
| ε | 1e-8 | Numerical stability |
These defaults work remarkably well across many problems.
Why Adam Works
Momentum (m)
- Accumulates gradient direction over time
- Smooths out noisy gradients
- Helps escape shallow local minima
Adaptive Learning Rate (v)
- Parameters with large gradients get smaller steps
- Parameters with small gradients get larger steps
- No manual learning rate scheduling needed
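This scale invariance is easy to check numerically: for a constant gradient, m̂ ≈ g and v̂ ≈ g² in steady state, so the step collapses to roughly lr · sign(g) (a sketch; the 100,000x gradient gap is an arbitrary illustration):

```python
import numpy as np

# Hypothetical steady state: m_hat ≈ g and v_hat ≈ g**2 for constant gradients.
g = np.array([100.0, 0.001])            # two parameters, 100,000x gradient gap
lr, eps = 0.001, 1e-8
step = lr * g / (np.sqrt(g**2) + eps)   # effective Adam step per parameter
# Both parameters take (almost) the same-sized step of ~lr.
```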
Adam Variants
| Variant | Improvement |
|---|---|
| AdamW | Decoupled weight decay (better generalization) |
| AMSGrad | Non-increasing step sizes (convergence fix) |
| RAdam | Rectified Adam (variance correction) |
| AdaFactor | Memory-efficient for large models |
| LAMB | Layer-wise adaptive for large batch training |
| Lion | Simplified, sign-based updates |
AdamW: The Modern Default
Standard Adam’s L2 regularization is coupled with the adaptive gradient scaling, so parameters with large historical gradients are regularized less than intended. AdamW decouples weight decay from the gradient step (λ is the weight decay coefficient):

θ = θ - α · m̂ / (√v̂ + ε) - α · λ · θ
AdamW is now the default for transformers and large language models.
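The decoupling amounts to applying the decay term directly to the raw weights, outside the adaptive step (a hypothetical single-step sketch; `adamw_update` and its defaults are illustrative):

```python
import numpy as np

def adamw_update(p, m_hat, v_hat, lr=0.001, eps=1e-8, lambda_wd=0.01):
    """One AdamW parameter update with decoupled weight decay (illustrative)."""
    grad_step = lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step, no decay inside
    decay = lr * lambda_wd * p                       # decay applied to raw weights
    return p - grad_step - decay
```

With zero gradient moments, the update reduces to pure weight decay: a weight of 1.0 shrinks by lr · λ.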
Comparison with Other Optimizers
| Optimizer | Momentum | Adaptive LR | Memory | Use Case |
|---|---|---|---|---|
| SGD | Optional | No | 1x | Well-tuned vision models |
| RMSprop | No | Yes | 2x | RNNs |
| Adam | Yes | Yes | 3x | General purpose |
| AdamW | Yes | Yes | 3x | Transformers, LLMs |
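The memory column follows from the per-parameter state each optimizer stores; a back-of-the-envelope check (float32, illustrative model size):

```python
# Rough optimizer memory footprint in float32 (illustrative 1M-parameter model).
n_params = 1_000_000
bytes_per_param = 4
sgd_state = n_params * bytes_per_param        # weights only: 1x
adam_state = 3 * n_params * bytes_per_param   # weights + m + v: 3x
```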
When NOT to Use Adam
- ImageNet training: SGD with momentum often generalizes better
- Memory-constrained settings: Adam needs 3x the memory of plain SGD
- Small datasets: Can overfit more than SGD
Learning Rate Scheduling
Even with Adam, learning rate schedules help:
- Warmup: Start low, increase gradually (critical for transformers)
- Cosine decay: Smooth decrease to zero
- Step decay: Discrete reductions at milestones
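Warmup and cosine decay are commonly combined into one schedule (a sketch; the step counts and peak LR here are illustrative, not from the text):

```python
import math

def lr_schedule(step, warmup_steps=1000, total_steps=10000, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to zero (a common transformer recipe)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The learning rate rises linearly to its peak at `warmup_steps`, then decays smoothly to zero at `total_steps`.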
Historical Impact
Adam’s impact:
- Became the default optimizer for most deep learning
- Enabled training without extensive hyperparameter search
- Made research more accessible (less tuning expertise needed)
- Foundation for modern optimizer development
Key Papers
- Adam: A Method for Stochastic Optimization – Kingma & Ba, 2014. https://arxiv.org/abs/1412.6980
- Decoupled Weight Decay Regularization (AdamW) – Loshchilov & Hutter, 2017. https://arxiv.org/abs/1711.05101
- On the Variance of the Adaptive Learning Rate and Beyond (RAdam) – Liu et al., 2019. https://arxiv.org/abs/1908.03265
- Symbolic Discovery of Optimization Algorithms (Lion) – Chen et al., 2023. https://arxiv.org/abs/2302.06675