Backpropagation

The algorithm that enables neural networks to learn by computing gradients efficiently

Backpropagation is the algorithm that makes deep learning possible. It efficiently computes gradients of the loss with respect to every parameter in a neural network, enabling gradient-based optimization.

The Core Problem

To train a neural network, we need:

$$\frac{\partial L}{\partial w_i} \quad \text{for every weight } w_i$$

Naively computing each gradient separately, e.g. by finite differences, requires a full forward pass per parameter. Backpropagation computes every gradient in a single backward pass, at roughly the cost of one forward pass.
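To see the cost of the naive approach, here is a toy finite-difference gradient check (the function `f` is a hypothetical stand-in for a full network forward pass, not anything from a real framework):

```python
import numpy as np

calls = 0

def f(w):
    """Stand-in for a full network forward pass (hypothetical toy loss)."""
    global calls
    calls += 1
    return np.sum(w ** 2)

w = np.arange(4.0)
eps = 1e-6
grad = np.zeros_like(w)
for i in range(w.size):
    d = np.zeros_like(w)
    d[i] = eps
    grad[i] = (f(w + d) - f(w - d)) / (2 * eps)   # central difference
# 2 forward passes per parameter: O(n) full passes for n parameters
```

For `n` parameters this costs `2n` forward passes, while backpropagation delivers all `n` gradients in one backward pass.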

The Chain Rule

Backpropagation is just the chain rule applied systematically:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

If $y = f(x)$ and $L = g(y)$, we can compute $\frac{\partial L}{\partial x}$ from $\frac{\partial L}{\partial y}$ and the local gradient $\frac{\partial y}{\partial x}$.
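A tiny worked instance of this rule, with hypothetical functions $f(x) = x^2$ and $g(y) = 3y$ chosen for illustration:

```python
# Chain rule on a tiny composition: y = f(x) = x**2, L = g(y) = 3*y
x = 2.0
y = x ** 2               # forward: y = 4.0
L = 3 * y                # forward: L = 12.0

dL_dy = 3.0              # upstream gradient dL/dy (g is linear)
dy_dx = 2 * x            # local gradient dy/dx
dL_dx = dL_dy * dy_dx    # chain rule: 3 * 4 = 12.0
```

Backpropagation applies exactly this multiplication at every node of the network, reusing the upstream gradient instead of recomputing it.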

Forward and Backward Pass

Forward Pass

Compute outputs layer by layer:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = \sigma(z^{(l)})$$

Store activations for the backward pass.
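The recurrence above can be sketched in NumPy. This is a minimal illustration assuming a sigmoid activation; the names `forward`, `weights`, and `biases` are ours, not from any framework:

```python
import numpy as np

def sigma(z):
    """Sigmoid activation (one possible choice for the sigma in the text)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Run the forward pass, storing (z, a) per layer for the backward pass."""
    a = x
    cache = [(None, a)]               # input layer has no pre-activation z
    for W, b in zip(weights, biases):
        z = W @ a + b                 # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigma(z)                  # a^(l) = sigma(z^(l))
        cache.append((z, a))
    return a, cache
```

The `cache` is exactly the "store activations" step: every $z^{(l)}$ and $a^{(l)}$ is kept because the backward pass needs them.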

Backward Pass

Propagate gradients from output to input:

$$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$$

Then compute weight gradients:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^T$$
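The two backward-pass equations can be sketched together. This assumes the `cache` layout from a forward pass (`cache[0] = (None, x)`, `cache[l] = (z, a)` per layer) and a sigmoid activation; it is an illustration, not a framework API:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)              # derivative of the sigmoid

def backward(weights, cache, delta_out):
    """Propagate deltas from output to input; return dL/dW per layer.

    delta_out is dL/dz at the final layer.
    """
    grads = []
    delta = delta_out
    for l in range(len(weights) - 1, -1, -1):
        _, a_prev = cache[l]                      # a^(l-1)
        grads.append(delta @ a_prev.T)            # dL/dW^(l) = delta^(l) (a^(l-1))^T
        if l > 0:
            z_prev = cache[l][0]                  # z^(l-1)
            # delta^(l-1) = (W^(l)^T delta^(l)) ⊙ sigma'(z^(l-1))
            delta = (weights[l].T @ delta) * sigma_prime(z_prev)
    return grads[::-1]                            # forward order
```

Note how each layer's gradient reuses the stored activation $a^{(l-1)}$ and the already-propagated $\delta^{(l)}$; nothing is recomputed.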

Interactive Visualization

[Interactive demo: "Backpropagation Flow", stepping through a network (Input, Hidden 1, Hidden 2, Output). Forward pass: compute activations layer by layer, $a = \sigma(Wa + b)$. Backward pass: propagate gradients, $\delta = (W^T \delta) \odot \sigma'(z)$.]

Computational Graph View

Modern frameworks represent computation as a directed acyclic graph:

  1. Forward: Traverse graph, compute outputs
  2. Backward: Traverse in reverse, accumulate gradients

Each node stores:

  • Forward function: $y = f(x_1, \ldots, x_n)$
  • Backward function: $\frac{\partial L}{\partial x_i}$ given $\frac{\partial L}{\partial y}$
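The node structure above can be sketched as a minimal reverse-mode autodiff class (in the spirit of micrograd-style toy implementations; the `Node` class and its methods are our illustration, not a real framework's API):

```python
class Node:
    """A value in the computational graph, with a backward function."""
    def __init__(self, value, parents=(), backward_fn=lambda g: ()):
        self.value = value
        self.parents = parents          # input Nodes x_1..x_n
        self.backward_fn = backward_fn  # maps dL/dy to (dL/dx_1, ..., dL/dx_n)
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    lambda g: (g * other.value, g * self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other),
                    lambda g: (g, g))

    def backward(self):
        """Traverse the graph in reverse topological order, accumulating grads."""
        topo, seen = [], set()
        def build(n):
            if id(n) not in seen:
                seen.add(id(n))
                for p in n.parents:
                    build(p)
                topo.append(n)
        build(self)
        self.grad = 1.0                 # dL/dL = 1 at the output
        for n in reversed(topo):
            for p, g in zip(n.parents, n.backward_fn(n.grad)):
                p.grad += g             # accumulate into each input
```

For example, with `x = Node(2.0)` and `y = x * x + x`, calling `y.backward()` leaves `x.grad == 5.0`, matching $\frac{d}{dx}(x^2 + x) = 2x + 1$.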

Common Layer Gradients

| Layer | Forward | Backward |
| --- | --- | --- |
| Linear | $y = Wx$ | $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} x^T$ |
| ReLU | $y = \max(0, x)$ | $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \mathbf{1}_{x > 0}$ |
| Softmax + CE | $L = -\log p_c$ | $\frac{\partial L}{\partial z} = p - y$ |
| BatchNorm | $y = \gamma \hat{x} + \beta$ | (complex; involves batch statistics) |
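The Softmax + cross-entropy row is worth checking numerically, since the clean form $p - y$ is easy to mistrust. A small finite-difference check (logit values are illustrative):

```python
import numpy as np

z = np.array([1.0, 2.0, 0.5])       # logits
y = np.array([0.0, 1.0, 0.0])       # one-hot target, true class c = 1

def loss(z):
    """L = -log p_c with a numerically stable softmax."""
    p = np.exp(z - z.max())
    p /= p.sum()
    return -np.log(p[1])

p = np.exp(z - z.max())
p /= p.sum()
analytic = p - y                    # gradient claimed in the table

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    d = np.zeros_like(z)
    d[i] = eps
    numeric[i] = (loss(z + d) - loss(z - d)) / (2 * eps)
```

The two gradients agree to numerical precision, confirming the table entry.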

Vanishing/Exploding Gradients

Deep networks can suffer from:

$$\frac{\partial L}{\partial w^{(1)}} = \prod_{l=2}^{L} \frac{\partial z^{(l)}}{\partial z^{(l-1)}} \cdot \frac{\partial L}{\partial z^{(L)}}$$

If the factors are less than 1, the product shrinks exponentially with depth: vanishing gradients, and early layers stop learning. If the factors are greater than 1, the product grows exponentially: exploding gradients and unstable training.

Solutions: ReLU, residual connections, careful initialization, normalization.
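The exponential behavior of the product is easy to feel numerically; a toy calculation assuming 50 layers with a constant per-layer factor:

```python
# Fifty layers whose per-layer factor is slightly away from 1
depth = 50
vanish = 0.9 ** depth    # product collapses toward 0 (≈ 0.005)
explode = 1.1 ** depth   # product blows up (≈ 117)
```

Even factors this close to 1 change the gradient magnitude by two orders of magnitude or more across 50 layers, which is why the mitigations above matter.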

Automatic Differentiation

Modern frameworks (PyTorch, JAX) implement backprop automatically:

```python
# Reset gradients accumulated from the previous step
optimizer.zero_grad()

# Forward
y = model(x)
loss = criterion(y, target)

# Backward (computes all gradients)
loss.backward()

# Update
optimizer.step()
```

Why Backprop Matters

Backpropagation is:

  • Efficient: $O(n)$ gradient computation for $n$ parameters
  • General: Works for any differentiable computation graph
  • Foundational: Enables all modern deep learning