GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism

Training giant neural networks by pipelining micro-batches across devices

GPipe enables training of giant neural networks that don’t fit on a single accelerator by partitioning the model across devices and pipelining micro-batches to maximize utilization.

The Problem

Large models exceed single-device memory. Naive model parallelism (one layer per device) is inefficient:

$$\text{Utilization} = \frac{1}{K}$$

where $K$ is the number of devices: at any moment only one device computes while the others sit idle waiting for it. With $K = 4$, for example, each device is busy just 25% of the time.

The GPipe Solution

  1. Partition the model into $K$ consecutive stages across $K$ devices
  2. Split each mini-batch into $M$ micro-batches
  3. Pipeline the micro-batches through the stages (sketched below)
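
A minimal sketch of these three steps, assuming PyTorch and a toy two-stage model; the layer sizes, stage split, and micro-batch count are illustrative, not from the paper, and the loop runs the stages sequentially rather than on separate devices, just to show the data flow:

```python
import torch
import torch.nn as nn

M = 4  # micro-batches per mini-batch

# 1. Partition the model into K = 2 consecutive stages (one per device in a real setup).
stages = [
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 10)),
]

# 2. Split the mini-batch into M micro-batches along the batch dimension.
mini_batch = torch.randn(16, 32)
micro_batches = mini_batch.chunk(M, dim=0)

# 3. Pipeline the micro-batches through the stages. (Shown sequentially for
#    clarity; a real pipeline overlaps work so stage k processes micro-batch m
#    while stage k-1 processes micro-batch m+1 on a different device.)
outputs = []
for mb in micro_batches:
    x = mb
    for stage in stages:
        x = stage(x)
    outputs.append(x)

result = torch.cat(outputs, dim=0)  # equals a single forward pass over the full mini-batch
```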

Pipeline Schedule

Forward passes flow left-to-right, backward passes right-to-left:

$$\text{Time} = (K - 1) + 2M + (K - 1)$$

The $2M$ term counts $M$ forward and $M$ backward micro-batch steps; the $(K-1)$ terms are “pipeline bubbles” while the pipeline fills and drains.
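
A toy calculation of this schedule in plain Python, assuming one time step per micro-batch per stage and equal forward/backward cost; the helper names `schedule_length` and `forward_schedule` are made up for illustration:

```python
def schedule_length(K: int, M: int) -> int:
    """Total time steps: (K-1) fill bubble + M forward + M backward + (K-1) drain bubble."""
    return (K - 1) + 2 * M + (K - 1)

def forward_schedule(K: int, M: int):
    """Which micro-batch each stage runs at each forward step (None = idle bubble)."""
    steps = M + K - 1
    return [[t - k if 0 <= t - k < M else None for t in range(steps)]
            for k in range(K)]

print(schedule_length(K=4, M=8))        # 22 time steps in total
for row in forward_schedule(K=4, M=8):  # staircase pattern: stage k lags stage 0 by k steps
    print(row)
```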

Interactive Demo

Watch micro-batches flow through a 4-device pipeline:

[Interactive animation: a GPipe pipeline schedule showing GPUs 0–3 over 20 time steps, with each cell marked Forward, Backward, or Bubble (idle).]

  • Micro-batch splitting: the mini-batch is split into $M$ micro-batches that are pipelined across devices to hide latency.
  • Bubble overhead: bubbles take roughly $(K-1)/M$ of the total time, so more micro-batches mean less overhead.

Efficiency Analysis

Bubble fraction (wasted time):

$$\text{Bubble} = \frac{2(K-1)}{2(K-1) + 2M} = \frac{K-1}{K-1+M}$$

With $M \gg K$, bubbles become negligible. GPipe recommends $M \geq 4K$.
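
A quick check of the bubble fraction for a hypothetical $K = 8$ pipeline (the values below are illustrative, not results from the paper):

```python
def bubble_fraction(K: int, M: int) -> float:
    """Fraction of the schedule spent idle in pipeline bubbles."""
    return (K - 1) / (K - 1 + M)

K = 8
for M in (K, 2 * K, 4 * K, 8 * K):
    print(f"M = {M:3d}: bubble = {bubble_fraction(K, M):.1%}")
# M =   8: bubble = 46.7%
# M =  16: bubble = 30.4%
# M =  32: bubble = 17.9%   <- the recommended M >= 4K regime
# M =  64: bubble = 9.9%
```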

Memory Optimization

GPipe uses activation recomputation (gradient checkpointing):

  • Forward: Only store activations at partition boundaries
  • Backward: Recompute intermediate activations as needed

This trades compute for memory, enabling larger models.
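
For illustration, the same compute-for-memory trade-off can be expressed with PyTorch's generic gradient checkpointing utility; this is a sketch of the technique, not GPipe's internal implementation, and the stage and tensor shapes are made up:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# One pipeline stage; only its boundary input/output activations are kept.
stage = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
x = torch.randn(8, 256, requires_grad=True)

# Forward: intermediate activations inside the stage are not stored.
y = checkpoint(stage, x, use_reentrant=False)

# Backward: the stage's forward is re-run to recompute those activations
# before its gradients are calculated.
y.sum().backward()
```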

Synchronous Training

Despite pipelining, GPipe maintains synchronous SGD semantics:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{M}\sum_{m=1}^{M} \nabla_\theta \mathcal{L}_m$$

Gradients are accumulated across micro-batches before the update.
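
A minimal sketch of this update rule as plain gradient accumulation in PyTorch, without the pipeline; the toy model, optimizer, and data are made up so the snippet runs standalone:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

M = 4
inputs = torch.randn(16, 32)
targets = torch.randint(0, 10, (16,))

optimizer.zero_grad()
for mb_x, mb_y in zip(inputs.chunk(M), targets.chunk(M)):
    # The 1/M factor makes the accumulated gradient the average over micro-batches.
    loss = loss_fn(model(mb_x), mb_y) / M
    loss.backward()  # gradients accumulate in .grad across micro-batches
optimizer.step()     # one synchronous update, same semantics as a full mini-batch step
```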

Results

| Task | Model Size | Devices | Speedup |
| --- | --- | --- | --- |
| ImageNet | 557M params | 8 TPUs | 6.3× |
| Translation | 6B params | 2048 TPUs | Near-linear |

GPipe achieved 84.4% top-1 ImageNet accuracy with a 557M-parameter AmoebaNet.

Comparison with Data Parallelism

| Aspect | Data Parallel | GPipe |
| --- | --- | --- |
| Memory per device | Full model | ~1/K of the model |
| Communication | Gradient sync | Activation transfer |
| Batch size | Scales with K | Independent of K |

GPipe is complementary to data parallelism; the two approaches can be combined, as sketched below.
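
As an illustration of how the two combine, devices can be arranged in a grid of data-parallel replicas by pipeline stages; the replica and stage counts below are made up:

```python
# R data-parallel replicas, each running a K-stage GPipe pipeline.
R, K = 2, 4
devices = list(range(R * K))  # 8 devices in total

grid = [devices[r * K:(r + 1) * K] for r in range(R)]
for r, stage_devices in enumerate(grid):
    print(f"replica {r}: pipeline stages on devices {stage_devices}")
# replica 0: pipeline stages on devices [0, 1, 2, 3]
# replica 1: pipeline stages on devices [4, 5, 6, 7]
#
# Within a replica, micro-batches are pipelined through the K stages (GPipe);
# across replicas, gradients are all-reduced after each step (data parallelism).
```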

Key Paper

Huang, Y., et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism.” NeurIPS 2019. arXiv:1811.06965.