Training giant neural networks by pipelining micro-batches across devices
GPipe enables training of giant neural networks that don’t fit on a single accelerator by partitioning the model across devices and pipelining micro-batches to maximize utilization.
The Problem
Large models exceed single-device memory. Naive model parallelism (one layer or block per device) is inefficient: only one device is active at any moment, so utilization is at most

$$\text{utilization} = \frac{1}{K},$$

where $K$ is the number of devices. Most devices sit idle waiting for others.
The GPipe Solution
- Partition the model into consecutive stages across devices
- Split each mini-batch into micro-batches
- Pipeline micro-batches through the stages
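As a rough sketch of these three steps in PyTorch (the layer sizes, stage boundaries, device placement, and helper names are illustrative assumptions, not the GPipe library API):

```python
import torch
import torch.nn as nn

# Hypothetical 8-block model split into K = 4 consecutive stages.
# Device placement assumes 4 accelerators; falls back to CPU if none are available.
K = 4
devices = [torch.device(f"cuda:{i}") if torch.cuda.device_count() > i else torch.device("cpu")
           for i in range(K)]

blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
stages = [nn.Sequential(*blocks[2 * k: 2 * k + 2]).to(devices[k]) for k in range(K)]

def micro_batches(mini_batch: torch.Tensor, m: int):
    """Step 2: split the mini-batch into M micro-batches along the batch dimension."""
    return mini_batch.chunk(m, dim=0)

def pipeline_forward(mini_batch: torch.Tensor, m: int = 8) -> torch.Tensor:
    """Step 3 (simplified): push each micro-batch through the stages in turn.
    Real GPipe overlaps micro-batches so different stages work concurrently;
    the schedule in the next section shows that overlap."""
    outputs = []
    for mb in micro_batches(mini_batch, m):
        x = mb
        for k, stage in enumerate(stages):
            x = stage(x.to(devices[k]))   # boundary activation moves between devices
        outputs.append(x.cpu())
    return torch.cat(outputs, dim=0)

out = pipeline_forward(torch.randn(64, 1024))   # shape (64, 1024)
```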
Pipeline Schedule
Forward passes flow left-to-right through the stages, and backward passes flow right-to-left. Each device sits idle for $K - 1$ steps while the pipeline fills at the start of a mini-batch and drains at the end; these idle slots are the "pipeline bubbles."
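To make the schedule concrete, here is a toy scheduler that prints which micro-batch each device processes at each forward step (an illustration of the timing only, not GPipe's actual scheduler):

```python
def forward_schedule(num_devices: int, num_micro_batches: int):
    """Return grid[device][step]: the micro-batch index active on that device
    at that forward step, or None for a pipeline bubble."""
    steps = num_micro_batches + num_devices - 1
    grid = [[None] * steps for _ in range(num_devices)]
    for m in range(num_micro_batches):
        for k in range(num_devices):
            grid[k][m + k] = m  # micro-batch m reaches stage k at step m + k
    return grid

for k, row in enumerate(forward_schedule(num_devices=4, num_micro_batches=4)):
    cells = ["." if c is None else f"F{c}" for c in row]
    print(f"device {k}: " + " ".join(f"{c:>2}" for c in cells))
# device 0: F0 F1 F2 F3  .  .  .
# device 1:  . F0 F1 F2 F3  .  .
# device 2:  .  . F0 F1 F2 F3  .
# device 3:  .  .  . F0 F1 F2 F3
```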
Efficiency Analysis
Bubble fraction (wasted time):

$$\text{bubble fraction} = \frac{K - 1}{M + K - 1}$$

With $M \gg K$, the bubbles become negligible. GPipe recommends $M \ge 4K$.
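A quick sanity check of this formula for a few $(K, M)$ settings (ignoring communication and recomputation costs):

```python
def bubble_fraction(k: int, m: int) -> float:
    """Fraction of pipeline time spent idle: (K - 1) / (M + K - 1)."""
    return (k - 1) / (m + k - 1)

for k, m in [(4, 4), (4, 16), (8, 32), (8, 128)]:
    print(f"K={k:>2}, M={m:>3}: bubble = {bubble_fraction(k, m):.1%}")
# K= 4, M=  4: bubble = 42.9%
# K= 4, M= 16: bubble = 15.8%
# K= 8, M= 32: bubble = 17.9%
# K= 8, M=128: bubble = 5.2%
```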
Memory Optimization
GPipe uses activation recomputation (gradient checkpointing):
- Forward: Only store activations at partition boundaries
- Backward: Recompute intermediate activations as needed
This trades compute for memory, enabling larger models.
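A minimal sketch of per-stage recomputation using PyTorch's standard `torch.utils.checkpoint` (used here as a stand-in for GPipe's recomputation; the stage itself is a made-up example):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def stage_forward(stage: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run one pipeline stage under checkpointing: only the stage's input
    (the partition-boundary activation) is stored; activations inside the
    stage are recomputed during the backward pass."""
    return checkpoint(stage, x, use_reentrant=False)

stage = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(32, 1024, requires_grad=True)

y = stage_forward(stage, x)   # forward stores only x, not the hidden activation
y.sum().backward()            # hidden activation is recomputed here
```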
Synchronous Training
Despite pipelining, GPipe maintains synchronous SGD semantics:
Gradients are accumulated across micro-batches before the update.
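In code, this is ordinary gradient accumulation; the model, loss, and micro-batch split below are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def synchronous_step(inputs: torch.Tensor, targets: torch.Tensor, m: int = 8):
    """Accumulate gradients over M micro-batches, then apply one update.
    Dividing each micro-batch loss by M makes the accumulated gradient equal
    to the gradient of the full mini-batch loss (for equal-sized micro-batches
    and a mean-reduced loss), preserving synchronous SGD semantics."""
    optimizer.zero_grad()
    for x_mb, y_mb in zip(inputs.chunk(m), targets.chunk(m)):
        loss = loss_fn(model(x_mb), y_mb) / m
        loss.backward()      # gradients accumulate in .grad across micro-batches
    optimizer.step()         # single synchronous parameter update

synchronous_step(torch.randn(64, 1024), torch.randint(0, 10, (64,)))
```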
Results
| Task | Model Size | Devices | Speedup |
|---|---|---|---|
| ImageNet | 557M params | 8 TPUs | 6.3× |
| Translation | 6B params | 2048 TPUs | Near-linear |
GPipe achieved 84.4% ImageNet accuracy with a 557M parameter AmoebaNet.
Comparison with Data Parallelism
| Aspect | Data Parallel | GPipe |
|---|---|---|
| Memory per device | Full model | 1/K model |
| Communication | Gradient sync | Activation transfer |
| Batch size | Scales with K | Independent of K |
GPipe is complementary to data parallelism; the two approaches can be combined.
Key Paper
- [GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism](https://arxiv.org/abs/1811.06965) — Huang et al. (2019)