Scaling up end-to-end speech recognition with RNNs and CTC
Deep Speech 2 demonstrated that end-to-end deep learning could match or exceed traditional speech recognition systems. By scaling RNNs and using CTC loss, it achieved state-of-the-art results in both English and Mandarin.
End-to-End Architecture
The system maps a spectrogram of the audio directly to a character sequence: no phonemes, no pronunciation dictionaries, just raw audio to text.
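As a rough illustration of the front end, here is a minimal power-spectrogram extractor in Python/SciPy. The window and stride sizes, sample rate, and log compression are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np
from scipy import signal

def power_spectrogram(audio, sample_rate=16000, window_ms=20, stride_ms=10):
    """Compute a log power spectrogram (window/stride values are assumptions)."""
    nperseg = int(sample_rate * window_ms / 1000)
    noverlap = nperseg - int(sample_rate * stride_ms / 1000)
    freqs, times, spec = signal.spectrogram(
        audio, fs=sample_rate, window="hann",
        nperseg=nperseg, noverlap=noverlap, mode="psd")
    # Log compression keeps the dynamic range manageable for the network.
    return np.log(spec + 1e-10)  # shape: (freq_bins, time_frames)

# One second of random noise standing in for real audio.
audio = np.random.randn(16000).astype(np.float32)
features = power_spectrogram(audio)
print(features.shape)
```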
The Model
- Input: Power spectrogram of audio
- Convolution: 1-3 conv layers for feature extraction
- Recurrent: 5-7 bidirectional RNN layers (GRU or LSTM)
- Output: Softmax over characters + CTC blank token
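A minimal PyTorch sketch of this stack; the layer counts, kernel sizes, and hidden widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    """Sketch of a DS2-style stack: conv front end, bidirectional GRUs,
    and a per-frame softmax over characters plus the CTC blank."""
    def __init__(self, n_freq=161, n_chars=28, hidden=800, rnn_layers=5):
        super().__init__()
        # 2-D convolutions over (frequency, time); strides shorten the sequence.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.ReLU(),
        )
        conv_out = 32 * ((n_freq + 1) // 2)  # frequency axis halved by the first conv
        self.rnn = nn.GRU(conv_out, hidden, num_layers=rnn_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, spec):                    # spec: (batch, freq, time)
        x = self.conv(spec.unsqueeze(1))        # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)   # (batch, time', chars + blank)

model = DeepSpeech2Sketch()
print(model(torch.randn(2, 161, 300)).shape)
```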
CTC Loss
Connectionist Temporal Classification (CTC) allows training without frame-level alignments. The network emits a distribution over characters (plus a blank) at every frame, and the loss sums the probability of every frame-level path that collapses to the target transcript:

$$
p(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p_t(\pi_t \mid x)
$$

where $\mathcal{B}$ collapses repeated characters and removes blanks.
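A small sketch of CTC training with PyTorch's built-in `nn.CTCLoss`; the alphabet size, shapes, and blank index are illustrative assumptions.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)               # index 0 reserved for the blank

batch, frames, n_classes = 4, 200, 29        # 28 characters + blank (assumed)
log_probs = torch.randn(frames, batch, n_classes, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, n_classes, (batch, 30), dtype=torch.long)
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # no frame-level alignment needed

# Greedy decode for one utterance, then apply B: merge repeats, drop blanks.
best = log_probs.argmax(-1)[:, 0].tolist()
decoded = [c for i, c in enumerate(best) if c != 0 and (i == 0 or c != best[i - 1])]
```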
Key Innovations
Batch Normalization for RNNs
Standard batch normalization does not carry over directly to recurrent layers, so DS2 uses a sequence-wise variant: the normalization $B$ is applied only to the non-recurrent (input-to-hidden) term, with statistics computed over the minibatch and over all timesteps of each sequence:

$$
h_t^l = f\big(B(W^l h_t^{l-1}) + U^l h_{t-1}^l\big)
$$

This enabled training much deeper networks.
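One way to realize this in PyTorch is to normalize each layer's input over batch and time before the recurrence; this is a common simplification of the paper's sequence-wise scheme, and the sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BNGRULayer(nn.Module):
    """One RNN layer with sequence-wise batch norm on its (non-recurrent) input."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)
        self.gru = nn.GRU(input_size, hidden_size,
                          bidirectional=True, batch_first=True)

    def forward(self, x):                       # x: (batch, time, features)
        # BatchNorm1d expects (batch, features, time); statistics are shared
        # across all timesteps, i.e. "sequence-wise" normalization.
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(x)
        return out

layer = BNGRULayer(input_size=2592, hidden_size=800)
frames = torch.randn(8, 150, 2592)              # (batch, time, features)
print(layer(frames).shape)                       # (8, 150, 1600)
```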
SortaGrad
Curriculum learning strategy:
- First epoch: Sort utterances by length (shortest first)
- Subsequent epochs: Random order
This stabilizes early training with CTC.
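A minimal sketch of the scheduling logic, assuming each utterance is an (audio, transcript) pair so that audio length can serve as the sort key.

```python
import random

def sortagrad_batches(utterances, batch_size, epoch):
    """Yield batches: shortest utterances first in epoch 0, shuffled afterwards."""
    if epoch == 0:
        ordered = sorted(utterances, key=lambda u: len(u[0]))  # sort by audio length
    else:
        ordered = list(utterances)
        random.shuffle(ordered)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```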
Model Parallelism
Distributed training across multiple GPUs:
- Data parallelism for batches
- Model parallelism for large layers
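DS2 used its own synchronous all-reduce training system; purely as an illustration of the data-parallel half of this idea, here is a hedged sketch with PyTorch's DistributedDataParallel. The launch setup (one GPU per process via torchrun) and the batch structure are assumptions.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_data_parallel(model, dataset, epochs=1):
    """Data-parallel training sketch; batches yield (specs, targets, in_lens, tgt_lens)."""
    dist.init_process_group("nccl")              # assumes launch via torchrun
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank}")
    model = DDP(model.to(device), device_ids=[rank])
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    ctc = nn.CTCLoss(blank=0)
    optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for specs, targets, in_lens, tgt_lens in loader:
            log_probs = model(specs.to(device))          # (batch, time, classes)
            loss = ctc(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)
            optim.zero_grad()
            loss.backward()      # gradients are all-reduced across GPUs here
            optim.step()
```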
Results
| Language | Test Set | WER |
|---|---|---|
| English | WSJ eval92 | 3.60% |
| English | LibriSpeech clean | 5.33% |
| Mandarin | Internal test | 6.19% |
Achieved near-human performance on clean speech.
Scaling Laws
The paper demonstrated:
- More data → better results (up to 12,000 hours)
- Deeper networks → better results (up to 7 RNN layers)
- Both English and Mandarin benefit similarly from scale
Why This Matters
Deep Speech 2 showed that:
- End-to-end wins: No need for hand-designed pipelines
- Scale matters: More data and compute beat clever engineering
- Transfer works: Same architecture for different languages
Legacy
This work influenced:
- Modern speech systems (Whisper, Wav2Vec)
- End-to-end paradigm adoption
- Large-scale data collection for speech
Key Paper
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin — Amodei et al. (2015)
https://arxiv.org/abs/1512.02595