Deep Speech 2: End-to-End Speech Recognition

Scaling up end-to-end speech recognition with RNNs and CTC

Deep Speech 2 demonstrated that end-to-end deep learning could match or exceed traditional speech recognition systems. By scaling RNNs and using CTC loss, it achieved state-of-the-art results in both English and Mandarin.

End-to-End Architecture

The system maps spectrograms directly to text:

$$\text{Audio} \rightarrow \text{Spectrogram} \rightarrow \text{RNN} \rightarrow \text{CTC} \rightarrow \text{Text}$$

No phonemes, no pronunciation dictionaries—just raw audio to text.
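
For concreteness, here is a minimal sketch of computing the power-spectrogram input with `scipy.signal.spectrogram`; the 20 ms window and 10 ms stride are illustrative assumptions, not necessarily the paper's exact front-end settings:

```python
import numpy as np
from scipy.signal import spectrogram

def power_spectrogram(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Compute a log power spectrogram (freq x time) from a mono waveform."""
    nperseg = int(0.020 * sample_rate)              # 20 ms analysis window (assumed)
    noverlap = nperseg - int(0.010 * sample_rate)   # 10 ms hop (assumed)
    _, _, sxx = spectrogram(audio, fs=sample_rate,
                            nperseg=nperseg, noverlap=noverlap, mode="psd")
    return np.log1p(sxx)                            # log compression for stable inputs

features = power_spectrogram(np.zeros(16000))       # one second of silence
print(features.shape)                               # (freq_bins, time_steps)
```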

The Model

  1. Input: Power spectrogram of audio
  2. Convolution: 1-3 conv layers for feature extraction
  3. Recurrent: 5-7 bidirectional recurrent layers (simple RNN or GRU)
  4. Output: Softmax over characters + CTC blank token
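
A minimal PyTorch-style sketch of this layer stack follows; the layer sizes, kernel shapes, and choice of GRUs are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    """Conv front-end -> stacked bidirectional GRUs -> per-timestep character softmax."""

    def __init__(self, n_freq=161, n_hidden=512, n_rnn_layers=5, n_chars=29):
        super().__init__()
        # One 2D convolution over (frequency, time); stride 2 halves both axes.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        conv_out = 32 * ((n_freq + 1) // 2)          # features after frequency downsampling
        self.rnn = nn.GRU(conv_out, n_hidden, num_layers=n_rnn_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * n_hidden, n_chars)   # n_chars = alphabet + CTC blank

    def forward(self, spectrogram):
        # spectrogram: (batch, freq, time)
        x = self.conv(spectrogram.unsqueeze(1))      # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.rnn(x)                           # (batch, time', 2 * hidden)
        return self.fc(x).log_softmax(dim=-1)        # per-timestep log-probabilities

model = DeepSpeech2Sketch()
print(model(torch.randn(2, 161, 200)).shape)         # (2, 100, 29)
```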

CTC Loss

Connectionist Temporal Classification allows training without frame-level alignment:

$$P(y|x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi|x)$$

where $\mathcal{B}$ collapses repeated characters and removes blanks.
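
In practice this marginalization over alignments is handled by the loss implementation; a minimal sketch using PyTorch's `nn.CTCLoss`, with illustrative tensor shapes:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                      # index 0 reserved for the blank token

T, B, C = 100, 2, 29                           # time steps, batch size, characters
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, C, (B, 20))         # character indices (blank excluded)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

# Sums P(pi|x) over every alignment pi that collapses to the target transcript.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```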

Key Innovations

Batch Normalization for RNNs

Applied to recurrent layers (a non-trivial extension):

$$h_t = \text{BN}(W \cdot [h_{t-1}, x_t])$$

This enabled training much deeper networks.
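
The equation above shows normalization over the full projection; in the paper's sequence-wise variant, only the input-to-hidden (vertical) connections are normalized, with statistics taken over the batch and all time steps. A simplified sketch of that idea, normalizing each recurrent layer's input before the recurrence (sizes are illustrative):

```python
import torch
import torch.nn as nn

class BatchNormRNNLayer(nn.Module):
    """Bidirectional GRU layer with batch norm on its feed-forward input.

    Simplified stand-in for sequence-wise batch normalization: statistics are
    computed over batch and time for the vertical connections only, while the
    recurrent connections stay unnormalized.
    """

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)
        self.rnn = nn.GRU(input_size, hidden_size,
                          bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, time, features); BatchNorm1d expects (batch, features, time).
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.rnn(x)
        return out

layer = BatchNormRNNLayer(input_size=512, hidden_size=512)
print(layer(torch.randn(4, 100, 512)).shape)   # (4, 100, 1024)
```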

SortaGrad

Curriculum learning strategy:

  1. First epoch: Sort utterances by length (shortest first)
  2. Subsequent epochs: Random order

This stabilizes early training with CTC.
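
A minimal sketch of this schedule, assuming utterance lengths are known up front (the function and variable names are hypothetical):

```python
import random

def sortagrad_order(utterance_lengths, epoch, seed=0):
    """Return the training order for one epoch under SortaGrad.

    Epoch 0: shortest utterances first, so early CTC updates see short,
    well-conditioned examples. Later epochs: uniformly shuffled.
    """
    indices = list(range(len(utterance_lengths)))
    if epoch == 0:
        indices.sort(key=lambda i: utterance_lengths[i])
    else:
        random.Random(seed + epoch).shuffle(indices)
    return indices

lengths = [320, 80, 150, 600, 45]          # frames per utterance (example data)
print(sortagrad_order(lengths, epoch=0))   # [4, 1, 2, 0, 3] -> shortest first
print(sortagrad_order(lengths, epoch=1))   # shuffled
```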

Model Parallelism

Distributed training across multiple GPUs:

  • Data parallelism for batches
  • Model parallelism for large layers
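
For the data-parallel part, a minimal PyTorch sketch using `nn.DataParallel` (the paper describes synchronous SGD with its own GPU all-reduce; the toy model here is purely illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(161, 512), nn.ReLU(), nn.Linear(512, 29))

if torch.cuda.device_count() > 1:
    # Data parallelism: replicate the model, split each batch across GPUs,
    # and accumulate gradients back into the original parameters.
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(64, 161)
if torch.cuda.is_available():
    x = x.cuda()
print(model(x).shape)   # (64, 29)
```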

Results

| Language | Test Set | WER |
|----------|----------|-----|
| English | WSJ eval92 | 3.60% |
| English | LibriSpeech clean | 5.33% |
| Mandarin | Internal test | 6.19% |

Achieved near-human performance on clean speech.

Scaling Laws

The paper demonstrated:

  • More data → better results (up to 12,000 hours)
  • Deeper networks → better results (up to 7 RNN layers)
  • Both English and Mandarin benefit similarly from scale

Why This Matters

Deep Speech 2 showed that:

  1. End-to-end wins: No need for hand-designed pipelines
  2. Scale matters: More data and compute beat clever engineering
  3. Transfer works: Same architecture for different languages

Legacy

This work influenced:

  • Modern speech systems (Whisper, Wav2Vec)
  • End-to-end paradigm adoption
  • Large-scale data collection for speech

Key Paper

Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," arXiv:1512.02595, 2015.