Mamba: State Space Models

Linear-time sequence modeling as an efficient alternative to Transformers

Mamba is a state space model (SSM) architecture that achieves Transformer-quality performance with linear complexity in sequence length. It’s emerging as a compelling alternative for long-context applications.

The Transformer Bottleneck

Self-attention has quadratic complexity:

O(n^2 \cdot d)

For a 100k token sequence, this becomes prohibitively expensive. Mamba achieves:

O(n \cdot d)
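To make the gap concrete, here is a back-of-the-envelope operation count for a 100k-token sequence (the model dimension d is an illustrative value, not from the text):

```python
# Rough operation counts for attention vs. an SSM at long context.
n = 100_000   # sequence length
d = 1_024     # model dimension (assumed for illustration)

attention_ops = n * n * d   # O(n^2 * d): every token attends to every token
ssm_ops = n * d             # O(n * d): one fixed-cost state update per token

print(f"attention: {attention_ops:.2e} ops")
print(f"ssm:       {ssm_ops:.2e} ops")
print(f"ratio:     {attention_ops // ssm_ops}x")  # ratio equals n
```

The ratio is exactly n, so the advantage grows linearly with sequence length.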

State Space Models

SSMs model sequences through a continuous latent state:

h'(t) = A h(t) + B x(t), \quad y(t) = C h(t)

  • h(t) \in \mathbb{R}^N: Hidden state
  • A \in \mathbb{R}^{N \times N}: State matrix
  • B, C: Input/output projections

Discretization

For discrete sequences, we discretize with step size Δ\Delta:

\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B

This gives the recurrence:

h_t = \bar{A} h_{t-1} + \bar{B} x_t, \quad y_t = C h_t
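As a minimal sketch, the discretization and recurrence can be written directly in NumPy. It assumes a diagonal state matrix A (as Mamba uses in practice, so exp(ΔA) is elementwise) and toy sizes; the weights are random stand-ins for learned parameters:

```python
import numpy as np

N, T = 4, 10                          # state size and sequence length (toy values)
rng = np.random.default_rng(0)

A = -np.abs(rng.standard_normal(N))   # diagonal of A; negative for stability
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                           # step size

# Discretization: A_bar = exp(dA), B_bar = (dA)^-1 (exp(dA) - I) dB,
# computed elementwise because A is diagonal.
dA = delta * A
A_bar = np.exp(dA)
B_bar = (A_bar - 1.0) / dA * (delta * B)

x = rng.standard_normal(T)            # one scalar input channel
h = np.zeros(N)
ys = []
for t in range(T):
    h = A_bar * h + B_bar * x[t]      # h_t = A_bar h_{t-1} + B_bar x_t
    ys.append(C @ h)                  # y_t = C h_t
y = np.array(ys)
print(y.shape)  # (10,)
```

Because the state h has fixed size N, each step costs the same regardless of how long the sequence is.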

The Selective State Space

Mamba’s key innovation: input-dependent parameters:

B_t = f_B(x_t), \quad C_t = f_C(x_t), \quad \Delta_t = f_\Delta(x_t)

This allows the model to selectively propagate or forget information based on content—similar to gating in LSTMs.
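A sketch of this selection mechanism, with several simplifying assumptions: A is diagonal, the projections are random stand-ins for learned weights, Δ_t passes through a softplus to stay positive, and B is discretized with the simplified rule B̄ ≈ Δ_t B_t:

```python
import numpy as np

N, D, T = 4, 8, 16                    # state size, channels, length (toy values)
rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(N))       # diagonal of the state matrix
W_B = 0.1 * rng.standard_normal((N, D))   # stand-in for f_B
W_C = 0.1 * rng.standard_normal((N, D))   # stand-in for f_C
w_d = 0.1 * rng.standard_normal(D)        # stand-in for f_Delta

x = rng.standard_normal((T, D))
h = np.zeros((D, N))                  # one state vector per channel
ys = []
for t in range(T):
    B_t = W_B @ x[t]                          # B_t = f_B(x_t)
    C_t = W_C @ x[t]                          # C_t = f_C(x_t)
    delta_t = np.log1p(np.exp(w_d @ x[t]))    # Delta_t = softplus(f_Delta(x_t)) > 0
    A_bar = np.exp(delta_t * A)               # content-dependent decay: small Delta_t
    h = A_bar * h + np.outer(x[t], delta_t * B_t)  # retains state, large Delta_t overwrites
    ys.append(h @ C_t)                        # y_t = C_t h_t, shape (D,)
y = np.stack(ys)
print(y.shape)  # (16, 8)
```

The content-dependent Δ_t is what plays the gating role: it interpolates per step between keeping the old state and absorbing the new input.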

Complexity Comparison

Comparing Mamba's linear scaling with the Transformer's quadratic growth: at n = 1,000 tokens, attention performs on the order of 1,000,000 pairwise interactions while Mamba performs about 1,000 state updates, a 1000× gap that widens with sequence length. Transformers with quadratic attention are typically run at ~2k context; Mamba's linear recurrence makes 1M+ token contexts possible.

Key insight: State space models process sequences with O(n) complexity by maintaining a fixed-size hidden state, while achieving comparable quality through selective gating.

Architecture

A Mamba block contains:

  1. Linear projection: Expand input dimension
  2. Convolution: Local context mixing
  3. SSM: Selective state space layer
  4. Gating: Multiplicative interaction
  5. Linear projection: Back to model dimension
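The five steps above can be sketched structurally as follows. This is a heavily simplified stand-in, not the reference implementation: weights are random, and the selective SSM is reduced to a per-channel linear recurrence with a fixed decay so the block's shape and data flow stay visible:

```python
import numpy as np

D, E, T, K = 8, 16, 32, 4     # model dim, expanded dim, seq len, conv width (toy)
rng = np.random.default_rng(0)
W_in = 0.1 * rng.standard_normal((D, 2 * E))     # 1. expand (x branch + gate z)
conv_w = 0.1 * rng.standard_normal((E, K))       # 2. depthwise causal conv weights
decay = np.exp(-np.abs(rng.standard_normal(E)))  # 3. stand-in for the selective SSM
W_out = 0.1 * rng.standard_normal((E, D))        # 5. project back down

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(u):                              # u: (T, D)
    xz = u @ W_in                                # 1. linear projection up
    x, z = xz[:, :E], xz[:, E:]
    xp = np.pad(x, ((K - 1, 0), (0, 0)))         # 2. causal depthwise convolution
    x = np.stack([(xp[t:t + K] * conv_w.T).sum(0) for t in range(T)])
    x = silu(x)
    h = np.zeros(E)                              # 3. simplified SSM recurrence:
    y = np.empty_like(x)                         #    h_t = decay * h_{t-1} + x_t
    for t in range(T):
        h = decay * h + x[t]
        y[t] = h
    y = y * silu(z)                              # 4. gating: multiplicative interaction
    return y @ W_out                             # 5. linear projection back to D

out = mamba_block(rng.standard_normal((T, D)))
print(out.shape)  # (32, 8)
```

In the real architecture, step 3 is the selective scan with input-dependent B_t, C_t, and Δ_t described earlier.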

Efficient Computation

The recurrence can be computed in parallel via:

  • Parallel scan: O(n)O(n) work, O(logn)O(\log n) depth
  • Hardware-aware: Fused CUDA kernels
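The parallel scan works because the recurrence h_t = a_t h_{t-1} + b_t is associative under the combine rule (a1, b1) ∘ (a2, b2) = (a1·a2, a2·b1 + b2). A sketch (recursive here for clarity; the two halves and the combine loop are what a real kernel would run in parallel, giving O(log n) depth):

```python
import numpy as np

def combine(left, right):
    """Compose h -> a1*h + b1 followed by h -> a2*h + b2."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def scan(pairs):
    """Inclusive scan: O(n) work, O(log n) depth if halves run in parallel."""
    if len(pairs) == 1:
        return pairs
    mid = len(pairs) // 2
    left, right = scan(pairs[:mid]), scan(pairs[mid:])
    carry = left[-1]
    return left + [combine(carry, p) for p in right]

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 8)
b = rng.standard_normal(8)

# b-component of each prefix is h_t when h_0 = 0
h_par = [hb for _, hb in scan(list(zip(a, b)))]

h, h_seq = 0.0, []                  # sequential reference
for t in range(8):
    h = a[t] * h + b[t]
    h_seq.append(h)

print(np.allclose(h_par, h_seq))  # True
```

This is the same prefix-scan trick used by libraries such as JAX's `associative_scan`; Mamba's contribution is fusing it into hardware-aware kernels that keep the expanded state in fast SRAM.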

Performance Comparison

| Model       | Sequence Length | Memory | Throughput  |
| ----------- | --------------- | ------ | ----------- |
| Transformer | ~2k typical     | O(n^2) | Baseline    |
| Mamba       | 1M+ possible    | O(n)   | 3–5× faster |

Key Properties

Advantages:

  • Linear-time inference
  • Constant memory per token in generation
  • Strong performance on long-range tasks

Trade-offs:

  • Less mature ecosystem than Transformers
  • Some tasks still favor attention
  • Hybrid architectures may be optimal

Applications

Mamba shows promise for:

  • Long-document understanding
  • Genomics and DNA modeling
  • Audio generation
  • Efficient edge deployment