Mamba: State Space Models

A sequence model that keeps a running state instead of attending to every token pair

Mamba is a sequence model designed for long contexts. Its selling point is efficiency: instead of comparing every token with every other token like a Transformer, it carries a learned internal state forward through the sequence.

If you are new to sequence models, start with Understanding LSTMs and Transformer. Mamba is easier to understand if you already know both recurrence and attention.

The Intuition

Imagine reading a long textbook chapter. A Transformer repeatedly looks back at many earlier words. Mamba tries a different strategy: keep a smart running summary of what matters, then update that summary as each new token arrives.

That makes Mamba feel closer to an RNN in spirit, but with a more expressive update rule and much better engineering for modern hardware.

Why People Care

Self-attention has quadratic cost in sequence length:

$$O(n^2 \cdot d)$$

For very long sequences, that becomes expensive in memory and time. Mamba aims for:

$$O(n \cdot d)$$

That difference matters a lot for long documents, audio, DNA, or other very long sequences.
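
To make the scaling concrete, here is a back-of-the-envelope comparison. The operation counts are illustrative, not measured: we count roughly n² · d operations for self-attention and n · d for a linear recurrence, with an assumed hidden size d = 64.

```python
def attention_ops(n: int, d: int = 64) -> int:
    """Rough O(n^2 * d) operation count for self-attention."""
    return n * n * d

def recurrence_ops(n: int, d: int = 64) -> int:
    """Rough O(n * d) operation count for a linear recurrence."""
    return n * d

for n in (1_000, 100_000, 1_000_000):
    ratio = attention_ops(n) / recurrence_ops(n)
    print(f"n={n:>9,}  attention/recurrence ratio = {ratio:,.0f}x")
```

The ratio grows linearly with n: at a million tokens, attention does on the order of a million times more work per layer than a recurrence with the same hidden size.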

What a State Space Model Does

A state space model (SSM) keeps a hidden state that changes over time. You can think of it as a memory vector that gets updated every time a new token comes in.

In continuous form:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

You do not need the control-theory background on first read. The useful mental model is:

  • state = what the model is currently remembering
  • input = the new token
  • output = what the model produces from that memory

From Continuous Time to Tokens

Because text arrives token by token, the model uses a discrete recurrence:

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t$$

This is the part that makes Mamba look like a learned memory-update system.
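
The recurrence above can be sketched in a few lines of NumPy. This is a toy single-channel version with arbitrary made-up matrices, just to show the shape of the computation: one state update per token, one output read per token.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, xs):
    """Run the discrete SSM recurrence
        h_t = A_bar @ h_{t-1} + B_bar * x_t,   y_t = C @ h_t
    over a 1-D input sequence xs.
    Toy shapes: A_bar (N, N), B_bar (N,), C (N,), xs (T,).
    """
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x in xs:                      # one state update per token
        h = A_bar @ h + B_bar * x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
N, T = 4, 10
A_bar = 0.9 * np.eye(N)              # assumed toy dynamics: decaying memory
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
xs = rng.normal(size=T)
ys = ssm_scan(A_bar, B_bar, C, xs)
print(ys.shape)  # (10,)
```

Note that the cost is one matrix-vector product per token, independent of how long the sequence already is; this is where the O(n) scaling comes from.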

The Key Innovation: Selectivity

Classic SSMs use fixed dynamics. Mamba makes parts of the update depend on the current token:

$$B_t = f_B(x_t), \qquad C_t = f_C(x_t), \qquad \Delta_t = f_\Delta(x_t)$$

That means the model can decide, based on content, whether to keep information, forget it, or emphasize it. This is why people often describe Mamba as a selective state space model.
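
A minimal sketch of selectivity, with several simplifications: the input is a scalar sequence, the selection functions f_B, f_C, f_Δ are affine maps with made-up weights, A is diagonal, and the discretization (A_bar = exp(Δ·A)) is simpler than the exact rule in the Mamba paper. The point is only that the update rule now changes with the content of each token.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(xs, params):
    """Toy selective SSM over a scalar input sequence xs (shape (T,)).
    B_t, C_t and the step size Delta_t are affine functions of the
    current token, so the state update depends on content."""
    A, wB, bB, wC, bC, wd, bd = params   # A, wB, bB, wC, bC: (N,)
    h = np.zeros_like(A)
    ys = []
    for x in xs:
        delta = softplus(wd * x + bd)    # Delta_t = f_Delta(x_t) > 0
        B_t = wB * x + bB                # B_t = f_B(x_t)
        C_t = wC * x + bC                # C_t = f_C(x_t)
        h = np.exp(delta * A) * h + delta * B_t * x
        ys.append(float(C_t @ h))
    return np.array(ys)

rng = np.random.default_rng(1)
N = 4
params = (-np.abs(rng.normal(size=N)),   # A < 0 so the state decays
          *rng.normal(size=(4, N)),      # wB, bB, wC, bC
          0.5, 0.0)                      # wd, bd
ys = selective_scan(rng.normal(size=8), params)
print(ys.shape)  # (8,)
```

When Δ_t is near zero the state barely changes (the token is effectively skipped); when Δ_t is large the new input overwrites more of the old state. That is the "decide what to keep" mechanism in miniature.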

Complexity Comparison

Compare Mamba's linear scaling with the Transformer's quadratic growth. At 1,000 tokens, quadratic attention performs on the order of n² = 1,000,000 pairwise operations, while a linear recurrence performs on the order of n = 1,000 state updates, a roughly 1,000× difference.

| Model | Scaling | Context length |
| --- | --- | --- |
| Transformer | Quadratic attention, O(n²) | ~2k tokens typical |
| Mamba (SSM) | Linear recurrence, O(n) | 1M+ tokens possible |

Key insight: State space models process sequences with O(n) complexity by maintaining a fixed-size hidden state, while achieving comparable quality through selective gating.

What a Mamba Block Contains

At a high level, a block has:

  1. A projection to expand the representation
  2. A short convolution for local mixing
  3. The selective state space update
  4. A gating interaction
  5. A projection back to model size
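
The five steps above can be sketched end to end. Everything here is a simplification with made-up dimensions and weights: the inner SSM is a plain diagonal recurrence rather than the selective scan, and the real architecture adds normalization, residual connections, and careful parameterization.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def causal_conv1d(X, kernels):
    """Depthwise causal conv: channel e mixes positions t-k+1..t
    with its own length-k kernel. X: (T, E), kernels: (E, k)."""
    T, E = X.shape
    k = kernels.shape[1]
    padded = np.vstack([np.zeros((k - 1, E)), X])
    out = np.zeros_like(X)
    for t in range(T):
        out[t] = np.sum(padded[t:t + k] * kernels.T, axis=0)
    return out

def mamba_block(x, p):
    """One Mamba-style block over x: (T, D), following steps 1-5."""
    u = x @ p["W_in"]                       # 1. expand D -> E
    g = x @ p["W_gate"]                     #    parallel branch for gating
    u = silu(causal_conv1d(u, p["conv"]))   # 2. short causal conv
    h = np.zeros(u.shape[1])                # 3. diagonal SSM scan
    ys = np.zeros_like(u)
    for t in range(u.shape[0]):
        h = p["a"] * h + p["b"] * u[t]
        ys[t] = p["c"] * h
    y = ys * silu(g)                        # 4. gating interaction
    return y @ p["W_out"]                   # 5. project E -> D

rng = np.random.default_rng(0)
T, D, E, k = 6, 8, 16, 4
p = {"W_in": rng.normal(size=(D, E)) / np.sqrt(D),
     "W_gate": rng.normal(size=(D, E)) / np.sqrt(D),
     "conv": rng.normal(size=(E, k)) / k,
     "a": 0.9 * np.ones(E), "b": np.ones(E), "c": np.ones(E),
     "W_out": rng.normal(size=(E, D)) / np.sqrt(E)}
out = mamba_block(rng.normal(size=(T, D)), p)
print(out.shape)  # (6, 8)
```

The gating branch (step 4) multiplies the SSM output elementwise by a content-dependent signal, which is what lets the block suppress or pass through information per channel.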

When Mamba Helps

Mamba is especially interesting when:

  • sequences are very long
  • inference cost matters a lot
  • a recurrent formulation is acceptable in exchange for much lower compute and memory cost

Trade-offs

| Strength | Limitation |
| --- | --- |
| Linear scaling in sequence length | Harder to interpret than plain recurrence |
| Strong long-context efficiency | Smaller ecosystem than Transformers |
| Useful hybrid building block | Attention still wins on some tasks |

What To Remember

  • Mamba is a long-sequence alternative to attention-heavy models
  • It keeps a learned running state instead of attending to all pairs of tokens
  • Its main advantage is efficiency at long context lengths
