Mamba: State Space Models

A sequence model that keeps a running state instead of attending to every token pair

Mamba is a sequence model designed for long contexts. Its selling point is efficiency: instead of comparing every token with every other token like a Transformer, it carries a learned internal state forward through the sequence.

If you are new to sequence models, start with Understanding LSTMs and Transformer. Mamba is easier to understand if you already know both recurrence and attention.

The Intuition

Imagine reading a long textbook chapter. A Transformer repeatedly looks back at many earlier words. Mamba tries a different strategy: keep a smart running summary of what matters, then update that summary as each new token arrives.

That makes Mamba feel closer to an RNN in spirit, but with a more expressive update rule and much better engineering for modern hardware.

Why People Care

Self-attention has quadratic cost in sequence length:

$$O(n^2 \cdot d)$$

For very long sequences, that becomes expensive in memory and time. Mamba aims for:

$$O(n \cdot d)$$

That difference matters a lot for long documents, audio, DNA, or other very long sequences.
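
To make the scaling concrete, here is a back-of-the-envelope comparison. The operation counts are illustrative, not measured: we count roughly n² · d operations for self-attention and n · d for a linear recurrence, with an assumed hidden size d = 64.

```python
def attention_ops(n: int, d: int = 64) -> int:
    """Rough O(n^2 * d) operation count for self-attention."""
    return n * n * d

def recurrence_ops(n: int, d: int = 64) -> int:
    """Rough O(n * d) operation count for a linear recurrence."""
    return n * d

for n in (1_000, 100_000, 1_000_000):
    ratio = attention_ops(n) / recurrence_ops(n)
    print(f"n={n:>9,}  attention/recurrence ratio = {ratio:,.0f}x")
```

The ratio grows linearly with n: at a million tokens, attention does on the order of a million times more work per layer than a recurrence with the same hidden size.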

What a State Space Model Does

A state space model (SSM) keeps a hidden state that changes over time. You can think of it as a memory vector that gets updated every time a new token comes in.

In continuous form:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

You do not need the control-theory background on first read. The useful mental model is:

  • state = what the model is currently remembering
  • input = the new token
  • output = what the model produces from that memory

From Continuous Time to Tokens

Because text arrives token by token, the model uses a discrete recurrence:

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t$$

This is the part that makes Mamba look like a learned memory-update system.
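
The recurrence above can be sketched in a few lines of NumPy. This is a toy single-channel version with arbitrary made-up matrices, just to show the shape of the computation: one state update per token, one output read per token.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, xs):
    """Run the discrete SSM recurrence
        h_t = A_bar @ h_{t-1} + B_bar * x_t,   y_t = C @ h_t
    over a 1-D input sequence xs.
    Toy shapes: A_bar (N, N), B_bar (N,), C (N,), xs (T,).
    """
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x in xs:                      # one state update per token
        h = A_bar @ h + B_bar * x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
N, T = 4, 10
A_bar = 0.9 * np.eye(N)              # assumed toy dynamics: decaying memory
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
xs = rng.normal(size=T)
ys = ssm_scan(A_bar, B_bar, C, xs)
print(ys.shape)  # (10,)
```

Note that the cost is one matrix-vector product per token, independent of how long the sequence already is; this is where the O(n) scaling comes from.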

The Key Innovation: Selectivity

Classic SSMs use fixed dynamics. Mamba makes parts of the update depend on the current token:

$$B_t = f_B(x_t), \qquad C_t = f_C(x_t), \qquad \Delta_t = f_\Delta(x_t)$$

That means the model can decide, based on content, whether to keep information, forget it, or emphasize it. This is why people often describe Mamba as a selective state space model.
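
A minimal sketch of selectivity, with several simplifications: the input is a scalar sequence, the selection functions f_B, f_C, f_Δ are affine maps with made-up weights, A is diagonal, and the discretization (A_bar = exp(Δ·A)) is simpler than the exact rule in the Mamba paper. The point is only that the update rule now changes with the content of each token.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(xs, params):
    """Toy selective SSM over a scalar input sequence xs (shape (T,)).
    B_t, C_t and the step size Delta_t are affine functions of the
    current token, so the state update depends on content."""
    A, wB, bB, wC, bC, wd, bd = params   # A, wB, bB, wC, bC: (N,)
    h = np.zeros_like(A)
    ys = []
    for x in xs:
        delta = softplus(wd * x + bd)    # Delta_t = f_Delta(x_t) > 0
        B_t = wB * x + bB                # B_t = f_B(x_t)
        C_t = wC * x + bC                # C_t = f_C(x_t)
        h = np.exp(delta * A) * h + delta * B_t * x
        ys.append(float(C_t @ h))
    return np.array(ys)

rng = np.random.default_rng(1)
N = 4
params = (-np.abs(rng.normal(size=N)),   # A < 0 so the state decays
          *rng.normal(size=(4, N)),      # wB, bB, wC, bC
          0.5, 0.0)                      # wd, bd
ys = selective_scan(rng.normal(size=8), params)
print(ys.shape)  # (8,)
```

When Δ_t is near zero the state barely changes (the token is effectively skipped); when Δ_t is large the new input overwrites more of the old state. That is the "decide what to keep" mechanism in miniature.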

Complexity Comparison

Compare Mamba's linear scaling with the Transformer's quadratic growth. At 1,000 tokens, quadratic attention performs on the order of n² = 1,000,000 pairwise operations, while a linear recurrence performs on the order of n = 1,000 state updates, a roughly 1,000× difference.

| Model | Scaling | Context length |
| --- | --- | --- |
| Transformer | Quadratic attention, O(n²) | ~2k tokens typical |
| Mamba (SSM) | Linear recurrence, O(n) | 1M+ tokens possible |

Key insight: State space models process sequences with O(n) complexity by maintaining a fixed-size hidden state, while achieving comparable quality through selective gating.

What a Mamba Block Contains

At a high level, a block has:

  1. A projection to expand the representation
  2. A short convolution for local mixing
  3. The selective state space update
  4. A gating interaction
  5. A projection back to model size
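
The five steps above can be sketched end to end. Everything here is a simplification with made-up dimensions and weights: the inner SSM is a plain diagonal recurrence rather than the selective scan, and the real architecture adds normalization, residual connections, and careful parameterization.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def causal_conv1d(X, kernels):
    """Depthwise causal conv: channel e mixes positions t-k+1..t
    with its own length-k kernel. X: (T, E), kernels: (E, k)."""
    T, E = X.shape
    k = kernels.shape[1]
    padded = np.vstack([np.zeros((k - 1, E)), X])
    out = np.zeros_like(X)
    for t in range(T):
        out[t] = np.sum(padded[t:t + k] * kernels.T, axis=0)
    return out

def mamba_block(x, p):
    """One Mamba-style block over x: (T, D), following steps 1-5."""
    u = x @ p["W_in"]                       # 1. expand D -> E
    g = x @ p["W_gate"]                     #    parallel branch for gating
    u = silu(causal_conv1d(u, p["conv"]))   # 2. short causal conv
    h = np.zeros(u.shape[1])                # 3. diagonal SSM scan
    ys = np.zeros_like(u)
    for t in range(u.shape[0]):
        h = p["a"] * h + p["b"] * u[t]
        ys[t] = p["c"] * h
    y = ys * silu(g)                        # 4. gating interaction
    return y @ p["W_out"]                   # 5. project E -> D

rng = np.random.default_rng(0)
T, D, E, k = 6, 8, 16, 4
p = {"W_in": rng.normal(size=(D, E)) / np.sqrt(D),
     "W_gate": rng.normal(size=(D, E)) / np.sqrt(D),
     "conv": rng.normal(size=(E, k)) / k,
     "a": 0.9 * np.ones(E), "b": np.ones(E), "c": np.ones(E),
     "W_out": rng.normal(size=(E, D)) / np.sqrt(E)}
out = mamba_block(rng.normal(size=(T, D)), p)
print(out.shape)  # (6, 8)
```

The gating branch (step 4) multiplies the SSM output elementwise by a content-dependent signal, which is what lets the block suppress or pass through information per channel.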

When Mamba Helps

Mamba is especially interesting when:

  • sequences are very long
  • inference cost matters a lot
  • a recurrent formulation is acceptable in exchange for much lower compute and memory cost

Trade-offs

| Strength | Limitation |
| --- | --- |
| Linear scaling in sequence length | Harder to interpret than plain recurrence |
| Strong long-context efficiency | Smaller ecosystem than Transformers |
| Useful hybrid building block | Attention still wins on some tasks |

What To Remember

  • Mamba is a long-sequence alternative to attention-heavy models
  • It keeps a learned running state instead of attending to all pairs of tokens
  • Its main advantage is efficiency at long context lengths
