A sequence model that keeps a running state instead of attending to every token pair
Mamba is a sequence model designed for long contexts. Its selling point is efficiency: instead of comparing every token with every other token like a Transformer, it carries a learned internal state forward through the sequence.
If you are new to sequence models, start with Understanding LSTMs and Transformer. Mamba is easier to understand if you already know both recurrence and attention.
The Intuition
Imagine reading a long textbook chapter. A Transformer repeatedly looks back at every earlier word when processing a new one. Mamba tries a different strategy: keep a smart running summary of what matters, then update that summary as each new token arrives.
That makes Mamba feel closer to an RNN in spirit, but with a more expressive update rule and much better engineering for modern hardware.
Why People Care
Self-attention has quadratic cost in sequence length: processing n tokens takes O(n²) time and memory, because every token is compared with every other token.
For very long sequences, that becomes expensive in memory and time. Mamba aims for O(n) time with a fixed-size state, so cost grows linearly with sequence length.
That difference matters a lot for long documents, audio, DNA, or other very long sequences.
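To make the scaling difference concrete, here is a back-of-envelope comparison of operation counts. The numbers are illustrative (a hypothetical state size, not real FLOP measurements), but the growth rates are the point:

```python
# Back-of-envelope scaling comparison (illustrative operation counts,
# not real FLOP measurements).
def attention_ops(n):
    # Self-attention compares every token pair: O(n^2).
    return n * n

def ssm_ops(n, d_state=16):
    # A state space model does a fixed amount of work per token: O(n).
    # d_state is a hypothetical state size.
    return n * d_state

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}: attention ~{attention_ops(n):,} ops, ssm ~{ssm_ops(n):,} ops")
```

At 100,000 tokens the pairwise count is already thousands of times larger than the linear one, which is why the gap dominates for long documents.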
What a State Space Model Does
A state space model (SSM) keeps a hidden state that changes over time. You can think of it as a memory vector that gets updated every time a new token comes in.
In continuous form, an SSM is defined by two equations:

h′(t) = A h(t) + B x(t)
y(t) = C h(t)

Here h(t) is the hidden state, x(t) is the input, and A, B, C are learned matrices.
You do not need the control-theory background on first read. The useful mental model is:
- state = what the model is currently remembering
- input = the new token
- output = what the model produces from that memory
From Continuous Time to Tokens
Because text arrives token by token, the model uses a discrete recurrence:

h_t = Ā h_{t−1} + B̄ x_t
y_t = C h_t

where Ā and B̄ are discretized versions of A and B, computed with a step size Δ.
This is the part that makes Mamba look like a learned memory-update system.
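A minimal NumPy sketch of that recurrence, with toy dimensions and random matrices standing in for learned, already-discretized parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, seq_len = 4, 2, 6

# Stand-ins for the discretized parameters (learned in a real model).
A_bar = 0.9 * np.eye(d_state)             # state transition (decays the memory)
B_bar = rng.normal(size=(d_state, d_in))  # writes the input into the state
C = rng.normal(size=(1, d_state))         # reads an output from the state

h = np.zeros(d_state)                     # fixed-size running state
outputs = []
for x_t in rng.normal(size=(seq_len, d_in)):  # one token at a time
    h = A_bar @ h + B_bar @ x_t           # update the memory
    outputs.append(C @ h)                 # emit from the memory

y = np.stack(outputs)                     # shape (seq_len, 1)
```

Note that the state h never grows with the sequence: every token does the same fixed amount of work, which is where the linear scaling comes from.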
The Key Innovation: Selectivity
Classic SSMs use fixed dynamics: the same A, B, and C apply at every position. Mamba makes parts of the update depend on the current token: B, C, and the step size Δ are computed from the input x_t itself.
That means the model can decide, based on content, whether to keep information, forget it, or emphasize it. This is why people often describe Mamba as a selective state space model.
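A loose sketch of the selective update, with scalar inputs and random projections standing in for the learned ones. This is not the actual Mamba parameterization (which uses diagonal A, per-channel states, and a hardware-aware scan), but it shows the mechanism: the step size, write vector, and read vector all come from the current input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 4

# Hypothetical learned projections (random stand-ins) that make the
# update depend on the current input x_t -- the "selective" part.
w_B = rng.normal(size=(d_state,))   # x_t -> B_t (how to write the input)
w_C = rng.normal(size=(d_state,))   # x_t -> C_t (how to read the state)
a = -np.ones(d_state)               # fixed negative (diagonal) dynamics A

def softplus(z):
    return np.log1p(np.exp(z))

h = np.zeros(d_state)
for x_t in rng.normal(size=5):      # scalar input per step, for simplicity
    delta = softplus(x_t)           # content-dependent step size
    A_bar = np.exp(delta * a)       # small delta -> keep memory (A_bar ~ 1),
                                    # large delta -> overwrite it (A_bar ~ 0)
    B_t = w_B * x_t                 # input-dependent write
    C_t = w_C * x_t                 # input-dependent read
    h = A_bar * h + delta * B_t     # selective state update
    y_t = C_t @ h                   # selective readout
```

The step size delta acts like a gate: tokens the model judges unimportant get a small delta and barely disturb the state, while important tokens get a large delta and rewrite it.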
Key insight: State space models process sequences with O(n) complexity by maintaining a fixed-size hidden state, while achieving comparable quality through selective gating.
What a Mamba Block Contains
At a high level, a block has:
- A projection to expand the representation
- A short convolution for local mixing
- The selective state space update
- A gating interaction
- A projection back to model size
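The five stages above can be sketched as shapes flowing through a block. Everything here is a simplified stand-in (random weights, a plain decayed running state in place of the real selective scan), meant only to show how the pieces connect:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, expand = 8, 16, 2
d_inner = expand * d_model

x = rng.normal(size=(seq_len, d_model))

# 1. Projection to expand the representation (two branches: main + gate).
W_in = rng.normal(size=(d_model, 2 * d_inner)) * 0.1
u, gate = np.split(x @ W_in, 2, axis=-1)

# 2. Short causal convolution for local mixing (kernel size 3, per channel).
kernel = rng.normal(size=(3, d_inner)) * 0.1
padded = np.vstack([np.zeros((2, d_inner)), u])
u = sum(kernel[k] * padded[k : k + seq_len] for k in range(3))

# 3. Selective state space update (stand-in: a simple decayed running state).
h = np.zeros(d_inner)
ssm_out = np.empty_like(u)
for t in range(seq_len):
    h = 0.9 * h + u[t]
    ssm_out[t] = h

# 4. Gating interaction (elementwise SiLU gate from the second branch).
gated = ssm_out * (gate / (1 + np.exp(-gate)))

# 5. Projection back to model size.
W_out = rng.normal(size=(d_inner, d_model)) * 0.1
y = gated @ W_out        # shape (seq_len, d_model)
```

The gate branch is what lets the block suppress or pass the state space output per channel, and the expand/contract projections keep the expensive middle stages in a wider space than the residual stream.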
When Mamba Helps
Mamba is especially interesting when:
- sequences are very long
- inference cost matters a lot
- recurrence is acceptable if it is much more efficient
Trade-offs
| Strength | Limitation |
|---|---|
| Linear scaling in sequence length | Harder to interpret than plain recurrence |
| Strong long-context efficiency | Smaller ecosystem than Transformers |
| Useful hybrid building block | Attention still wins on some tasks |
What To Remember
- Mamba is a long-sequence alternative to attention-heavy models
- It keeps a learned running state instead of attending to all pairs of tokens
- Its main advantage is efficiency at long context lengths
Where To Go Next
- Read Transformer for the main comparison point
- Read Understanding LSTMs for the recurrence intuition
- Read Pre-training to see how these models fit into foundation-model pipelines