Lost in State Space: Probing Frozen Mamba Representations
Pith reviewed 2026-05-09 19:53 UTC · model grok-4.3
The pith
Mamba recurrent states extracted at fixed patch boundaries do not consistently outperform mean pooling for frozen sentence representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across SST-2, CoLA, MRPC, STS-B, and IMDb, a comparison of four strategies for extracting sentence representations from a frozen pretrained Mamba-130M model shows that fixed patch-boundary readouts of the recurrent state do not reliably exceed mean pooling. The raw final SSM state collapses to zero Matthews correlation on CoLA, and the token representations exhibit severe anisotropy, with a mean pairwise cosine similarity of 0.9999. The work introduces orthogonal injection, a modified recurrence intended to limit how much new information enters the state at each step.
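The anisotropy figure is a straightforward pairwise statistic over extracted vectors. A minimal sketch of how such a diagnostic might be computed follows; the function name and the random placeholder embeddings are illustrative, not the paper's data or code.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> tuple[float, float]:
    """Mean and std of pairwise cosine similarity over a set of vectors."""
    # Normalize each row to unit length so cosines become dot products.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Keep only the strictly upper triangle: distinct pairs, no self-similarity.
    iu = np.triu_indices(len(embeddings), k=1)
    pair_sims = sims[iu]
    return float(pair_sims.mean()), float(pair_sims.std())

# Illustrative placeholder: 1000 random 768-d vectors stand in for token states.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 768))
mean_sim, std_sim = mean_pairwise_cosine(vectors)
print(f"mean pairwise cosine: {mean_sim:.4f} (std {std_sim:.6f})")
```

Random isotropic vectors give a mean near zero; a frozen encoder whose vectors all point the same way pushes this number toward 1, which is the regime the paper reports.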
What carries the argument
Mamba's recurrent state h_t, treated as a compressed running summary of prior tokens, evaluated through four extraction strategies including patch-boundary readouts versus mean pooling under frozen probing.
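The summary does not enumerate the paper's four strategies, so the sketch below illustrates three representative readouts from a matrix of frozen per-token states: mean pooling, the final position, and averaged fixed patch-boundary positions. The function name and the patch interval are assumptions for illustration only.

```python
import numpy as np

def extract_sentence_vector(hidden: np.ndarray, strategy: str, patch_size: int = 16) -> np.ndarray:
    """Turn per-token hidden states (seq_len x d) into one frozen sentence vector.

    Illustrative variants only; the paper's exact four strategies and patch
    boundaries are not specified here.
    """
    seq_len, _ = hidden.shape
    if strategy == "mean_pool":
        return hidden.mean(axis=0)                # average over all token positions
    if strategy == "final_state":
        return hidden[-1]                         # last position only
    if strategy == "patch_boundaries":
        # Read out at fixed boundaries (every patch_size tokens), then average.
        idx = np.arange(patch_size - 1, seq_len, patch_size)
        if len(idx) == 0 or idx[-1] != seq_len - 1:
            idx = np.append(idx, seq_len - 1)     # always include the final position
        return hidden[idx].mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Placeholder hidden states for a 37-token input from a 768-d model.
states = np.random.default_rng(1).normal(size=(37, 768))
for name in ("mean_pool", "final_state", "patch_boundaries"):
    print(name, extract_sentence_vector(states, name).shape)
```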
If this is right
- Mean pooling remains a competitive default for obtaining frozen sentence vectors from Mamba on these tasks.
- The final SSM state alone is insufficient for tasks such as CoLA, where it yields zero Matthews correlation.
- Token representations inside the model display extreme anisotropy that limits their direct usefulness.
- Orthogonal injection provides one concrete way to alter the recurrence and constrain per-step information (one plausible form is sketched after this list).
- Extraction method choice affects downstream probing performance more than expected from the model architecture alone.
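Neither the summary above nor the truncated abstract states the orthogonal-injection update rule. The sketch below is one plausible reading, assumed rather than taken from the paper: the per-step input contribution is projected onto the orthogonal complement of the previous state before being added, so new information cannot pile onto directions the state already occupies. The dense matrices A and B here stand in for Mamba's selective, input-dependent parameters.

```python
import numpy as np

def orthogonal_injection_step(h_prev: np.ndarray, x_t: np.ndarray,
                              A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """One recurrence step with the input update projected away from h_prev.

    Guessed form, not the paper's definition: the usual update
    h_t = A h_prev + B x_t is modified so that only the component of B x_t
    orthogonal to h_prev is injected into the state.
    """
    decayed = A @ h_prev
    injected = B @ x_t
    norm_sq = float(h_prev @ h_prev)
    if norm_sq > 0.0:
        # Remove the component of the new information parallel to the old state.
        injected = injected - (injected @ h_prev) / norm_sq * h_prev
    return decayed + injected

# Tiny illustrative run with random parameters.
rng = np.random.default_rng(2)
d_state, d_in = 16, 8
A = 0.9 * np.eye(d_state)
B = rng.normal(scale=0.1, size=(d_state, d_in))
h = np.zeros(d_state)
for _ in range(5):
    h = orthogonal_injection_step(h, rng.normal(size=d_in), A, B)
print("state norm after 5 steps:", np.linalg.norm(h))
```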
Where Pith is reading between the lines
- Similar probing difficulties may appear in other selective state-space models when their internal states are used without adaptation.
- The observed collapse and anisotropy point toward a need for training objectives or architectural changes that explicitly encourage semantic diversity in the state.
- Developers building zero-shot sentence embedding pipelines from Mamba may need learned heads or post-processing rather than relying on raw state readouts.
- The gap between theoretical compression in the recurrence and practical semantic utility suggests targeted diagnostics for state-space models on longer or more complex inputs.
Load-bearing premise
The chosen fixed patch boundaries and four extraction strategies give a fair test of whether the recurrent state inherently compresses usable semantic sentence information.
What would settle it
A replication on the same five benchmarks and protocol that finds patch-boundary readouts statistically outperforming mean pooling across three seeds, or that measures low anisotropy in the extracted vectors.
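Read concretely, that criterion amounts to comparing per-seed probe scores for the two readout strategies. The numbers and the paired t-test below are purely illustrative (three seeds give little statistical power); they are not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed probe scores (three seeds) for two extraction strategies.
patch_boundary = np.array([0.842, 0.851, 0.839])
mean_pooling   = np.array([0.848, 0.853, 0.845])

t_stat, p_value = stats.ttest_rel(patch_boundary, mean_pooling)
wins_every_seed = bool(np.all(patch_boundary > mean_pooling))
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}, wins all seeds: {wins_every_seed}")
```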
original abstract
Mamba's recurrent state h_t is, by construction, a compressed summary of every token seen so far. This raises a tempting hypothesis: if we extract token-level outputs y_t at fixed patch boundaries, we obtain semantic sentence summaries for free, with no pooling head, no fine-tuning, and no [CLS] token. We test this hypothesis carefully. Across five benchmarks (SST-2, CoLA, MRPC, STS-B, IMDb), we compare four strategies for extracting frozen sentence representations from a pretrained Mamba-130M backbone under a strict frozen-feature probing protocol, using three random seeds where computationally feasible. The results do not support the hypothesis: patch boundary readouts do not consistently outperform simple mean pooling. We identify and quantify two structural pathologies: severe anisotropy (mean pairwise cosine similarity 0.9999, std 0.000044) and representational collapse in the raw final SSM state (MCC = 0.000 on CoLA across all three seeds, confirmed via confusion matrix). We further propose orthogonal injection, a modified recurrence that constrains new information per
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper tests whether token-level outputs extracted from Mamba's recurrent state h_t at fixed patch boundaries can serve as semantic sentence representations without pooling, fine-tuning, or a [CLS] token. Using a frozen Mamba-130M backbone on five benchmarks (SST-2, CoLA, MRPC, STS-B, IMDb), it compares four extraction strategies under a strict linear-probe protocol with three random seeds, finding that patch-boundary readouts do not consistently outperform mean pooling. It further quantifies two pathologies—anisotropy (mean pairwise cosine ~0.9999) and representational collapse (MCC=0 on CoLA)—and proposes orthogonal injection as a modified recurrence.
Significance. If the negative result holds, it provides concrete evidence that Mamba's compressed recurrent state does not inherently yield usable sentence-level semantics at arbitrary boundaries, highlighting limitations of frozen SSM representations for NLP. The multi-benchmark, multi-seed design with metrics such as MCC and cosine similarity strengthens the empirical contribution and should inform future work on state extraction or architectural fixes in state-space models.
major comments (2)
- [§3] §3 (Extraction strategies and orthogonal injection): The four extraction strategies and the orthogonal injection modification are defined, but the manuscript does not report the exact patch-boundary indices used or provide pseudocode/equations for the injection update rule; without these, it is difficult to verify that the chosen boundaries constitute a fair test of the 'free summary' hypothesis or to reproduce the proposed fix for the observed collapse.
- [Results section] Results section, CoLA row of the main table: MCC=0 is reported for the raw final state across all three seeds, yet no per-seed variance, confusion-matrix breakdown, or comparison to a trivial baseline (e.g., always predicting the majority class) is supplied; this weakens the collapse claim as evidence against the recurrent state.
minor comments (3)
- [Abstract] Abstract: the description of orthogonal injection is truncated mid-sentence and omits any mention of the quantitative results or statistical protocol.
- The manuscript should include a brief discussion of how the observed anisotropy compares to known anisotropy in Transformer representations and whether any post-hoc whitening was attempted.
- Table captions and axis labels on any anisotropy or MCC plots should explicitly state the number of seeds and the exact pooling variants being compared.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of the empirical contribution. We address each major comment point by point below and will revise the manuscript accordingly to improve reproducibility and strengthen the evidence presented.
point-by-point responses
Referee: [§3] §3 (Extraction strategies and orthogonal injection): The four extraction strategies and the orthogonal injection modification are defined, but the manuscript does not report the exact patch-boundary indices used or provide pseudocode/equations for the injection update rule; without these, it is difficult to verify that the chosen boundaries constitute a fair test of the 'free summary' hypothesis or to reproduce the proposed fix for the observed collapse.
Authors: We agree that the exact patch-boundary indices and the update rule for orthogonal injection should be specified for reproducibility. In the revised manuscript, we will add the precise indices used for each benchmark (determined by fixed token intervals scaled to average sentence length in the dataset) and include both pseudocode and the full mathematical equations for the orthogonal injection modification in §3. This will allow direct verification of the extraction strategies and the proposed fix. revision: yes
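For readers trying to picture the promised specification, one possible reading of "fixed token intervals scaled to average sentence length" is sketched below. The function name, the patches-per-sentence factor, and the rounding are assumptions, not the authors' code.

```python
def patch_boundary_indices(seq_len: int, avg_sentence_len: float,
                           patches_per_avg_sentence: int = 4) -> list[int]:
    """Fixed readout positions: one every (avg_sentence_len / k) tokens.

    Assumed interpretation of 'fixed token intervals scaled to average
    sentence length'; the manuscript's exact indices may differ.
    """
    interval = max(1, round(avg_sentence_len / patches_per_avg_sentence))
    boundaries = list(range(interval - 1, seq_len, interval))
    if not boundaries or boundaries[-1] != seq_len - 1:
        boundaries.append(seq_len - 1)  # always read the final position too
    return boundaries

# e.g. a 30-token input on a dataset whose sentences average 20 tokens
print(patch_boundary_indices(seq_len=30, avg_sentence_len=20.0))
```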
Referee: [Results section] Results section, CoLA row of the main table: MCC=0 is reported for the raw final state across all three seeds, yet no per-seed variance, confusion-matrix breakdown, or comparison to a trivial baseline (e.g., always predicting the majority class) is supplied; this weakens the strength of the collapse claim as evidence against the recurrent state.
Authors: We appreciate this suggestion for strengthening the collapse claim. The MCC=0 result was consistent across all three seeds, and we had verified it internally using confusion matrices (which showed predictions collapsing exclusively to the majority class). In the revision, we will report the per-seed MCC values (all exactly 0), include the confusion-matrix breakdown in an appendix, and add an explicit comparison to the majority-class baseline (which also yields MCC=0). This will be noted in the results section or a footnote for clarity. revision: yes
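The statement that the majority-class baseline also yields MCC = 0 is easy to check directly; a small verification with scikit-learn follows (the class balance below is illustrative, not CoLA's actual label distribution).

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, confusion_matrix

# Illustrative binary labels with a 70/30 class imbalance.
rng = np.random.default_rng(3)
y_true = (rng.random(1000) < 0.3).astype(int)   # 1 = minority class
y_pred = np.zeros_like(y_true)                  # always predict the majority class

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))  # 0.0: no correlation with labels
```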
Circularity Check
No significant circularity identified
full rationale
The paper is a purely empirical probing study. It defines extraction strategies, benchmarks, and a frozen linear-probe protocol explicitly, then reports direct experimental comparisons (patch-boundary readouts vs. mean pooling) across seeds. No derivations, equations, or fitted parameters are presented that reduce to inputs by construction. The negative result on the hypothesis follows from the stated comparisons without load-bearing self-citations or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the pretrained Mamba-130M backbone provides a representative frozen feature extractor for testing the state-compression hypothesis.