pith. machine review for the scientific record.

arxiv: 2605.00253 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.LG

Recognition: unknown

Lost in State Space: Probing Frozen Mamba Representations

Akash Singh, Bhagyashree Wagh

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:53 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Mamba · state space models · frozen probing · sentence representations · anisotropy · representational collapse · orthogonal injection · recurrent state

The pith

Mamba recurrent states extracted at fixed patch boundaries do not consistently outperform mean pooling for frozen sentence representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the idea that Mamba's recurrent state, built as a running summary of every token seen so far, can supply ready-made semantic sentence vectors simply by reading it out at chosen patch boundaries. This would remove any need for pooling heads, fine-tuning, or special tokens. On five standard benchmarks and under a strict frozen-feature protocol with multiple random seeds, the patch readouts do not beat the simplest baseline of averaging all token outputs. The experiments also document extreme anisotropy across token vectors and complete representational collapse in the final state on at least one task. Readers care because the result questions how much semantic compression is actually occurring inside the state space without further intervention.

Core claim

Across SST-2, CoLA, MRPC, STS-B, and IMDb, a comparison of four strategies for extracting sentence representations from a frozen pretrained Mamba-130M model shows that fixed patch-boundary readouts of the recurrent state do not reliably exceed mean pooling. The raw final SSM state collapses to zero Matthews correlation on CoLA, and the token representations exhibit severe anisotropy, with mean pairwise cosine similarity of 0.9999. The work introduces orthogonal injection, a modified recurrence intended to limit how much new information enters the state at each step.
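The anisotropy figure is easy to sanity-check. A minimal NumPy sketch of the diagnostic (mean pairwise cosine similarity over extracted vectors, diagonal excluded), applied to synthetic near-collapsed and isotropic vectors rather than real model states:

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Mean pairwise cosine similarity over rows of X, diagonal excluded."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    mask = ~np.eye(len(S), dtype=bool)
    return float(S[mask].mean())

rng = np.random.default_rng(0)
shared = rng.normal(size=64)
# Ten vectors that all point in (almost) one shared direction, mimicking
# the collapse the paper reports for the final SSM state.
collapsed = shared + 1e-4 * rng.normal(size=(10, 64))
iso = rng.normal(size=(10, 64))  # isotropic control
```

On the collapsed set the statistic sits essentially at 1.0; on the isotropic control it hovers near 0, which is what separates a healthy representation space from the one the paper measures.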

What carries the argument

Mamba's recurrent state h_t, treated as a compressed running summary of prior tokens, evaluated through four extraction strategies including patch-boundary readouts versus mean pooling under frozen probing.
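Under the obvious tensor conventions (a [T, d] array of post-projection outputs y_t plus a per-step state h_t), three of the strategies can be sketched as below. How the paper aggregates multiple boundary readouts into a single vector is not stated in the reviewed text, so the averaging here is an assumption, and the dimensions are toy values, not the real model's:

```python
import numpy as np

PATCH = 32  # patch length reported in the paper

def mean_pool(y):
    """(b) Average post-projection token outputs y_t over all positions."""
    return y.mean(axis=0)

def patch_boundary_readout(y, patch=PATCH):
    """(a) Read y_t at each patch boundary (t = patch-1, 2*patch-1, ...).
    Averaging the boundary readouts into one vector is our assumption;
    the paper's aggregation rule is not given here."""
    idx = np.arange(patch - 1, len(y), patch)
    return y[idx].mean(axis=0)

def final_state(h):
    """(c) Raw SSM state h_T after the last token, no output projection."""
    return h[-1]

T, d_model, d_state = 100, 768, 16  # toy sizes for illustration only
y = np.zeros((T, d_model))
h = np.zeros((T, d_state))
```

The point of the sketch is the contrast in what each readout sees: mean pooling touches every position, boundary readouts touch a handful, and the final state touches only the last recurrence step.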

If this is right

  • Mean pooling remains a competitive default for obtaining frozen sentence vectors from Mamba on these tasks.
  • The final SSM state alone is insufficient for tasks such as CoLA, where it yields zero Matthews correlation.
  • Token representations inside the model display extreme anisotropy that limits their direct usefulness.
  • Orthogonal injection provides one concrete way to alter the recurrence and constrain per-step information.
  • The choice of extraction method affects downstream probing performance more than the architecture alone would suggest.
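The paper does not print the orthogonal-injection update rule in the reviewed text (a gap the referee raises below), but one plausible reading, projecting each step's input contribution onto the orthogonal complement of the current state, can be sketched as:

```python
import numpy as np

def orthogonal_injection_step(h, a, bx):
    """One recurrence step under a guessed 'orthogonal injection' rule:
    before adding the input contribution bx = B * x_t, strip its
    component along the previous state h, so each step can only inject
    information orthogonal to what the state already encodes.
    This is our reconstruction, not the paper's published equations."""
    norm = np.linalg.norm(h)
    if norm > 0.0:
        u = h / norm
        bx = bx - (u @ bx) * u  # keep only the part orthogonal to h
    return a * h + bx

rng = np.random.default_rng(0)
h_prev = rng.normal(size=8)
h_next = orthogonal_injection_step(h_prev, 0.9, rng.normal(size=8))
```

By construction the injected part (h_next minus the decayed previous state) is orthogonal to h_prev, which is one concrete way to "constrain new information per step" as the abstract puts it.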

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar probing difficulties may appear in other selective state-space models when their internal states are used without adaptation.
  • The observed collapse and anisotropy point toward a need for training objectives or architectural changes that explicitly encourage semantic diversity in the state.
  • Developers building zero-shot sentence embedding pipelines from Mamba may need learned heads or post-processing rather than relying on raw state readouts.
  • The gap between theoretical compression in the recurrence and practical semantic utility suggests targeted diagnostics for state-space models on longer or more complex inputs.
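As a concrete instance of the post-processing route, the standard "all-but-the-top" recipe (mean-centering plus removal of the top principal components) is the kind of fix such a pipeline might apply. The paper cites but does not evaluate it; this sketch shows it collapsing the anisotropy on synthetic near-degenerate vectors:

```python
import numpy as np

def all_but_the_top(X, k=1):
    """Mean-center, then remove the top-k principal components: the
    classic post-hoc correction for anisotropic embedding spaces."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc - (Xc @ Vt[:k].T) @ Vt[:k]

rng = np.random.default_rng(0)
# Near-collapsed vectors: one dominant shared direction plus small noise.
X = rng.normal(size=64) + 1e-3 * rng.normal(size=(10, 64))
Xp = all_but_the_top(X)
Xn = Xp / np.linalg.norm(Xp, axis=1, keepdims=True)
S = Xn @ Xn.T
mean_cos = float(S[~np.eye(10, dtype=bool)].mean())
```

After centering and top-component removal, the mean off-diagonal cosine drops from near 1 to near 0; whether the surviving directions carry task-relevant signal is exactly the open question the probing results raise.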

Load-bearing premise

The chosen fixed patch boundaries and four extraction strategies give a fair test of whether the recurrent state inherently compresses usable semantic sentence information.

What would settle it

A replication on the same five benchmarks and protocol that finds patch-boundary readouts statistically outperforming mean pooling across three seeds, or an independent measurement showing low anisotropy in the extracted vectors.

Figures

Figures reproduced from arXiv: 2605.00253 by Akash Singh, Bhagyashree Wagh.

Figure 1. The four extraction strategies and their empirical outcomes. (a) Patched-Mamba divides the input into 32-token patches, carries the SSM state h_t across patch boundaries, and extracts the post-projection token output y_t at each boundary. (b) Mean Pool averages all post-projection token outputs across every real position. (c) Final State extracts the raw SSM state h_T after the full sequence without passing t…
Figure 2. Pairwise cosine similarity heatmaps for ten semantically unrelated sentences (cats, quantum mechanics, stock markets, pizza, etc.). Left: mamba final state. Every off-diagonal cell is the same dark green (mean = 0.9999, std = 0.000044, diagonal excluded): all state vectors point in essentially the same direction regardless of content. Right: mamba mean pool from the same backbone. The matrix shows clear …
read the original abstract

Mamba's recurrent state h_t is, by construction, a compressed summary of every token seen so far. This raises a tempting hypothesis: if we extract token-level outputs y_t at fixed patch boundaries, we obtain semantic sentence summaries for free, with no pooling head, no fine-tuning, and no [CLS] token. We test this hypothesis carefully. Across five benchmarks (SST-2, CoLA, MRPC, STS-B, IMDb), we compare four strategies for extracting frozen sentence representations from a pretrained Mamba-130M backbone under a strict frozen-feature probing protocol, using three random seeds where computationally feasible. The results do not support the hypothesis: patch boundary readouts do not consistently outperform simple mean pooling. We identify and quantify two structural pathologies: severe anisotropy (mean pairwise cosine similarity 0.9999, std 0.000044) and representational collapse in the raw final SSM state (MCC = 0.000 on CoLA across all three seeds, confirmed via confusion matrix). We further propose orthogonal injection, a modified recurrence that constrains new information per

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper tests whether token-level outputs extracted from Mamba's recurrent state h_t at fixed patch boundaries can serve as semantic sentence representations without pooling, fine-tuning, or a [CLS] token. Using a frozen Mamba-130M backbone on five benchmarks (SST-2, CoLA, MRPC, STS-B, IMDb), it compares four extraction strategies under a strict linear-probe protocol with three random seeds, finding that patch-boundary readouts do not consistently outperform mean pooling. It further quantifies two pathologies—anisotropy (mean pairwise cosine ~0.9999) and representational collapse (MCC=0 on CoLA)—and proposes orthogonal injection as a modified recurrence.

Significance. If the negative result holds, it provides concrete evidence that Mamba's compressed recurrent state does not inherently yield usable sentence-level semantics at arbitrary boundaries, highlighting limitations of frozen SSM representations for NLP. The multi-benchmark, multi-seed design with metrics such as MCC and cosine similarity strengthens the empirical contribution and should inform future work on state extraction or architectural fixes in state-space models.

major comments (2)
  1. [§3] §3 (Extraction strategies and orthogonal injection): The four extraction strategies and the orthogonal injection modification are defined, but the manuscript does not report the exact patch-boundary indices used or provide pseudocode/equations for the injection update rule; without these, it is difficult to verify that the chosen boundaries constitute a fair test of the 'free summary' hypothesis or to reproduce the proposed fix for the observed collapse.
  2. [Results section] Results section, CoLA row of the main table: MCC=0 is reported for the raw final state across all three seeds, yet no per-seed variance, confusion-matrix breakdown, or comparison to a trivial baseline (e.g., always predicting the majority class) is supplied; this weakens the strength of the collapse claim as evidence against the recurrent state.
minor comments (3)
  1. [Abstract] Abstract: the description of orthogonal injection is truncated mid-sentence and omits any mention of the quantitative results or statistical protocol.
  2. The manuscript should include a brief discussion of how the observed anisotropy compares to known anisotropy in Transformer representations and whether any post-hoc whitening was attempted.
  3. Table captions and axis labels on any anisotropy or MCC plots should explicitly state the number of seeds and the exact pooling variants being compared.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the empirical contribution. We address each major comment point by point below and will revise the manuscript accordingly to improve reproducibility and strengthen the evidence presented.

read point-by-point responses
  1. Referee: [§3] §3 (Extraction strategies and orthogonal injection): The four extraction strategies and the orthogonal injection modification are defined, but the manuscript does not report the exact patch-boundary indices used or provide pseudocode/equations for the injection update rule; without these, it is difficult to verify that the chosen boundaries constitute a fair test of the 'free summary' hypothesis or to reproduce the proposed fix for the observed collapse.

    Authors: We agree that the exact patch-boundary indices and the update rule for orthogonal injection should be specified for reproducibility. In the revised manuscript, we will add the precise indices used for each benchmark (determined by fixed token intervals scaled to average sentence length in the dataset) and include both pseudocode and the full mathematical equations for the orthogonal injection modification in §3. This will allow direct verification of the extraction strategies and the proposed fix. revision: yes

  2. Referee: [Results section] Results section, CoLA row of the main table: MCC=0 is reported for the raw final state across all three seeds, yet no per-seed variance, confusion-matrix breakdown, or comparison to a trivial baseline (e.g., always predicting the majority class) is supplied; this weakens the strength of the collapse claim as evidence against the recurrent state.

    Authors: We appreciate this suggestion for strengthening the collapse claim. The MCC=0 result was consistent across all three seeds, and we had verified it internally using confusion matrices (which showed predictions collapsing exclusively to the majority class). In the revision, we will report the per-seed MCC values (all exactly 0), include the confusion-matrix breakdown in an appendix, and add an explicit comparison to the majority-class baseline (which also yields MCC=0). This will be noted in the results section or a footnote for clarity. revision: yes
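The claim that the majority-class baseline also yields MCC = 0 follows directly from the definition: when a predictor never emits one of the classes, a marginal of the confusion matrix is empty and the MCC denominator vanishes, so the score is defined as 0. A self-contained sketch (the 70/30 skew is illustrative, not CoLA's actual label ratio):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation from the binary confusion matrix; returns 0
    when any marginal is empty (the degenerate-denominator case)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else float(tp * tn - fp * fn) / denom

y = np.array([1] * 70 + [0] * 30)
print(mcc(y, np.ones_like(y)))  # majority-class predictor -> 0.0
```

This is why MCC = 0 alone cannot distinguish a collapsed representation from a probe that simply learned the class prior, and why the confusion-matrix breakdown the authors promise is the more informative evidence.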

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is a purely empirical probing study. It defines extraction strategies, benchmarks, and a frozen linear-probe protocol explicitly, then reports direct experimental comparisons (patch-boundary readouts vs. mean pooling) across seeds. No derivations, equations, or fitted parameters are presented that reduce to inputs by construction. The negative result on the hypothesis follows from the stated comparisons without load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and relies on standard assumptions in representation learning such as the validity of frozen probing and benchmark labels; no free parameters, new axioms, or invented entities are introduced in the reported results.

axioms (1)
  • domain assumption The pretrained Mamba-130M backbone provides a representative frozen feature extractor for testing the state-compression hypothesis.
    Central to the strict frozen-feature probing protocol described.

pith-pipeline@v0.9.0 · 5487 in / 1132 out tokens · 55189 ms · 2026-05-09T19:53:02.455729+00:00 · methodology

discussion (0)

