On the Expressive Power and Limitations of Multi-Layer SSMs
Pith reviewed 2026-05-10 12:33 UTC · model grok-4.3
The pith
Multi-layer SSMs lag streaming models on compositional tasks, but online CoT makes them equivalent in power.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-layer state-space models face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Offline CoT does not fundamentally increase their expressiveness, while online CoT can substantially increase their power. With online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Width and precision are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed.
What carries the argument
The formal distinction between online and offline chain-of-thought, together with finite-precision arithmetic constraints, used to prove equivalences and separations between multi-layer SSMs and streaming algorithms.
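The linear-recurrence semantics these results target (the linear core shared by S4-style models) can be sketched as follows; the function name and shapes are a minimal illustration of that semantics, not the paper's notation:

```python
import numpy as np

def ssm_layer(x, A, B, C):
    """One linear-recurrence SSM layer: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) inputs; A: (d_h, d_h); B: (d_h, d_in); C: (d_out, d_h).
    The state h has fixed size, so under finite precision it carries only
    a bounded number of bits across time -- the resource the proofs track.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t      # state carried over, never reset
        ys.append(C @ h)
    return np.stack(ys)
```

A multi-layer SSM composes several such layers; the separations then concern what fixed depth and bounded-precision state can compute in a single pass.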
If this is right
- Multi-layer SSMs without online CoT cannot solve the full class of compositional tasks that streaming algorithms handle.
- Offline CoT adds no fundamental expressive power to multi-layer SSMs.
- Online CoT renders width and precision interchangeable resources inside multi-layer SSMs.
- The overall power of SSMs is jointly determined by depth, finite precision, and the availability of online CoT.
Where Pith is reading between the lines
- Architectures that embed online reasoning steps inside SSM forward passes could extend their reach to tasks currently reserved for more general recurrent or attention-based models.
- The results suggest that simply increasing model depth or width will not overcome compositional limits unless paired with an online CoT mechanism.
- Designers of efficient sequence models may need to treat online CoT as a first-class architectural choice rather than an optional post-processing step.
Load-bearing premise
The specific formal definitions of compositional tasks, online versus offline CoT, and finite-precision arithmetic used to prove the equivalences and gaps.
What would settle it
A concrete compositional task on which a multi-layer SSM with online CoT fails to match a streaming algorithm (or succeeds where one should not) under matching width, depth, and finite precision.
Original abstract
We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multi-layer state-space models (SSMs) have inherent limitations in handling compositional tasks, creating a gap with streaming models. Offline chain-of-thought (CoT) does not substantially increase their expressiveness, whereas online CoT allows multi-layer SSMs to become equivalent in power to streaming algorithms. Additionally, while width and precision are not interchangeable in the base SSM model, they admit an equivalence under online CoT. The results are supported by proofs for the limitations, equivalences, and trade-offs.
Significance. If the formal results hold, this work provides a valuable unified perspective on the roles of depth, finite precision, and CoT in determining the capabilities of SSMs. This is significant for advancing the theoretical understanding of efficient sequence models like S4 and Mamba, and could guide practical design choices in model architecture and inference strategies. The provision of proofs for limitations, equivalences, and trade-offs is a strength.
Major comments (3)
- [§3] §3 (limitations of multi-layer SSMs): the gap with streaming models in compositional tasks is load-bearing for the first main claim; the proof must explicitly define the class of compositional tasks and the streaming-algorithm baseline to ensure the separation is not an artifact of the chosen formalization.
- [§5] §5 (online CoT equivalence): the headline result that online CoT lifts multi-layer SSMs to streaming-algorithm power depends on the precise model of online CoT (token generation count, re-injection into the SSM state update, and whether the state is reset or carried over). Any deviation from standard linear-recurrence SSM semantics (as in S4/Mamba) would invalidate both the limitation and recovery claims.
- [§6] §6 (width-precision tradeoff): the claim that width and precision become interchangeable only under online CoT requires an explicit finite-precision arithmetic model (bit width per state entry, rounding mode, and exactness of multiplication/accumulation). Without this, the tradeoff equivalence cannot be verified.
Minor comments (2)
- [Abstract] Abstract: the summary of results is clear, but a brief mention of the concrete models (S4, Mamba) would help readers connect the theory to practice.
- [Notation] Notation section: ensure symbols for hidden dimension, precision bits, and state update are defined once and used consistently in all theorems.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the clarity of our formal results. We address each major comment below and will incorporate revisions to make definitions and models explicit while preserving the core claims.
Point-by-point responses
Referee: [§3] §3 (limitations of multi-layer SSMs): the gap with streaming models in compositional tasks is load-bearing for the first main claim; the proof must explicitly define the class of compositional tasks and the streaming-algorithm baseline to ensure the separation is not an artifact of the chosen formalization.
Authors: We agree that explicit definitions strengthen the result. In the revised manuscript, we will add a new subsection at the start of §3 that formally defines compositional tasks as those requiring the sequential composition of k independent functions (e.g., iterated parity or nested modular counting) on the input stream, and defines the streaming baseline as constant-space streaming algorithms that may perform arbitrary (but finite) computation per token. The separation proof shows that fixed-depth multi-layer SSMs cannot maintain the necessary cross-composition state, while the streaming model can; this separation is robust under the stated definitions and aligns with standard automata-theoretic notions.
Revision: yes
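The constant-space streaming baseline the response describes (k per-token functions composed sequentially in one pass) might be sketched as follows; the helper name and the two example stages are ours, not the paper's:

```python
def streaming_composition(stream, stages):
    """One-pass streaming evaluation of k sequentially composed functions.

    Each stage is (state, value) -> (new_state, output); the output of
    stage i is fed to stage i+1. Total memory is the k per-stage states,
    i.e. constant in the stream length.
    """
    states = [0] * len(stages)
    out = None
    for tok in stream:
        val = tok
        for i, f in enumerate(stages):
            states[i], val = f(states[i], val)
        out = val
    return out

# Example stages: running parity, then a running count mod 3 of the parities.
parity = lambda s, x: (s ^ x, s ^ x)
mod3 = lambda s, x: ((s + x) % 3, (s + x) % 3)
```

Here `streaming_composition([1, 0, 1, 1], [parity, mod3])` walks the stream once with two integers of state; the separation argument is that a fixed-depth SSM cannot maintain the analogous cross-composition state.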
Referee: [§5] §5 (online CoT equivalence): the headline result that online CoT lifts multi-layer SSMs to streaming-algorithm power depends on the precise model of online CoT (token generation count, re-injection into the SSM state update, and whether the state is reset or carried over). Any deviation from standard linear-recurrence SSM semantics (as in S4/Mamba) would invalidate both the limitation and recovery claims.
Authors: We share the concern for precision. Our model of online CoT in §5 adheres to standard linear-recurrence SSM semantics: at each step the SSM produces a token that is immediately re-injected as the next input without state reset, the state is carried forward, and the number of generated tokens per original input token is bounded by a constant. We will insert a formal definition of this process (including the re-injection rule and state carry-over) at the beginning of §5 to make the equivalence to streaming algorithms fully verifiable under these semantics.
Revision: yes
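The online-CoT loop described in this response can be sketched as follows; `step`, `decode`, and `budget` are illustrative stand-ins for the paper's formal process, with the state collapsed to a single integer for brevity:

```python
def ssm_with_online_cot(tokens, step, decode, budget=1):
    """Online CoT over a recurrent state, per the semantics above:
    after each input token the model emits up to `budget` tokens, each
    immediately re-injected via the same update, with no state reset.
    """
    state = 0                              # stands in for the hidden state
    generated = []
    for tok in tokens:
        state = step(state, tok)           # consume the input token
        for _ in range(budget):            # bounded emissions per input
            g = decode(state)
            generated.append(g)
            state = step(state, g)         # re-inject; state carried over
    return generated
```

With, say, `step = lambda s, t: s + t` and `decode = lambda s: s % 2`, each generated token feeds back into the accumulation, which is what lets the model externalize working memory into the stream.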
Referee: [§6] §6 (width-precision tradeoff): the claim that width and precision become interchangeable only under online CoT requires an explicit finite-precision arithmetic model (bit width per state entry, rounding mode, and exactness of multiplication/accumulation). Without this, the tradeoff equivalence cannot be verified.
Authors: We accept that an explicit arithmetic model is required. The paper employs a fixed-point model with b-bit entries per state dimension, exact multiplication and accumulation within the bit width, and round-to-nearest rounding. We will expand the opening of §6 with a formal definition of this model (specifying bit width, rounding, and exactness) and restate the tradeoff theorem under it: without online CoT, width and precision are not interchangeable, while with online CoT they trade off linearly (constant product w·b suffices for equivalent power).
Revision: yes
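The fixed-point model named in this response (b-bit signed entries, round-to-nearest, saturation at the representable range) might look like this minimal sketch; the `frac_bits` split of the b bits is our illustrative choice, not the paper's:

```python
def quantize(v, b, frac_bits):
    """Map a real value to a signed b-bit fixed point with `frac_bits`
    fractional bits, round-to-nearest, saturating at the range ends."""
    scale = 1 << frac_bits
    q = int(round(v * scale))               # round-to-nearest
    lo, hi = -(1 << (b - 1)), (1 << (b - 1)) - 1
    return max(lo, min(hi, q)) / scale      # saturate, then rescale

# Under this model a width-w state stores w * b bits in total -- the
# quantity the tradeoff theorem holds constant under online CoT.
```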
Circularity Check
No circularity: equivalences derived from explicit model definitions and standard streaming comparisons
Full rationale
The abstract and framing present proofs of limitations for base multi-layer SSMs versus streaming models, the differential impact of offline versus online CoT, and width-precision tradeoffs, all grounded in formal definitions of the models, CoT variants, and task classes. These are compared against external benchmarks (streaming algorithms) rather than reducing to self-referential fits, self-citations, or ansatzes. No load-bearing step equates a claimed result to its own inputs by construction; the derivations remain self-contained once the chosen formalizations are accepted.
Axiom & Free-Parameter Ledger
axioms (1)
- Standard math: standard definitions of multi-layer SSMs, streaming algorithms, and finite-precision computation
Reference graph
Works this paper leans on
[1] Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando de Freitas, and Caglar Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
[2] Xinyu Mao, Guangxu Yang, and Jiapeng Zhang. Gadgetless lifting beats round elimination: Improved lower bounds for pointer chasing. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025), 2025.
[3] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077, 2023.
[4] Taylan Soydan, Nikola Zubić, Nico Messikommer, Siddhartha Mishra, and Davide Scaramuzza. S7: Selective and simplified state space layers for sequence modeling. arXiv preprint arXiv:2410.03464, 2024.