On the Expressive Power and Limitations of Multi-Layer SSMs
Pith reviewed 2026-05-10 12:33 UTC · model grok-4.3
The pith
Multi-layer SSMs lag streaming models on compositional tasks, but online CoT makes them equivalent in power.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-layer state-space models face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Offline CoT does not fundamentally increase their expressiveness, while online CoT can substantially increase their power. With online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Width and precision are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed.
What carries the argument
The formal distinction between online and offline chain-of-thought, together with finite-precision arithmetic constraints, used to prove equivalences and separations between multi-layer SSMs and streaming algorithms.
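The linear-recurrence semantics these results target (the linear core shared by S4-style models) can be sketched as follows; the function name and shapes are a minimal illustration of that semantics, not the paper's notation:

```python
import numpy as np

def ssm_layer(x, A, B, C):
    """One linear-recurrence SSM layer: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) inputs; A: (d_h, d_h); B: (d_h, d_in); C: (d_out, d_h).
    The state h has fixed size, so under finite precision it carries only
    a bounded number of bits across time -- the resource the proofs track.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t      # state carried over, never reset
        ys.append(C @ h)
    return np.stack(ys)
```

A multi-layer SSM composes several such layers; the separations then concern what fixed depth and bounded-precision state can compute in a single pass.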
If this is right
- Multi-layer SSMs without online CoT cannot solve the full class of compositional tasks that streaming algorithms handle.
- Offline CoT adds no fundamental expressive power to multi-layer SSMs.
- Online CoT renders width and precision interchangeable resources inside multi-layer SSMs.
- The overall power of SSMs is jointly determined by depth, finite precision, and the availability of online CoT.
Where Pith is reading between the lines
- Architectures that embed online reasoning steps inside SSM forward passes could extend their reach to tasks currently reserved for more general recurrent or attention-based models.
- The results suggest that simply increasing model depth or width will not overcome compositional limits unless paired with an online CoT mechanism.
- Designers of efficient sequence models may need to treat online CoT as a first-class architectural choice rather than an optional post-processing step.
Load-bearing premise
The specific formal definitions of compositional tasks, online versus offline CoT, and finite-precision arithmetic used to prove the equivalences and gaps.
What would settle it
A concrete compositional task on which a multi-layer SSM with online CoT fails to match a streaming algorithm (or succeeds where one should not) under matching width, depth, and finite precision.
Original abstract
We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multi-layer state-space models (SSMs) have inherent limitations in handling compositional tasks, creating a gap with streaming models. Offline chain-of-thought (CoT) does not substantially increase their expressiveness, whereas online CoT allows multi-layer SSMs to become equivalent in power to streaming algorithms. Additionally, while width and precision are not interchangeable in the base SSM model, they admit an equivalence under online CoT. The results are supported by proofs for the limitations, equivalences, and trade-offs.
Significance. If the formal results hold, this work provides a valuable unified perspective on the roles of depth, finite precision, and CoT in determining the capabilities of SSMs. This is significant for advancing the theoretical understanding of efficient sequence models like S4 and Mamba, and could guide practical design choices in model architecture and inference strategies. The provision of proofs for limitations, equivalences, and trade-offs is a strength.
Major comments (3)
- [§3] §3 (limitations of multi-layer SSMs): the gap with streaming models in compositional tasks is load-bearing for the first main claim; the proof must explicitly define the class of compositional tasks and the streaming-algorithm baseline to ensure the separation is not an artifact of the chosen formalization.
- [§5] §5 (online CoT equivalence): the headline result that online CoT lifts multi-layer SSMs to streaming-algorithm power depends on the precise model of online CoT (token generation count, re-injection into the SSM state update, and whether the state is reset or carried over). Any deviation from standard linear-recurrence SSM semantics (as in S4/Mamba) would invalidate both the limitation and recovery claims.
- [§6] §6 (width-precision tradeoff): the claim that width and precision become interchangeable only under online CoT requires an explicit finite-precision arithmetic model (bit width per state entry, rounding mode, and exactness of multiplication/accumulation). Without this, the tradeoff equivalence cannot be verified.
Minor comments (2)
- [Abstract] Abstract: the summary of results is clear, but a brief mention of the concrete models (S4, Mamba) would help readers connect the theory to practice.
- [Notation] Notation section: ensure symbols for hidden dimension, precision bits, and state update are defined once and used consistently in all theorems.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the clarity of our formal results. We address each major comment below and will incorporate revisions to make definitions and models explicit while preserving the core claims.
Point-by-point responses
Referee: [§3] §3 (limitations of multi-layer SSMs): the gap with streaming models in compositional tasks is load-bearing for the first main claim; the proof must explicitly define the class of compositional tasks and the streaming-algorithm baseline to ensure the separation is not an artifact of the chosen formalization.
Authors: We agree that explicit definitions strengthen the result. In the revised manuscript, we will add a new subsection at the start of §3 that formally defines compositional tasks as those requiring the sequential composition of k independent functions (e.g., iterated parity or nested modular counting) on the input stream, and defines the streaming baseline as constant-space streaming algorithms that may perform arbitrary (but finite) computation per token. The separation proof shows that fixed-depth multi-layer SSMs cannot maintain the necessary cross-composition state, while the streaming model can; this separation is robust under the stated definitions and aligns with standard automata-theoretic notions.
Revision: yes
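The constant-space streaming baseline the response describes (k per-token functions composed sequentially in one pass) might be sketched as follows; the helper name and the two example stages are ours, not the paper's:

```python
def streaming_composition(stream, stages):
    """One-pass streaming evaluation of k sequentially composed functions.

    Each stage is (state, value) -> (new_state, output); the output of
    stage i is fed to stage i+1. Total memory is the k per-stage states,
    i.e. constant in the stream length.
    """
    states = [0] * len(stages)
    out = None
    for tok in stream:
        val = tok
        for i, f in enumerate(stages):
            states[i], val = f(states[i], val)
        out = val
    return out

# Example stages: running parity, then a running count mod 3 of the parities.
parity = lambda s, x: (s ^ x, s ^ x)
mod3 = lambda s, x: ((s + x) % 3, (s + x) % 3)
```

Here `streaming_composition([1, 0, 1, 1], [parity, mod3])` walks the stream once with two integers of state; the separation argument is that a fixed-depth SSM cannot maintain the analogous cross-composition state.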
Referee: [§5] §5 (online CoT equivalence): the headline result that online CoT lifts multi-layer SSMs to streaming-algorithm power depends on the precise model of online CoT (token generation count, re-injection into the SSM state update, and whether the state is reset or carried over). Any deviation from standard linear-recurrence SSM semantics (as in S4/Mamba) would invalidate both the limitation and recovery claims.
Authors: We share the concern for precision. Our model of online CoT in §5 adheres to standard linear-recurrence SSM semantics: at each step the SSM produces a token that is immediately re-injected as the next input without state reset, the state is carried forward, and the number of generated tokens per original input token is bounded by a constant. We will insert a formal definition of this process (including the re-injection rule and state carry-over) at the beginning of §5 to make the equivalence to streaming algorithms fully verifiable under these semantics.
Revision: yes
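The online-CoT loop described in this response can be sketched as follows; `step`, `decode`, and `budget` are illustrative stand-ins for the paper's formal process, with the state collapsed to a single integer for brevity:

```python
def ssm_with_online_cot(tokens, step, decode, budget=1):
    """Online CoT over a recurrent state, per the semantics above:
    after each input token the model emits up to `budget` tokens, each
    immediately re-injected via the same update, with no state reset.
    """
    state = 0                              # stands in for the hidden state
    generated = []
    for tok in tokens:
        state = step(state, tok)           # consume the input token
        for _ in range(budget):            # bounded emissions per input
            g = decode(state)
            generated.append(g)
            state = step(state, g)         # re-inject; state carried over
    return generated
```

With, say, `step = lambda s, t: s + t` and `decode = lambda s: s % 2`, each generated token feeds back into the accumulation, which is what lets the model externalize working memory into the stream.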
Referee: [§6] §6 (width-precision tradeoff): the claim that width and precision become interchangeable only under online CoT requires an explicit finite-precision arithmetic model (bit width per state entry, rounding mode, and exactness of multiplication/accumulation). Without this, the tradeoff equivalence cannot be verified.
Authors: We accept that an explicit arithmetic model is required. The paper employs a fixed-point model with b-bit entries per state dimension, exact multiplication and accumulation within the bit width, and round-to-nearest rounding. We will expand the opening of §6 with a formal definition of this model (specifying bit width, rounding, and exactness) and restate the tradeoff theorem under it: without online CoT, width and precision are not interchangeable, while with online CoT they trade off linearly (constant product w·b suffices for equivalent power).
Revision: yes
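The fixed-point model named in this response (b-bit signed entries, round-to-nearest, saturation at the representable range) might look like this minimal sketch; the `frac_bits` split of the b bits is our illustrative choice, not the paper's:

```python
def quantize(v, b, frac_bits):
    """Map a real value to a signed b-bit fixed point with `frac_bits`
    fractional bits, round-to-nearest, saturating at the range ends."""
    scale = 1 << frac_bits
    q = int(round(v * scale))               # round-to-nearest
    lo, hi = -(1 << (b - 1)), (1 << (b - 1)) - 1
    return max(lo, min(hi, q)) / scale      # saturate, then rescale

# Under this model a width-w state stores w * b bits in total -- the
# quantity the tradeoff theorem holds constant under online CoT.
```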
Circularity Check
No circularity: equivalences derived from explicit model definitions and standard streaming comparisons
Full rationale
The abstract and framing present proofs of limitations for base multi-layer SSMs versus streaming models, the differential impact of offline versus online CoT, and width-precision tradeoffs, all grounded in formal definitions of the models, CoT variants, and task classes. These are compared against external benchmarks (streaming algorithms) rather than reducing to self-referential fits, self-citations, or ansatzes. No load-bearing step equates a claimed result to its own inputs by construction; the derivations remain self-contained once the chosen formalizations are accepted.
Axiom & Free-Parameter Ledger
axioms (1)
- Standard math: standard definitions of multi-layer SSMs, streaming algorithms, and finite-precision computation
Reference graph
Works this paper leans on
[1] Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando de Freitas, and Caglar Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
[2] Xinyu Mao, Guangxu Yang, and Jiapeng Zhang. Gadgetless lifting beats round elimination: Improved lower bounds for pointer chasing. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025), 2025.
[3] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077, 2023.
[4] Taylan Soydan, Nikola Zubić, Nico Messikommer, Siddhartha Mishra, and Davide Scaramuzza. S7: Selective and simplified state space layers for sequence modeling. arXiv preprint arXiv:2410.03464, 2024.