pith. machine review for the scientific record.

arxiv: 2604.18580 · v2 · submitted 2026-04-20 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Sessa: Selective State Space Attention

Liubomyr Horbatko

Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sequence modeling · state space models · attention mechanisms · long context · power-law memory · selective retrieval · recurrent feedback

The pith

Sessa places attention inside recurrent feedback paths to produce power-law memory tails and distance-independent selective retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sessa as a decoder architecture that embeds attention directly inside a recurrent state feedback loop. This setup gives past tokens multiple distinct pathways for shaping future states instead of depending on one attention operation or one recurrent chain. The authors prove that, under explicit assumptions and matched regimes, this structure produces memory influence decaying as a power law O(ℓ^{-β}) with 0 < β < 1, slower than the decay in matched Transformers or Mamba-style models. They also give an explicit construction that attains this rate and show that, among the compared classes, only Sessa supports flexible selective retrieval profiles whose strength does not fall with distance. Matched experiments show the strongest long-context results while Sessa remains competitive on short-context language modeling.

Core claim

By placing attention inside a recurrent feedback path, Sessa creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. Under explicit assumptions and matched regimes, Sessa admits power-law memory tails O(ℓ^{-β}) for 0 < β < 1 with slower decay than corresponding Transformer and Mamba-style baselines, and an explicit construction achieves this rate. Under the same assumptions Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance.
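Read as an editorial gloss (a schematic contrast, not the paper's theorem statements; the dilution rate for diffuse attention is shorthand for the abstract's description, and the assumptions behind each bound are not enumerated in the abstract), the competing memory-tail regimes look like:

```latex
% Schematic memory-tail comparison (illustrative; not the paper's exact bounds).
% I(\ell) denotes the influence of a token \ell positions in the past on the current state.
\begin{align*}
  \text{recurrent chain (Mamba-style):} \quad & I(\ell) = O(\rho^{\ell}),\ 0 < \rho < 1
    && \text{exponential forgetting} \\
  \text{diffuse attention (Transformer):} \quad & I(\ell) = O(1/T)
    && \text{diluted across an effective support of size } T \\
  \text{Sessa (claimed):} \quad & I(\ell) = O(\ell^{-\beta}),\ 0 < \beta < 1
    && \text{power-law tail, slower decay than either baseline}
\end{align*}
```

One reading of why this matters: for 0 < β < 1 the series ∑ ℓ^{-β} diverges, so distant tokens retain non-negligible aggregate influence, which is the sense in which a power-law tail favors long contexts.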

What carries the argument

The Sessa decoder, which places attention inside a recurrent feedback path to create multiple influence paths from past tokens to future states.
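For intuition only, here is a minimal toy sketch of "attention inside a recurrent feedback path". Everything below (function names, projections, the additive update) is invented for exposition and is not the paper's Sessa mixer, which additionally uses a separate forward causal attention and a causal feedback solve:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_in_feedback_loop(x, Wq, Wk, Wv, Wu):
    """Toy decoder pass with an attention read inside the state recurrence.
    Illustrative only; this is NOT the paper's Sessa mixer.
    x: (T, D) token embeddings; Wq, Wk, Wv, Wu: (D, D) projections."""
    T, D = x.shape
    states = np.zeros((T, D))
    s = x[0] @ Wu                         # first state: input drive only
    states[0] = s
    for t in range(1, T):
        u = x[t] @ Wu                     # input drive for the current step
        past = states[:t]                 # strict past states (feedback path)
        q = s @ Wq                        # query built from the previous state
        scores = (past @ Wk) @ q / np.sqrt(D)
        weights = softmax(scores)         # causal feedback-attention weights
        feedback = weights @ (past @ Wv)  # attention read over past states
        s = u + feedback                  # new state depends on all past states
        states[t] = s
    return states
```

The structural point: because every state is itself an attention read over earlier attention-derived states, a token at position i can influence a later position through many nested attention paths rather than through one attention read or one decaying recurrent chain.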

If this is right

  • Memory influence from past tokens decays as a power law rather than exponentially, preserving distant information more effectively.
  • Selective retrieval becomes possible with influence profiles that do not decay with distance.
  • Sessa records the strongest results on long-context benchmarks while staying competitive on short-context language modeling.
  • Only Sessa among the examined classes achieves flexible selective retrieval under the stated assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of new sequence models could adopt the same nested attention-recurrent pattern to obtain controllable memory decay rates.
  • The approach points to a general route for overcoming both attention dilution and recurrent forgetting by multiplying the number of influence pathways.
  • Empirical tests on real-world long documents could reveal how sensitive the observed power-law behavior is to the paper's modeling assumptions.

Load-bearing premise

The power-law memory tails and unique selective retrieval hold only under the paper's explicit assumptions and matched regimes.

What would settle it

Direct measurement of how token influence strength decays with distance in trained Sessa models on long sequences, checking whether the decay follows a power law with exponent between 0 and 1 rather than the faster decay of the baselines.
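A hedged sketch of such a check (the influence values below are synthetic; in practice they might be, say, gradient norms of the output with respect to a token ℓ positions back, averaged over prompts, which is an assumed protocol rather than one the paper specifies):

```python
import numpy as np

def fit_power_law_exponent(distances, influences):
    """Fit influence(l) ~ c * l^(-beta) by least squares in log-log space and
    return beta. Diagnostic sketch only; not the paper's measurement protocol."""
    logd = np.log(np.asarray(distances, dtype=float))
    logi = np.log(np.asarray(influences, dtype=float))
    slope, _ = np.polyfit(logd, logi, deg=1)
    return -slope            # 0 < beta < 1 would match the paper's claim

def fit_exponential_rate(distances, influences):
    """Fit influence(l) ~ c * exp(-lambda * l) in semi-log space, for contrast."""
    d = np.asarray(distances, dtype=float)
    logi = np.log(np.asarray(influences, dtype=float))
    slope, _ = np.polyfit(d, logi, deg=1)
    return -slope

# Synthetic stand-in for measured influences (hypothetical power-law data, beta = 0.5).
ell = np.arange(1, 513, dtype=float)
measured = 0.9 * ell ** -0.5
print(fit_power_law_exponent(ell, measured))   # ~0.5 -> consistent with the claim
print(fit_exponential_rate(ell, measured))     # exponential fit is poor on such data
```

Comparing the quality of the two fits on trained Sessa, Transformer, and Mamba-style models over long sequences would directly test whether only Sessa sits in the power-law regime.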

Figures

Figures reproduced from arXiv: 2604.18580 by Liubomyr Horbatko.

Figure 1. One-hop and multi-hop temporal routing within a single mixer layer.
Figure 2. Sessa Layer. (Text adjacent to the figure, from §3.2 "Sessa mixer": the mixer maps ā ∈ ℝ^{B_batch×T×D} to s ∈ ℝ^{B_batch×T×D}; it uses two causal attention mechanisms, (i) a forward causal attention that produces a forward signal f_t ∈ ℝ^D, and (ii) a feedback attention that produces weights over the strict past, used inside a causal feedback solve. At each time t, forward queries, keys, and values are formed, as well as feedback queries and…)
Original abstract

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails $O(\ell^{-\beta})$ for $0 < \beta < 1$, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes Sessa, a decoder architecture that embeds attention mechanisms inside a recurrent feedback path, thereby creating multiple attention-based paths through which past tokens can influence future states. It proves that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails O(ℓ^{-β}) for 0 < β < 1 with slower decay than corresponding Transformer and Mamba-style baselines, supplies an explicit construction achieving this rate, and is the only model class among those considered that realizes flexible selective retrieval including non-decaying influence profiles. Empirical results across matched experiments show Sessa achieving the strongest performance on long-context benchmarks while remaining competitive with baselines on short-context language modeling.

Significance. If the theoretical claims hold, the work offers a meaningful advance by providing a hybrid architecture with provable long-range memory advantages over both attention-only and recurrent-only families, together with an explicit construction and a uniqueness result for selective retrieval. These elements, if substantiated, could inform the design of efficient long-context models and address dilution and sensitivity issues in existing mechanisms.

major comments (1)
  1. [§3] §3 (theoretical analysis): the power-law tail result and the uniqueness claim for non-decaying selective retrieval are load-bearing; the manuscript should explicitly enumerate all assumptions, state the precise matched regimes used for the Transformer/Mamba comparisons, and discuss how restrictive those assumptions are for practical training regimes.
minor comments (3)
  1. [Abstract] The abstract refers to 'explicit assumptions' without listing them; a one-sentence summary of the key assumptions would improve accessibility.
  2. [§2] Notation for the memory tail O(ℓ^{-β}) should include an early definition of the distance variable ℓ and the range of β in the main text.
  3. [Experiments] More detail on how the 'matched regimes' are enforced across model classes (e.g., parameter count, training steps, context length) would strengthen reproducibility (see the illustrative sketch after this list).
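As an illustration of the kind of record the referee is asking for (every field name and number below is hypothetical; the paper's actual matching criteria are not stated in the abstract):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatchedRegime:
    """Hypothetical record of what 'matched' could mean per model class."""
    model_class: str        # "sessa" | "transformer" | "mamba"
    param_count: int        # total trainable parameters
    context_length: int     # training/evaluation sequence length
    train_steps: int        # optimization steps
    tokens_per_step: int    # batch size in tokens

regimes = [
    MatchedRegime("sessa",       125_000_000, 8_192, 100_000, 524_288),
    MatchedRegime("transformer", 125_000_000, 8_192, 100_000, 524_288),
    MatchedRegime("mamba",       125_000_000, 8_192, 100_000, 524_288),
]
# A reproducibility appendix would pin these fields (and any deviations),
# so that "matched" is auditable rather than asserted.
assert len({(r.param_count, r.context_length, r.train_steps, r.tokens_per_step)
            for r in regimes}) == 1
```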

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the constructive suggestion regarding the theoretical section. We address the major comment below and will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (theoretical analysis): the power-law tail result and the uniqueness claim for non-decaying selective retrieval are load-bearing; the manuscript should explicitly enumerate all assumptions, state the precise matched regimes used for the Transformer/Mamba comparisons, and discuss how restrictive those assumptions are for practical training regimes.

    Authors: We agree that the power-law tail result and uniqueness claim are central, and that their supporting assumptions and comparison regimes merit explicit enumeration for clarity. In the revised manuscript we will add a dedicated paragraph (or short subsection) at the beginning of §3 that: (1) lists every assumption used in the proofs of the O(ℓ^{-β}) memory tails and the selective-retrieval uniqueness result; (2) states the precise matched regimes (parameter count, state dimension, sequence length, and architectural constraints) under which the Transformer and Mamba-style baselines are compared; and (3) discusses the restrictiveness of these assumptions for practical training, including any known relaxations and their implications for real-world deployment. These additions will not change the stated theorems but will make the scope of the claims fully transparent. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained under explicit assumptions; no circularity

Full rationale

The central claims consist of a proof of power-law memory tails O(ℓ^{-β}) (0<β<1) and an explicit construction for flexible selective retrieval, both stated to hold only under the paper's explicit assumptions and matched regimes. The abstract and described results frame these as derivations from the model definition (attention inside recurrent feedback) rather than reductions to fitted parameters, self-citations, or renamed inputs. No load-bearing step reduces by construction to its own outputs, and no self-citation chain is invoked for uniqueness or ansatz. The result is therefore independent of the target claims and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete; the central claims rest on unspecified explicit assumptions for the power-law analysis and on the validity of the matched experimental regimes.

axioms (1)
  • domain assumption: explicit assumptions on model regimes and memory dynamics that enable the power-law tail analysis
    The abstract states that the power-law result holds under these assumptions but does not enumerate them.
invented entities (1)
  • Sessa decoder class (no independent evidence)
    purpose: To realize multiple attention-based paths through a recurrent feedback loop
    New architecture introduced in the paper to combine attention and recurrence.

pith-pipeline@v0.9.0 · 5539 in / 1407 out tokens · 59270 ms · 2026-05-10T04:42:33.121028+00:00 · methodology

