Optimal Decay Spectra for Linear Recurrences
Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3
The pith
Geometric log-decay reparameterization plus position-adaptive scaling achieves minimax-optimal exponential memory retention in linear recurrent models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The decay spectrum determines long-range retention in linear recurrent models. For N channels, random initialization collapses the minimum spectral gap to O(N^{-2}), producing sub-exponential error exp(-Ω(N/log N)); linear spacing avoids collapse but yields only exp(-O(N/√T)). Spectral Reparameterization structurally enforces geometrically spaced log-decay rates and is proven minimax optimal at rate O(exp(-cN/log T)). Position-Adaptive Scaling is the unique mechanism that removes static scale mismatch (where only N log t / log T channels are effective at position t) by stretching the spectrum to the actual dependency range, sharpening the rate to O(exp(-cN/log t)) and natively inducing scale-free, fractionally invariant impulse responses.
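The "N log t / log T effective channels" bookkeeping can be sanity-checked numerically. The snippet below is our reading, not the paper's code: it assumes a static geometric grid of log-decay rates on [1/T, 1] and calls a channel effective at position t when its timescale 1/p_k falls inside the reachable window [1, t].

```python
import numpy as np

def count_effective(N, T, t):
    """Channels of a static geometric spectrum on [1/T, 1] whose
    timescale 1/p_k fits inside the reachable window [1, t]."""
    p = np.geomspace(1.0 / T, 1.0, N)  # geometrically spaced log-decay rates
    return int(np.sum(p >= 1.0 / t))   # timescale 1/p_k <= t

N, T = 64, 4096
counts = {t: count_effective(N, T, t) for t in (16, 256, 4096)}
# The abstract's prediction is N * log(t) / log(T) at each position t;
# a spectrum stretched to [1/t, 1] would instead use all N channels.
```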
What carries the argument
Position-Adaptive Spectral Tapering (PoST) via Spectral Reparameterization to enforce geometric log-decay spacing and Position-Adaptive Scaling to dynamically match the spectrum to current position t.
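Concretely, the two mechanisms might be sketched as follows. This is a minimal reconstruction from the abstract, assuming the log-decay rates live on a geometric grid over [1/T, 1] and that adaptive scaling re-maps that grid onto [1/t, 1] by exponent stretching; the paper's actual parameterization may differ.

```python
import numpy as np

def geometric_log_decays(N, T):
    """Spectral Reparameterization (assumed form): geometrically spaced
    log-decay rates covering timescales from 1 step up to T steps."""
    return np.geomspace(1.0 / T, 1.0, N)

def adaptive_decay_factors(N, T, t):
    """Position-Adaptive Scaling (assumed form): exponent-stretch the grid
    so it spans [1/t, 1] at position t, keeping the spacing geometric."""
    p = geometric_log_decays(N, T)
    alpha = np.log(t) / np.log(T)  # stretch exponent; alpha = 1 when t = T
    p_t = p ** alpha               # maps [1/T, 1] onto [1/t, 1]
    return np.exp(-p_t)            # per-step decay factors, all in (0, 1)

d = adaptive_decay_factors(N=8, T=4096, t=256)
```

Exponent stretching keeps the grid geometric at every position, which is one way to read the claim that adaptive scaling preserves the minimax-optimal spacing while matching the spectrum to the dependency range.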
If this is right
- Linear recurrent models achieve exponential memory improvement scaling with channel count N rather than sub-exponential or algebraic rates.
- Zero-shot language modeling performance improves consistently when PoST is integrated into Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet at 180M-440M scales.
- Long-context retrieval accuracy rises substantially on tasks such as MQAR and NIAH for Mamba-2 without added compute.
- The impulse response becomes scale-free, allowing channels to interpolate between relative and absolute temporal coordinates.
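The scale-free claim has a standard reading for exponential mixtures (our gloss, not a statement from the paper): superposing exponentials whose rates form a geometric grid approximates a power law,

```latex
h(t) \;=\; \sum_{k=1}^{N} c_k\, e^{-p_k t}, \qquad p_k = p_1 r^{k-1}, \quad r > 1 .
```

With weights $c_k \propto p_k^{\beta}$, the log-spaced sum Riemann-approximates $\int_0^\infty p^{\beta} e^{-pt}\, \frac{dp}{p} = \Gamma(\beta)\, t^{-\beta}$, so over the covered range $h(\lambda t) \approx \lambda^{-\beta} h(t)$: the response has no preferred timescale, consistent with channels interpolating between relative and absolute temporal coordinates.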
Where Pith is reading between the lines
- The same spectral mechanisms could be tested for gains in linear models outside the five architectures evaluated.
- Direct measurement of empirical error curves against the claimed rates on controlled synthetic tasks would provide a clear test of the optimality proofs.
- Scale-free responses may improve generalization when training and inference sequence lengths differ substantially.
Load-bearing premise
Any diagonal linear recurrence can accept the reparameterization and adaptive scaling without introducing instability or unintended changes to the learned dynamics.
What would settle it
Train models with and without the proposed mechanisms on a synthetic long-range dependency task while varying N and T, then check whether measured retention error follows the predicted O(exp(-cN/log t)) curve versus the slower baseline rates.
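A toy version of that test, under assumptions of ours (retention measured as the least-squares error of reconstructing a delayed-impulse kernel from channel impulse responses; the paper's task and metric may differ):

```python
import numpy as np

def retention_error(decay_factors, lag, T):
    """Residual of the best linear read-out reproducing 'recall the input
    from `lag` steps ago' with an N-channel diagonal linear recurrence."""
    s = np.arange(1, T + 1)
    A = decay_factors[None, :] ** s[:, None]   # (T, N) channel impulse responses
    target = (s == lag).astype(float)          # delayed-impulse kernel
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.linalg.norm(A @ coef - target))

T, lag = 512, 64
geometric = np.exp(-np.geomspace(1.0 / T, 1.0, 16))  # geometric log-decay rates
linear = np.exp(-np.linspace(1.0 / T, 1.0, 16))      # linearly spaced rates
err_geo = retention_error(geometric, lag, T)
err_lin = retention_error(linear, lag, T)
```

Sweeping N and the lag and regressing log-error against N / log(lag) would discriminate the claimed O(exp(-cN/log t)) rate from the slower baseline rates.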
Figures
Original abstract
Linear recurrent models offer linear-time sequence processing but often suffer from suboptimal long-range memory. We trace this to the decay spectrum: for $N$ channels, random initialization collapses the minimum spectral gap to $O(N^{-2})$, yielding sub-exponential error $\exp(-\Omega(N/\log N))$; linear spacing avoids collapse but degrades to $\exp(-O(N/\sqrt{T}))$, practically algebraic over long contexts. We introduce Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log-decay rates, proven minimax optimal at rate $O(\exp(-cN/\log T))$; and (2) Position-Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only $N\log t/\log T$ of $N$ channels are effective at position $t$) by stretching the spectrum to the actual dependency range, sharpening the rate to $O(\exp(-cN/\log t))$. This scaling natively induces fractional invariance: the impulse response becomes scale-free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre-training at 180M-440M scales shows consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that suboptimal long-range memory in linear recurrent models stems from poor decay spectra under random initialization or linear spacing. It introduces Position-Adaptive Spectral Tapering (PoST), combining (1) Spectral Reparameterization that enforces geometrically spaced log-decay rates, asserted to be minimax optimal at rate O(exp(-cN/log T)), and (2) Position-Adaptive Scaling, asserted to be the unique fix for position-dependent scale mismatch (only N log t / log T channels effective at position t), sharpening the rate to O(exp(-cN/log t)) and inducing fractional invariance. PoST is architecture-agnostic with zero overhead and is instantiated on Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet, yielding empirical gains in 180M-440M pre-training, MQAR/NIAH retrieval, and other tasks.
Significance. If the optimality and uniqueness claims hold rigorously and the modifications preserve stability and expressivity, PoST would provide a principled, low-overhead improvement to decay spectra in efficient linear recurrences, with potential impact on long-context scaling for models like Mamba. The multi-architecture empirical results, if reproducible, would strengthen the case for broad applicability.
major comments (3)
- [Abstract] The central optimality claim ('proven minimax optimal at rate O(exp(-cN/log T))') and the uniqueness claim for Position-Adaptive Scaling are stated without any derivation, minimax setup, error metric, or function class; this prevents verifying whether geometric spacing is independent of parameter choices or reduces to a fitted quantity by construction.
- [Abstract] The assertion that PoST integrates into any diagonal linear recurrence 'without overhead' and without altering learned dynamics is load-bearing for the architecture-agnostic claim, yet no analysis addresses whether position-dependent stretching introduces time-varying parameters that affect recurrence stability and gradient flow, or that require implicit regularization not captured in the rate analysis.
- [Abstract] The claim that geometric spacing does not constrain expressivity (i.e., that models can still learn non-geometric decays) is unexamined; no theoretical bound or ablation shows that enforcing the spectrum leaves the original model's eigenvalue distribution and training dynamics unchanged.
minor comments (2)
- The abstract references code at https://github.com/SiLifen/PoST but provides no experimental details (hyperparameters, training protocol, baseline implementations) needed to reproduce the reported zero-shot and retrieval gains.
- Notation in the rates (e.g., T vs. t, the constant c) should be defined explicitly when first introduced, as T appears to denote maximum context while t is per-position.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications on the theoretical claims and indicating where revisions can strengthen the presentation.
Point-by-point responses
Referee: [Abstract] The central optimality claim ('proven minimax optimal at rate O(exp(-cN/log T))') and the uniqueness claim for Position-Adaptive Scaling are stated without any derivation, minimax setup, error metric, or function class; this prevents verifying whether geometric spacing is independent of parameter choices or reduces to a fitted quantity by construction.
Authors: The abstract summarizes the key results, but the full derivations are provided in the main text. Specifically, Section 3 presents the minimax optimization problem where we seek to minimize the maximum approximation error over all positions t ≤ T for an N-channel linear recurrence. The error metric is the supremum of the residual decay error, and the function class is the set of all possible decay rate assignments. We prove that geometric spacing in log-decay rates is the unique solution achieving the rate O(exp(-c N / log T)), and this holds independently of parameter choices as it arises from the optimal covering of the logarithmic scale. For Position-Adaptive Scaling, we show uniqueness by proving that any fixed spectrum results in only a fraction log t / log T of channels being effective at position t, and the adaptive stretching is the only mechanism that equalizes the effective spectrum across positions. We will revise the abstract to include a pointer to Section 3 for readers seeking the setup. revision: partial
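For readers without access to Section 3, one plausible formalization of the setup described here (our reconstruction; the paper's exact statement may differ) is

```latex
E_N(T) \;=\; \min_{0 < p_1 < \cdots < p_N} \; \sup_{f \in \mathcal{F}} \; \max_{1 \le t \le T} \; \inf_{c \in \mathbb{R}^N} \Big|\, f(t) - \sum_{k=1}^{N} c_k \, e^{-p_k t} \Big| ,
```

with the claims that $E_N(T) = O(\exp(-cN/\log T))$, that the optimal rates $p_k$ are geometrically spaced, and that any fixed choice of $\{p_k\}$ leaves only about $N \log t / \log T$ channels effective at a position $t < T$.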
Referee: [Abstract] The assertion that PoST integrates into any diagonal linear recurrence 'without overhead' and without altering learned dynamics is load-bearing for the architecture-agnostic claim, yet no analysis addresses whether position-dependent stretching introduces time-varying parameters that affect recurrence stability and gradient flow, or that require implicit regularization not captured in the rate analysis.
Authors: PoST applies a deterministic, position-dependent scaling to the fixed geometric spectrum at each timestep, introducing no additional parameters or computational overhead beyond the original recurrence. The scaling is a smooth function of t, ensuring that the effective decay rates remain bounded in [0,1] for stability. In the manuscript, we demonstrate through experiments on multiple architectures that training proceeds stably without extra regularization. A detailed analysis of the time-varying Jacobian and its impact on gradient flow is not explicitly derived in the current version but can be included as supplementary material, as the Lipschitz continuity of the scaling preserves the contraction mapping properties of the linear recurrence. revision: partial
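The boundedness and smoothness properties invoked here can be spot-checked numerically under our assumed scaling form (exponent-stretching a geometric log-rate grid, with the stretch clamped at t = 2; the paper's parameterization may differ):

```python
import numpy as np

T, N = 4096, 16
p = np.geomspace(1.0 / T, 1.0, N)                 # static geometric log-rates
t = np.arange(1, T + 1)
alpha = np.log(np.maximum(t, 2)) / np.log(T)      # per-position stretch exponent
decay = np.exp(-(p[None, :] ** alpha[:, None]))   # (T, N) per-step decay factors

# Stability: every effective decay factor stays strictly inside (0, 1).
in_unit_interval = bool(np.all((decay > 0) & (decay < 1)))
# Smoothness: largest position-to-position change in any channel's factor.
max_step_change = float(np.abs(np.diff(decay, axis=0)).max())
```

A True `in_unit_interval` supports the stability claim, and a small `max_step_change` quantifies the time variation that a full gradient-flow analysis would need to bound.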
Referee: [Abstract] The claim that geometric spacing does not constrain expressivity (i.e., that models can still learn non-geometric decays) is unexamined; no theoretical bound or ablation shows that enforcing the spectrum leaves the original model's eigenvalue distribution and training dynamics unchanged.
Authors: The Spectral Reparameterization reparameterizes the decay rates to enforce geometric spacing on the log scale but preserves the full expressivity because the position-adaptive scaling and other model parameters (such as the input projections) allow the effective impulse response to adapt. Section 5 includes ablations on MQAR and language modeling tasks showing that PoST models achieve performance gains without restricting the ability to model various dependency lengths. The eigenvalue distribution is not fixed; the reparameterization maps the original parameters to a geometric basis, but training can still adjust the overall scale. We will add a theoretical note clarifying that the reparameterization is bijective within the stable regime, thus not constraining the representable functions, and expand the ablation to include eigenvalue distribution comparisons. revision: yes
Circularity Check
No circularity: claims rest on external analysis rather than self-definition or fitted inputs
Full rationale
The abstract asserts that Spectral Reparameterization is 'proven minimax optimal' at a stated rate and that Position-Adaptive Scaling is the 'provably unique' fix for scale mismatch, but supplies no equations, proof sketches, or self-citations that would allow the optimality or uniqueness to reduce to a redefinition of the proposed mechanisms themselves. No fitted parameter is relabeled as a prediction, no ansatz is smuggled via prior self-work, and no uniqueness theorem is imported from the authors' own earlier papers. The empirical pre-training results on Mamba-2, RWKV-7 and other models are presented as separate validation, not as the source of the theoretical rates. Because the provided text contains no load-bearing derivation step that collapses by construction to its own inputs, the circularity score is zero.