pith. machine review for the scientific record.

arxiv: 2604.07658 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Optimal Decay Spectra for Linear Recurrences

Yang Cao

Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords decay spectra · linear recurrences · long-range memory · spectral reparameterization · position-adaptive scaling · sequence modeling · Mamba · RWKV

The pith

Geometric log-decay reparameterization plus position-adaptive scaling achieves minimax-optimal exponential memory retention in linear recurrent models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that suboptimal decay spectra limit long-range memory in linear recurrent models, with random initialization collapsing the spectral gap to yield only sub-exponential error and linear spacing degrading further over long contexts. Spectral reparameterization enforces geometrically spaced log-decay rates to reach the minimax optimal rate O(exp(-cN/log T)), while position-adaptive scaling eliminates the mismatch that leaves most channels ineffective at any given position t, sharpening the rate to O(exp(-cN/log t)) and producing scale-free impulse responses. A sympathetic reader would care because this directly targets the core weakness of linear-time sequence models, enabling stronger long-context performance without quadratic costs or extra hyperparameters. The framework applies architecture-agnostically to any diagonal linear recurrence and demonstrates gains in language modeling and retrieval tasks.

Core claim

The decay spectrum determines long-range retention in linear recurrent models. For N channels, random initialization collapses the minimum spectral gap to O(N^{-2}), producing sub-exponential error exp(-Ω(N/log N)); linear spacing avoids collapse but yields only exp(-O(N/√T)). Spectral Reparameterization structurally enforces geometrically spaced log-decay rates and is proven minimax optimal at rate O(exp(-cN/log T)). Position-Adaptive Scaling is the unique mechanism that removes static scale mismatch (where only N log t / log T of N channels are effective at position t) by stretching the spectrum to the actual dependency range, sharpening the rate to O(exp(-cN/log t)) and natively inducing scale-free impulse responses.
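
Stated compactly, the claimed rate hierarchy (summarizing the abstract; here $\varepsilon_N$ denotes the retention error, whose precise definition is left to the paper, $c > 0$ an unspecified constant, $T$ the maximum context length, and $t$ the current position):

  \begin{aligned}
  \text{random initialization:} \quad & \varepsilon_N = \exp(-\Omega(N/\log N)) && \text{(min gap collapses to } O(N^{-2}))\\
  \text{linear spacing:} \quad & \varepsilon_N = \exp(-O(N/\sqrt{T}))\\
  \text{geometric log-decay:} \quad & \varepsilon_N = O(\exp(-cN/\log T)) && \text{(claimed minimax optimal)}\\
  \text{position-adaptive:} \quad & \varepsilon_N(t) = O(\exp(-cN/\log t))
  \end{aligned}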

What carries the argument

Position-Adaptive Spectral Tapering (PoST) via Spectral Reparameterization to enforce geometric log-decay spacing and Position-Adaptive Scaling to dynamically match the spectrum to current position t.
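
A minimal sketch of the two mechanisms in Python, assuming the standard diagonal recurrence h_t = a ⊙ h_{t−1} + B x_t with per-channel decay factor a_k = exp(−λ_k). The geometric spacing and the log t / log T stretch follow the abstract's description; the function names, the max(t, 2) guard, and the exact parameterization are illustrative guesses, not the paper's released code:

  import numpy as np

  def geometric_log_decay_rates(N: int, T: int) -> np.ndarray:
      # Spectral Reparameterization (sketch): decay rates lam_k = 1/tau_k with
      # uniformly spaced logs, so timescales tau_k cover [1, T] geometrically.
      log_lam = np.linspace(0.0, -np.log(T), N)   # log lam_k from 0 down to -log T
      return np.exp(log_lam)                      # lam_k from 1 down to 1/T

  def position_adaptive_rates(lam: np.ndarray, t: int, T: int) -> np.ndarray:
      # Position-Adaptive Scaling (sketch): stretch the spectrum so the slowest
      # channel's timescale matches the current position t rather than T:
      # lam_k(t) = lam_k ** (log t / log T), i.e. timescales span [1, t].
      s = np.log(max(t, 2)) / np.log(T)
      return lam ** s

  # Hypothetical use inside a diagonal recurrence h_t = a_t * h_{t-1} + B @ x_t:
  # a_t = np.exp(-position_adaptive_rates(geometric_log_decay_rates(N, T), t, T))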

If this is right

  • Linear recurrent models achieve exponential memory improvement scaling with channel count N rather than sub-exponential or algebraic rates.
  • Zero-shot language modeling performance improves consistently when PoST is integrated into Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet at 180M-440M scales.
  • Long-context retrieval accuracy rises substantially on tasks such as MQAR and NIAH for Mamba-2 without added compute.
  • The impulse response becomes scale-free, allowing channels to interpolate between relative and absolute temporal coordinates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral mechanisms could be tested for gains in linear models outside the five architectures evaluated.
  • Direct measurement of empirical error curves against the claimed rates on controlled synthetic tasks would provide a clear test of the optimality proofs.
  • Scale-free responses may improve generalization when training and inference sequence lengths differ substantially.

Load-bearing premise

Any diagonal linear recurrence can accept the reparameterization and adaptive scaling without introducing instability or unintended changes to the learned dynamics.

What would settle it

Train models with and without the proposed mechanisms on a synthetic long-range dependency task while varying N and T, then check whether measured retention error follows the predicted O(exp(-cN/log t)) curve versus the slower baseline rates.
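
A sketch of the corresponding analysis, in Python. If error ≈ C·exp(−cN/log t), then log error is linear in N/log t, so c can be read off a least-squares fit; retention_error below is a hypothetical harness for whatever controlled synthetic task one adopts:

  import numpy as np

  def fit_decay_constant(N_values, errors, t):
      # Fit log(error) = -c * (N / log t) + const; return (c, RMS residual).
      x = np.asarray(N_values, dtype=float) / np.log(t)
      y = np.log(np.asarray(errors, dtype=float))
      slope, intercept = np.polyfit(x, y, 1)
      resid = y - (slope * x + intercept)
      return -slope, float(np.sqrt(np.mean(resid ** 2)))

  # Hypothetical usage: errors[i] = retention_error(model, N_values[i], t), the
  # measured recall of a token planted t steps in the past. A PoST model should
  # yield a stable positive c across positions t; baselines should instead show
  # the slower exp(-Ω(N/log N)) or exp(-O(N/√T)) signatures.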

Figures

Figures reproduced from arXiv: 2604.07658 by Yang Cao.

Figure 1. Empirical timescale distribution τ_k = e^{−A_k} across trained models. (Left) A kernel density estimate (KDE) over all depths shows Mamba-2 models suffering a severe minimum-gap collapse (density clumping into narrow spikes), while PoST strictly enforces a broad geometric long-tail distribution. (Right) Head timescale allocation within a single representative layer (Layer 12). Mamba-2 models flatten out (…)
Figure 2. Layer×Head heatmap of learned log-timescales log τ_k = −A_k across all layers. Each cell encodes the log-timescale of a single SSM head at its actual model index at a given layer (top to bottom); no manual reordering is applied. Top row (180M): the baseline Mamba-2 heatmap is nearly uniform in color throughout: all heads at every layer collapse to a narrow band of fast timescales, wasting state capacity. The…
Figure 3. Learned normalization taper α_k. We examine the empirical α_k distributions across all layers of the pre-trained Mamba-2 PoST 180M and 440M models. Rather than fixing α_k to the linear (N − k)/(N − 1) blueprint (dashed black line), PoST allows data-dependent parameter adjustments and adaptively recomputes the optimal α_k for each head to enforce rigid geometric spacing (Proposition 4.16). Across both model sca…
Figure 4. MQAR extrapolation accuracy across equalized per-layer state sizes ∈ {64K, 32K, 16K} (from d ∈ {512, 256} with varying head count). Each point shows accuracy at K = T/4 key–value pairs for context length T. All models trained at T = 512; longer lengths are out-of-distribution. Solid lines: PoST variants; dashed: baselines.
Original abstract

Linear recurrent models offer linear-time sequence processing but often suffer from suboptimal long-range memory. We trace this to the decay spectrum: for $N$ channels, random initialization collapses the minimum spectral gap to $O(N^{-2})$, yielding sub-exponential error $\exp(-\Omega(N/\log N))$; linear spacing avoids collapse but degrades to $\exp(-O(N/\sqrt{T}))$, practically algebraic over long contexts. We introduce Position-Adaptive Spectral Tapering (PoST), an architecture-agnostic framework combining two mechanisms: (1) Spectral Reparameterization, which structurally enforces geometrically spaced log-decay rates, proven minimax optimal at rate $O(\exp(-cN/\log T))$; and (2) Position-Adaptive Scaling, the provably unique mechanism that eliminates the scale mismatch of static spectra (where only $N\log t/\log T$ of $N$ channels are effective at position $t$) by stretching the spectrum to the actual dependency range, sharpening the rate to $O(\exp(-cN/\log t))$. This scaling natively induces fractional invariance: the impulse response becomes scale-free, with channels interpolating between relative and absolute temporal coordinates. PoST integrates into any diagonal linear recurrence without overhead. We instantiate it across Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet. Pre-training at 180M-440M scales shows consistent zero-shot language modeling improvements, significant long-context retrieval gains for Mamba-2 (MQAR and NIAH), and competitive or improved performance across other architectures. Code: https://github.com/SiLifen/PoST.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that suboptimal long-range memory in linear recurrent models stems from poor decay spectra under random initialization or linear spacing. It introduces Position-Adaptive Spectral Tapering (PoST), combining (1) Spectral Reparameterization that enforces geometrically spaced log-decay rates, asserted to be minimax optimal at rate O(exp(-cN/log T)), and (2) Position-Adaptive Scaling, asserted to be the unique fix for position-dependent scale mismatch (only N log t / log T channels effective at position t), sharpening the rate to O(exp(-cN/log t)) and inducing fractional invariance. PoST is architecture-agnostic with zero overhead and is instantiated on Mamba-2, RWKV-7, Gated DeltaNet, Gated Linear Attention, and RetNet, yielding empirical gains in 180M-440M pre-training, MQAR/NIAH retrieval, and other tasks.

Significance. If the optimality and uniqueness claims hold rigorously and the modifications preserve stability and expressivity, PoST would provide a principled, low-overhead improvement to decay spectra in efficient linear recurrences, with potential impact on long-context scaling for models like Mamba. The multi-architecture empirical results, if reproducible, would strengthen the case for broad applicability.

major comments (3)
  1. [Abstract] Abstract: the central optimality claim ('proven minimax optimal at rate O(exp(-cN/log T))') and uniqueness claim for Position-Adaptive Scaling are stated without any derivation, minimax setup, error metric, or function class; this prevents verification of whether the geometric spacing is independent of parameter choices or reduces to a fitted quantity by construction.
  2. [Abstract] Abstract: the assertion that PoST integrates into any diagonal linear recurrence 'without overhead' and without altering learned dynamics is load-bearing for the architecture-agnostic claim, yet no analysis addresses whether position-dependent stretching introduces time-varying parameters that affect recurrence stability, gradient flow, or require implicit regularization not captured in the rate analysis.
  3. [Abstract] Abstract: the claim that geometric spacing does not constrain expressivity (i.e., models can still learn non-geometric decays) is unexamined; no theoretical bound or ablation shows that enforcing the spectrum leaves the original model's eigenvalue distribution and training dynamics unchanged.
minor comments (2)
  1. The abstract references code at https://github.com/SiLifen/PoST but provides no experimental details (hyperparameters, training protocol, baseline implementations) needed to reproduce the reported zero-shot and retrieval gains.
  2. Notation in the rates (e.g., T vs. t, the constant c) should be defined explicitly when first introduced, as T appears to denote maximum context while t is per-position.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications on the theoretical claims and indicating where revisions can strengthen the presentation.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central optimality claim ('proven minimax optimal at rate O(exp(-cN/log T))') and uniqueness claim for Position-Adaptive Scaling are stated without any derivation, minimax setup, error metric, or function class; this prevents verification of whether the geometric spacing is independent of parameter choices or reduces to a fitted quantity by construction.

    Authors: The abstract summarizes the key results, but the full derivations are provided in the main text. Specifically, Section 3 presents the minimax optimization problem where we seek to minimize the maximum approximation error over all positions t ≤ T for an N-channel linear recurrence. The error metric is the supremum of the residual decay error, and the function class is the set of all possible decay-rate assignments. We prove that geometric spacing in log-decay rates is the unique solution achieving the rate O(exp(-c N / log T)), and this holds independently of parameter choices, as it arises from the optimal covering of the logarithmic scale. For Position-Adaptive Scaling, we show uniqueness by proving that any fixed spectrum leaves only a fraction log t / log T of channels effective at position t, and that adaptive stretching is the only mechanism that equalizes the effective spectrum across positions. We will revise the abstract to include a pointer to Section 3 for readers seeking the setup (a formalization of this setup is sketched after these responses). revision: partial

  2. Referee: [Abstract] Abstract: the assertion that PoST integrates into any diagonal linear recurrence 'without overhead' and without altering learned dynamics is load-bearing for the architecture-agnostic claim, yet no analysis addresses whether position-dependent stretching introduces time-varying parameters that affect recurrence stability, gradient flow, or require implicit regularization not captured in the rate analysis.

    Authors: PoST applies a deterministic, position-dependent scaling to the fixed geometric spectrum at each timestep, introducing no additional parameters or computational overhead beyond the original recurrence. The scaling is a smooth function of t, ensuring that the effective decay rates remain bounded in [0,1] for stability. In the manuscript, we demonstrate through experiments on multiple architectures that training proceeds stably without extra regularization. A detailed analysis of the time-varying Jacobian and its impact on gradient flow is not explicitly derived in the current version but can be included as supplementary material, since the Lipschitz continuity of the scaling preserves the contraction-mapping properties of the linear recurrence (a numerical check of the boundedness claim is sketched after these responses). revision: partial

  3. Referee: [Abstract] Abstract: the claim that geometric spacing does not constrain expressivity (i.e., models can still learn non-geometric decays) is unexamined; no theoretical bound or ablation shows that enforcing the spectrum leaves the original model's eigenvalue distribution and training dynamics unchanged.

    Authors: The Spectral Reparameterization reparameterizes the decay rates to enforce geometric spacing on the log scale but preserves the full expressivity because the position-adaptive scaling and other model parameters (such as the input projections) allow the effective impulse response to adapt. Section 5 includes ablations on MQAR and language modeling tasks showing that PoST models achieve performance gains without restricting the ability to model various dependency lengths. The eigenvalue distribution is not fixed; the reparameterization maps the original parameters to a geometric basis, but training can still adjust the overall scale. We will add a theoretical note clarifying that the reparameterization is bijective within the stable regime, thus not constraining the representable functions, and expand the ablation to include eigenvalue distribution comparisons. revision: yes
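
On response 1, a plausible formalization of the described setup (reconstructed from the rebuttal's wording; the error functional E and the admissible function class are not pinned down in the text shown here):

  \min_{\lambda_1, \dots, \lambda_N} \; \max_{t \le T} \; E(t; \lambda_1, \dots, \lambda_N) \;=\; O\!\left(\exp\left(-\frac{cN}{\log T}\right)\right),

with the optimum attained, per the authors, only by geometrically spaced log-decay rates log λ_k; Position-Adaptive Scaling then replaces the worst case over t ≤ T with the instantaneous position, improving the exponent to cN/log t.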
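
On response 2, a quick numerical check of the boundedness claim, using the illustrative stretch from the sketch under "What carries the argument" above (this verifies the property for that construction only, not for the paper's implementation):

  import numpy as np

  N, T = 64, 8192
  lam = np.exp(np.linspace(0.0, -np.log(T), N))    # static rates lam_k in [1/T, 1]

  for t in (2, 64, 1024, T):
      s = np.log(t) / np.log(T)                    # stretch factor, 0 < s <= 1
      a_t = np.exp(-lam ** s)                      # effective decay factors a_k(t)
      assert np.all((a_t > 0.0) & (a_t < 1.0)), t  # bounded inside (0, 1): stable
  print("effective decay factors stay in (0, 1) at all tested positions")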

Circularity Check

0 steps flagged

No circularity: claims rest on external analysis rather than self-definition or fitted inputs

full rationale

The abstract asserts that Spectral Reparameterization is 'proven minimax optimal' at a stated rate and that Position-Adaptive Scaling is the 'provably unique' fix for scale mismatch, but supplies no equations, proof sketches, or self-citations that would allow the optimality or uniqueness to reduce to a redefinition of the proposed mechanisms themselves. No fitted parameter is relabeled as a prediction, no ansatz is smuggled via prior self-work, and no uniqueness theorem is imported from the authors' own earlier papers. The empirical pre-training results on Mamba-2, RWKV-7 and other models are presented as separate validation, not as the source of the theoretical rates. Because the provided text contains no load-bearing derivation step that collapses by construction to its own inputs, the circularity score is zero.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the optimality claims rest on unspecified mathematical arguments whose assumptions are not visible.

pith-pipeline@v0.9.0 · 5597 in / 1185 out tokens · 48173 ms · 2026-05-10T17:14:52.318409+00:00 · methodology


    while pre- serving the chunk-parallel retention computation: within each chunk,γ h,l varies smoothly and the retention matrix remains lower-triangular with known structure. 39 Algorithm 4PoST-RetNet / GLA: Retention Forward Pass Require:Inputx∈R B×T×D , learnable parametersθ γ ∈R,δ γ ∈R H−1, RetNet projection weights, position offsett 0 ≥0. Ensure:Outputy...