pith. machine review for the scientific record.

arxiv: 2605.08966 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

VORT: Adaptive Power-Law Memory for NLP Transformers

Nabil Mlaiki

Pith reviewed 2026-05-12 01:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords transformers · long-range dependencies · power-law memory · fractional calculus · sum-of-exponentials · attention mechanisms · retention kernels

The pith

VORT lets each token learn its own power-law memory decay rate inside a transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard transformers force distant tokens to lose influence exponentially, yet natural language shows power-law decay in how far back context still matters. VORT gives every token a learnable fractional order alpha that sets a power-law retention kernel based on the Grünwald-Letnikov definition. The kernel is turned into a short sum of exponentials by Gauss-Laguerre quadrature so that memory updates stay cheap and Markovian. Theory shows the approximation error drops geometrically with more terms and that any fixed-decay mixture leaves growing L2 error on long sequences. Synthetic tests confirm better performance on retrieval and copying tasks that require long, power-law lags.
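
To make the decay gap concrete, here is a small numpy sketch (not code from the paper) that computes the standard Grünwald-Letnikov binomial weights by their usual recurrence and checks the asymptotics w_j^(α) ∼ j^(α−1)/Γ(α) that Figure 4 cites:

```python
from math import gamma

def gl_weights(alpha, n):
    """Grünwald-Letnikov retention weights g_j = binom(j + alpha - 1, j),
    computed by the standard recurrence g_j = g_{j-1} * (j - 1 + alpha) / j."""
    g = [1.0]
    for j in range(1, n):
        g.append(g[-1] * (j - 1 + alpha) / j)
    return g

alpha = 0.7
g = gl_weights(alpha, 100_001)
for j in (10, 100, 1_000, 10_000, 100_000):
    # the weights decay algebraically, tracking j**(alpha - 1) / Gamma(alpha)
    print(f"{j:>6}  {g[j]:.6f}  {j ** (alpha - 1) / gamma(alpha):.6f}")
```

At lag 10^5 the retention weight is still about 0.02, whereas the exponential baseline e^(−0.025j) from Figure 4 is far below machine precision there; that is the cliff-drop the summary refers to.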

Core claim

The Variable-Order Retention Transformer assigns each token a learnable fractional order alpha in [delta,1] that defines a Grünwald-Letnikov power-law retention kernel. This non-Markovian sum is replaced by a sum-of-exponentials approximation obtained from Gauss-Laguerre quadrature on the Laplace integral of the kernel, yielding an O(S d_v) Markovian recurrence per step with S logarithmic in horizon length. An SOE convergence theorem, a quantization bound on [delta,1], an L2 lower bound showing unbounded error for any fixed-minimum-decay mixture when alpha exceeds 1/2, and linear convergence of a gradient plasticity rule under the Polyak-Lojasiewicz condition are all proved.
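
The abstract specifies only that α_i is learnable and lives in [δ,1]; the mapping from token to order is not given in the material above. A minimal sketch, assuming a sigmoid-squashed linear readout of the token embedding (the readout vector and the squashing are hypothetical, for illustration only):

```python
import numpy as np

def token_alphas(X, w, delta=0.05):
    """Map token embeddings X (T x d) to fractional orders alpha_i in [delta, 1]
    via a squashed linear readout -- one plausible, assumed parameterization."""
    z = X @ w                                           # (T,) raw per-token scores
    return delta + (1.0 - delta) / (1.0 + np.exp(-z))   # sigmoid into [delta, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 32))           # T = 8 token embeddings, d = 32
w = rng.normal(size=32) / np.sqrt(32)
print(token_alphas(X, w).round(3))     # one learnable order per token
```

Any smooth map into [δ,1] would serve; the floor α_i ≥ δ is what keeps the quantization bound and the numerics near α = 0 under control.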

What carries the argument

The sum-of-exponentials approximation of the Grünwald-Letnikov power-law kernel, obtained by Gauss-Laguerre quadrature on its Laplace integral representation, converting the fractional retention into parallel linear recurrences.
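
As a rough check on this step, the sketch below discretizes the Laplace representation k_α(t) = (sin(πα)/π) ∫₀^∞ s^(−α) e^(−st) ds of the kernel on a truncated log grid with a plain trapezoid rule — a simpler stand-in for the paper's Gauss-Laguerre construction, whose exact change of variables is not reproduced here:

```python
import numpy as np
from math import gamma, pi, sin

def soe_kernel(alpha, T, S):
    """Approximate k_alpha(t) = t**(alpha - 1) / Gamma(alpha) on [1, T] by S
    exponentials, via trapezoid quadrature of the Laplace integral
    k_alpha(t) = sin(pi * alpha) / pi * int_0^inf s**(-alpha) exp(-s t) ds
    on a log grid in s (the truncation limits here are ad hoc)."""
    u = np.linspace(-np.log(T) - 30.0, 4.0, S)   # log-frequency grid
    h = u[1] - u[0]
    rates = np.exp(u)                            # shared decay rates s_j, alpha-free
    weights = sin(pi * alpha) / pi * h * np.exp((1.0 - alpha) * u)
    return rates, weights

alpha, T, S = 0.7, 10_000, 72
rates, w = soe_kernel(alpha, T, S)
t = np.logspace(0.0, np.log10(T), 400)
approx = (w[None, :] * np.exp(-np.outer(t, rates))).sum(axis=1)
exact = t ** (alpha - 1) / gamma(alpha)
print(f"max relative error on [1, {T}]: {np.max(np.abs(approx / exact - 1.0)):.1e}")
```

Even this crude rule holds the relative error near 10^(−4) across four decades with a few dozen terms, and the decay rates depend only on the grid, never on α — the separation the authors' rebuttal below relies on.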

If this is right

  • Memory over any horizon T can be kept to uniform epsilon error with only O(log(T/epsilon)) exponential terms.
  • Any mixture of fixed exponential decays incurs L2 error on [1,T] that grows without bound relative to a true power-law kernel when the order exceeds 1/2.
  • Keyed associative retrieval remains possible with an exact linear-attention accumulator at O(K S d_phi d_v) cost per step (a minimal sketch follows this list).
  • A simple gradient plasticity rule converges linearly whenever the training loss satisfies the Polyak-Lojasiewicz condition.
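
A minimal sketch of the third point, with K = 1: one matrix-valued accumulator per exponential term, updated in O(S d_phi d_v) work per step. The feature map φ, the decay rates, and the injection weights w_j(α) below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, S, d_phi, d_v = 100, 8, 6, 4
rates = np.exp(np.linspace(-5.0, 1.0, S))   # shared decay rates s_j (illustrative)

def phi(x):
    """A positive feature map, as in common linear-attention variants."""
    return np.exp(x - x.max())

def w(alpha):
    """Stand-in injection weights w_j(alpha) modulating the shared bases."""
    return rates ** (1.0 - alpha)

# One matrix accumulator per exponential term:
#   M_j <- exp(-s_j) * M_j + w_j(alpha_t) * phi(k_t) v_t^T
M = np.zeros((S, d_phi, d_v))
for t in range(T):
    k_t, v_t = rng.normal(size=d_phi), rng.normal(size=d_v)
    alpha_t = rng.uniform(0.3, 1.0)         # this token's fractional order
    M = np.exp(-rates)[:, None, None] * M \
        + w(alpha_t)[:, None, None] * np.outer(phi(k_t), v_t)

q = rng.normal(size=d_phi)
readout = phi(q) @ M.sum(axis=0)            # keyed retrieval with power-law retention
print(readout.shape)                        # (4,)
```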

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The per-token alpha values could let the model automatically allocate longer memory to some tokens and shorter memory to others, something fixed-kernel models cannot do.
  • The same SOE technique might be applied to other sequence models that currently rely on exponential decay, such as state-space or recurrent architectures.
  • If the learned alphas on real text cluster around particular values, that distribution itself could become a diagnostic for the typical range of long-range dependencies in language.
  • Replacing the synthetic Zipf and lag-copy tasks with real long-document modeling would test whether the power-law advantage survives the noise of actual data.

Load-bearing premise

Power-law long-range dependencies are the dominant structure that ordinary transformers miss, and the fixed-term sum-of-exponentials approximation remains accurate enough throughout end-to-end training on real data.

What would settle it

Run VORT on a long-document question-answering benchmark; if accuracy does not exceed a standard transformer baseline or if the learned alpha values all cluster near 1 instead of spreading through the interval, the claimed advantage is refuted.

Figures

Figures reproduced from arXiv: 2605.08966 by Nabil Mlaiki.

Figure 1. Evolution of multi-bank memory. (a) The bank-state norm ∥M_t^{(α)}∥ scales as O(t^α) (Proposition 4.4): high-α banks (back) grow fast while low-α banks (front) stay compressed. (b) Retrieval accuracy landscape: simulated accuracy of the keyed retrieval (13) as a function of log10(distance) and fractional order α. High α preserves accuracy at large distances, while at low α the kernel becomes aggres…

Figure 2. (a) Extraction of heavy-tailed dependencies. Power-law kernels (α ∈ {0.3, 0.5, 0.7, 0.9}, solid) versus three exponential baselines (dashed), showing normalized retention weight against distance. Power-law curves decline algebraically on log-log axes, with non-negligible weight out to distances 10^3–10^5, whereas exponential curves cliff-drop beyond their characteristic horizon 1/λ. The behaviour is analytically captu…

Figure 3. (a) Anomalous diffusion in memory space. Trajectory of a fractional Brownian motion (Hurst exponent H = 0.8) in the three-dimensional space spanned by the bank-state components (Mx, My, Mz), coloured by time step. The long-range correlated drift is a manifestation of the non-Markovian, heavy-tailed nature of the GL kernel (2): memory states drift persistently, as opposed to memoryless random walks, consistent wi…

Figure 4. (a) GL weights on log-log axes: w_j^{(α)} ∼ j^{α−1}/Γ(α) for α ∈ {0.3, 0.5, 0.7, 0.9} (solid lines) versus the exponential baseline e^{−0.025j} (dashed). Each power-law sequence is linear with slope α−1 ∈ (−1, 0) on log-log axes, while the exponential baseline curves sharply downward past lag ≈ 100. This difference is the empirical motivation for the design of the VORT kernel and is formalized by the asymptotics (3)…

Figure 5. Distribution of learned fractional orders.
Original abstract

Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the Variable-Order Retention Transformer (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order α_i ∈ [δ,1] that governs a Grünwald–Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss–Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at O(S d_v) per step, where S = O(log(T/ε)) terms suffice for ε-uniform accuracy on horizon [1,T]. Retrieval is keyed and associative via a linear-attention accumulator with an exact O(K S d_φ d_v)-per-step recurrence. Four results are established: (i) an SOE approximation theorem with geometric convergence rate from the analyticity of the integrand after a log-change of variables; (ii) a quantisation bound valid on [δ,1] with correct analysis near α = 0; (iii) a direct L² energy argument (Proposition) showing that for α > 1/2 any mixture with fixed minimum decay rate Λ > 0 incurs L²([1,T]) error at least N_α(T) − C(Λ) → ∞, with the Λ-dependence made explicit; and (iv) linear convergence of a gradient plasticity rule under the Polyak–Łojasiewicz condition. Two synthetic experiments confirm the architectural advantage: a Zipf-distributed retrieval benchmark and an entity label-copy task with uniform lag distribution, the latter ruling out prior-matching as an explanation for the power-law kernel's advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Variable-Order Retention Transformer (VORT), which assigns each ingested token a learnable fractional order α_i ∈ [δ,1] to define a Grünwald–Letnikov power-law retention kernel. The non-Markovian weighted sum is approximated by a sum-of-exponentials (SOE) decomposition via Gauss–Laguerre quadrature on a Laplace integral representation, yielding an O(S d_v) per-step Markovian recurrence with S = O(log(T/ε)) for ε-uniform accuracy on [1,T]. Retrieval uses a keyed linear-attention accumulator. Four results are proved: (i) an SOE approximation theorem with geometric convergence from analyticity after log-change of variables; (ii) a quantization bound on [δ,1] with analysis near α=0; (iii) an L² energy lower bound showing that any fixed-minimum-decay mixture incurs unbounded L²([1,T]) error for α>1/2; and (iv) linear convergence of a gradient plasticity rule under the Polyak–Łojasiewicz condition. Two synthetic experiments (Zipf retrieval benchmark and uniform-lag entity label-copy) confirm the architectural advantage.

Significance. If the central claims hold, the work supplies a mathematically grounded mechanism for adaptive power-law memory in transformers, directly addressing the exponential-decay limitation of standard attention. Strengths include the SOE approximation theorem with explicit geometric rate, the explicit L² lower-bound argument with Λ-dependence, the Polyak–Łojasiewicz convergence result, and synthetic experiments that test the hypothesized advantage rather than fitting to a pre-specified target. These elements could influence long-context NLP architectures if the recurrence efficiency is clarified.

major comments (1)
  1. Abstract (recurrence claim): The stated O(S d_v) per-step recurrence with fixed S = O(log(T/ε)) appears incompatible with per-token learnable α_i. Each token’s distinct Grünwald–Letnikov kernel produces its own set of S exponential components; maintaining and decaying these independently causes the state dimension to grow linearly with sequence length, contradicting the fixed-size Markovian claim. If α is instead shared (e.g., per head), the phrasing “each ingested token” is inaccurate and the per-token adaptivity central to the power-law motivation disappears. This issue is load-bearing for the architectural efficiency and the variable-order retention motivation.
minor comments (2)
  1. The abstract mentions four theorems and two synthetic experiments but provides no error-bar details, data-exclusion rules, or explicit statements of the Polyak–Łojasiewicz parameters; these should be added for reproducibility.
  2. Notation for the SOE state update and the exact form of the keyed linear-attention accumulator recurrence should be clarified with a small diagram or pseudocode.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the potential inconsistency between the per-token α_i and the claimed fixed-size Markovian recurrence. The comment is well-taken and points to an ambiguity in the abstract's phrasing. We clarify the architecture below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract (recurrence claim): The stated O(S d_v) per-step recurrence with fixed S = O(log(T/ε)) appears incompatible with per-token learnable α_i. Each token’s distinct Grünwald–Letnikov kernel produces its own set of S exponential components; maintaining and decaying these independently causes the state dimension to grow linearly with sequence length, contradicting the fixed-size Markovian claim. If α is instead shared (e.g., per head), the phrasing “each ingested token” is inaccurate and the per-token adaptivity central to the power-law motivation disappears. This issue is load-bearing for the architectural efficiency and the variable-order retention motivation.

    Authors: We thank the referee for identifying this important point. The apparent incompatibility stems from the assumption that each α_i requires its own distinct set of exponential bases. In the VORT construction, the SOE approximation is obtained by Gauss–Laguerre quadrature on the Laplace integral representation of the kernel after a log-change of variables (see Section 3 and Theorem 1). This yields a fixed collection of quadrature nodes s_j (hence fixed decay rates) that do not depend on α. The α-dependence is confined to the quadrature weights w_j(α), which are evaluated per token at ingestion time. The memory state is therefore realized by S parallel recurrences state_j^{(t)} = exp(−s_j) · state_j^{(t−1)} + w_j(α_t) · v_t. The state dimension remains O(S d_v) independently of sequence length and of how many distinct α_i appear. Per-token adaptivity is fully preserved because each token’s α_i directly modulates the injection coefficients into the shared bases. We will revise the abstract to read “each ingested token is assigned a learnable fractional order α_i that modulates the coefficients of a shared sum-of-exponentials approximation” and will add an analogous clarifying sentence in the method section. This preserves both the efficiency claim and the per-token motivation.

    revision: yes
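
The clarified recurrence is easy to sanity-check numerically. The sketch below uses stand-in rates and weights (not the paper's quadrature values) and confirms the algebra of the response: a fixed O(S d_v) state driven by per-token α_t reproduces the full non-Markovian SOE-weighted sum exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
T, S, d_v = 200, 16, 4
rates = np.exp(np.linspace(-6.0, 2.0, S))   # shared, alpha-free decay rates s_j
alphas = rng.uniform(0.3, 1.0, size=T)      # per-token fractional orders
V = rng.normal(size=(T, d_v))               # token values

def w(alpha):
    """Stand-in for the quadrature weights w_j(alpha)."""
    return rates ** (1.0 - alpha)

# Markovian path: state_j <- exp(-s_j) * state_j + w_j(alpha_t) * v_t
state = np.zeros((S, d_v))
rec = np.empty((T, d_v))
for t in range(T):
    state = np.exp(-rates)[:, None] * state + w(alphas[t])[:, None] * V[t]
    rec[t] = state.sum(axis=0)

# Non-Markovian path: y_t = sum_{tau <= t} [sum_j w_j(alpha_tau) e^{-s_j (t - tau)}] v_tau
direct = np.zeros((T, d_v))
for t in range(T):
    for tau in range(t + 1):
        k = (w(alphas[tau]) * np.exp(-rates * (t - tau))).sum()
        direct[t] += k * V[tau]

print(np.max(np.abs(rec - direct)))         # agrees to machine precision
```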

Circularity Check

0 steps flagged

No significant circularity; theorems rely on independent analytic and energy arguments

Full rationale

The four established results—an SOE approximation theorem via analyticity after log-change of variables, a quantization bound on [δ,1], an L² energy lower bound for mixtures with fixed decay, and linear convergence under Polyak-Łojasiewicz—are derived from standard quadrature, direct energy estimates, and classical optimization theory. These steps do not reduce by construction to the architectural claims or to self-citations; the synthetic experiments test the hypothesized advantage on Zipf and lag-copy tasks without pre-fitting targets. The per-token α_i recurrence description, while potentially inconsistent with fixed-state O(S d_v) complexity, is an implementation claim rather than a load-bearing derivation that collapses into its own inputs.

Axiom & Free-Parameter Ledger

3 free parameters · 3 axioms · 1 invented entity

The central claim rests on standard fractional calculus definitions and quadrature rules plus the new per-token alpha parameters and the architectural choice of power-law kernels; no new physical entities are postulated.

free parameters (3)
  • α_i
    Learnable fractional order assigned to each token, chosen during training to fit the data.
  • S = O(log(T/ε))
    Number of exponential terms in the SOE approximation, set to O(log(T/ε)) for target accuracy.
  • δ
    Lower bound on α_i to avoid numerical issues near zero.
axioms (3)
  • standard math Grünwald-Letnikov definition of the fractional integral for the retention kernel.
    Invoked to define the power-law memory update.
  • standard math Existence of a Laplace-type integral representation that permits Gauss-Laguerre quadrature after log substitution.
    Used to obtain the sum-of-exponentials decomposition.
  • domain assumption Polyak-Łojasiewicz condition for the loss landscape.
    Assumed to prove linear convergence of the gradient rule.
invented entities (1)
  • Variable-Order Retention mechanism · no independent evidence
    purpose: To implement adaptive power-law memory inside the transformer stack.
    New architectural component introduced by the paper.

pith-pipeline@v0.9.0 · 5645 in / 1827 out tokens · 57659 ms · 2026-05-12T01:55:55.600211+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 9 internal anchors

  1. [1] G. Altmann, E. G. Altmann, and M. Gerlach. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS ONE, 4(11):e7678, 2009.
  2. [2] R. Anil et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  3. [3] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  4. [4] J. Beran. Statistics for Long-Memory Processes. Chapman & Hall, New York, 1994.
  5. [5] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods, 2nd edition. Springer-Verlag, New York, 1991.
  6. [6] T. B. Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
  7. [7] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  8. [8] K. Choromanski et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021.
  9. [9] A. Chowdhery et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  10. [10] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos, 13(1):25–54, 2003.
  11. [11] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022.
  12. [12] W. Ebeling and G. Nicolis. Word frequency and entropy of symbolic sequences. Chaos, Solitons & Fractals, 2(6):635–650, 1992.
  13. [13] A. A. Gonchar and E. A. Rakhmanov. Equilibrium distributions and the rate of rational approximation of analytic functions. Mathematics of the USSR-Sbornik, 62(2):305–348, 1987.
  14. [14] C. W. J. Granger and R. Joyeux. An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1):15–29, 1980.
  15. [15] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  16. [16] A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.
  17. [17] E. Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer-Verlag, Berlin, 1988.
  18. [18] C. Hooper et al. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
  19. [19] J. R. M. Hosking. Fractional differencing. Biometrika, 68(1):165–176, 1981.
  20. [20] C.-Y. Hsieh et al. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
  21. [21] J. Jumper et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021.
  22. [22] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th ICML, pages 5156–5165. PMLR, 2020.
  23. [23] D. Keysers et al. Measuring compositional generalization. In ICLR, 2020.
  24. [24] B. M. Lake and M. Baroni. Generalization without systematicity. In Proceedings of the 35th ICML, pages 2879–2888. PMLR, 2018.
  25. [25] Y. Li et al. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.
  26. [26] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  27. [27] C. Lubich. Discretized fractional calculus. SIAM Journal on Mathematical Analysis, 17(3):704–719, 1986.
  28. [28] K. S. Miller and B. Ross. An Introduction to the Fractional Calculus and Fractional Differential Equations. John Wiley & Sons, New York, 1993.
  29. [29] C. A. Monje, Y. Chen, B. M. Vinagre, D. Xue, and V. Feliu. Fractional-Order Systems and Controls. Springer, London, 2010.
  30. [30] K. B. Oldham and J. Spanier. The Fractional Calculus. Academic Press, New York, 1974.
  31. [31] A. Orvieto et al. Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th ICML, pages 26670–26698. PMLR, 2023.
  32. [32] B. Peng et al. RWKV: Reinventing RNNs for the transformer era. In Findings of EMNLP 2023, pages 14048–14077. ACL, 2023.
  33. [33] I. Podlubny. Fractional Differential Equations. Academic Press, San Diego, 1999.
  34. [34] O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In ICLR, 2022.
  35. [35] A. Radford et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th ICML, pages 8748–8763. PMLR, 2021.
  36. [36] J. W. Rae et al. Compressive transformers for long-range sequence modelling. In ICLR, 2020.
  37. [37] C. Raffel et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  38. [38] S. G. Samko, A. A. Kilbas, and O. I. Marichev. Fractional Integrals and Derivatives: Theory and Applications. Gordon and Breach, Yverdon, 1993.
  39. [39] P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. In Proceedings of NAACL-HLT 2018, pages 464–468. ACL, 2018.
  40. [40] I. M. Sokolov and J. Klafter. From diffusion to anomalous diffusion: A century after Einstein's Brownian motion. Chaos, 15(2):026103, 2005.
  41. [41] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  42. [42] H. Touvron et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  43. [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008, 2017.
  44. [44] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
  45. [45] J. Welbl, P. Stenetorp, and S. Riedel. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the ACL, 6:287–302, 2018.
  46. [46] J. Weston et al. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
  47. [47] N. Wiener. Extrapolation, Interpolation and Smoothing of Stationary Time Series. MIT Press, Cambridge, MA, 1949.
  48. [48] S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
  49. [49] M. Zaheer et al. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, volume 33, pages 17283–17297, 2020.
  50. [50] Z. Zhang et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
  51. [51] G. K. Zipf. The Psycho-Biology of Language. Houghton Mifflin, Boston, MA, 1935.