Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
Pith reviewed 2026-05-13 03:36 UTC · model grok-4.3
The pith
Normalizing each write vector to unit length makes the linear attention Jacobian spectral norm exactly 1 for every sequence length and head dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. Normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly 1 for all sequence lengths and head dimensions (Proposition 2), and the state norm is self-limiting under bounded inputs (Proposition 1).
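For concreteness, here is a LaTeX reconstruction of the penalised objective behind that reframing, assembled from the abstract and the appendix fragment of the paper. The symbols $\hat{k}_s$, $u_s$, and $\lambda_0$ follow that fragment, and the matrix written $M_t$ there appears to be the adaptive penalty called $P_t$ elsewhere in this review; how $u_s$ relates to the normalised write direction is not spelled out, so treat this as a sketch rather than the paper's verbatim statement:

$$S_t^{*} = \arg\min_{S}\; \sum_{s=1}^{t} \bigl\lVert v_s - S\hat{k}_s \bigr\rVert^{2} \;+\; \operatorname{tr}\!\bigl(S M_t S^{\top}\bigr), \qquad M_t = \lambda_0 I + \sum_{s=1}^{t} u_s u_s^{\top}, \qquad \lVert \hat{k}_s \rVert = 1 .$$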
What carries the argument
Adaptive penalty matrix updated by Sherman-Morrison rank-1 formula, paired with explicit unit-length normalization of each write vector, which together force the Jacobian spectral norm to exactly 1.
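A minimal Python sketch of that mechanism, assuming a delta-rule-style write along a unit-normalised direction and a Sherman-Morrison update of the inverse penalty matrix; the function name, the exact write rule, and the way the penalty feeds the write are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def vla_style_step(S, P_inv, k, v):
    """One memory write in the spirit of VLA (illustrative sketch, not the paper's code).

    S     : (d_v, d_k) memory state
    P_inv : (d_k, d_k) inverse of the adaptive penalty matrix P_t
    k, v  : key and value vectors for the current token
    """
    k_hat = k / np.linalg.norm(k)                     # unit-length write direction
    # Sherman-Morrison rank-1 update of P^{-1} for P <- P + k_hat k_hat^T
    Pk = P_inv @ k_hat
    P_inv = P_inv - np.outer(Pk, Pk) / (1.0 + k_hat @ Pk)
    # Delta-rule-style write of the prediction residual along the normalised direction
    S = S + np.outer(v - S @ k_hat, k_hat)
    return S, P_inv
```

With a unit-norm write direction, the state-to-state Jacobian of this particular write is right-multiplication by $(I - \hat{k}\hat{k}^{\top})$, whose spectral norm is exactly 1 whenever the head dimension is at least 2; Proposition 2 makes the analogous claim for the full VLA recurrence.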
If this is right
- The Frobenius norm of the memory state is 109 times smaller at sequence length 1,000 than under standard linear attention.
- Multi-query associative recall reaches near-perfect exact-match accuracy when the number of pairs is smaller than the head dimension.
- Retrieval performance remains substantially higher than DeltaNet and standard linear attention as memory load increases.
- Accuracy holds at 62 percent even when the number of associations reaches the per-head capacity boundary.
- A Triton-fused kernel delivers a 14-times speedup over sequential Python and undercuts softmax attention latency past roughly 43,000 tokens.
Where Pith is reading between the lines
- The same unit-norm write rule could be tested in other linear recurrent memory models to check whether it produces comparable stability.
- The method's behavior under occasional large or drifting inputs remains open and could be checked by injecting controlled outliers into long sequences (a sketch of such a check follows this list).
- VLA might be combined with existing long-context techniques such as sliding windows or sparse attention to push usable context length further.
- Existing linear attention implementations could adopt the update rule directly to measure accuracy gains on document-level retrieval tasks.
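One way to run the outlier check suggested above, as a sketch: drive a simplified unit-norm write rule (a stand-in for the actual VLA update, which would have to be substituted in) with random inputs, inject occasional large value vectors, and watch whether the state's Frobenius norm recovers or accumulates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 64, 8_192
S = np.zeros((d_h, d_h))

def unit_norm_write(S, k, v):
    # Stand-in write rule; replace with the real VLA update to test the paper's method
    k_hat = k / np.linalg.norm(k)
    return S + np.outer(v - S @ k_hat, k_hat)

norms = []
for t in range(T):
    k = rng.standard_normal(d_h)
    v = rng.standard_normal(d_h)
    if t % 1000 == 500:                  # occasional large outlier in the value stream
        v *= 100.0
    S = unit_norm_write(S, k, v)
    norms.append(np.linalg.norm(S, "fro"))

print(f"median ||S||_F = {np.median(norms):.2f}, max ||S||_F = {max(norms):.2f}")
```

A drifting input distribution could be probed the same way by adding a slowly growing bias to the value vectors.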
Load-bearing premise
Real inputs remain bounded and the adaptive penalty matrix stays well-conditioned for arbitrary sequence lengths and head dimensions.
What would settle it
Generate a sequence of 10,000 bounded random vectors, track the memory state Frobenius norm after each step, and verify whether it stays within a small constant factor of its initial value; if the norm grows without bound the stability claim is false.
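A sketch of that test under the same stand-in write rule (the actual VLA update would need to be dropped in for a real verdict); inputs are clipped to stay bounded, and the norm is compared against an early reference value rather than the zero initial state.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 64, 10_000
S = np.zeros((d_h, d_h))

def unit_norm_write(S, k, v):
    # Stand-in unit-norm write; substitute the actual VLA update to run the real test
    k_hat = k / np.linalg.norm(k)
    return S + np.outer(v - S @ k_hat, k_hat)

norms = []
for _ in range(T):
    k = rng.standard_normal(d_h)
    v = np.clip(rng.standard_normal(d_h), -1.0, 1.0)   # bounded inputs
    S = unit_norm_write(S, k, v)
    norms.append(np.linalg.norm(S, "fro"))

ratio = max(norms) / norms[99]          # growth relative to an early reference point
print(f"peak-to-early Frobenius norm ratio: {ratio:.2f}")
```

If the ratio stays near a small constant as T grows, the self-limiting claim survives; unbounded growth would falsify it.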
Original abstract
Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce \textbf{Variational Linear Attention} (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly $1$ for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T{=}1{,}000$, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective per-head memory regime ($n_\text{pairs} < d_h$), maintaining substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and maintains 62\% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves $14\times$ speedup over sequential Python and $\mathcal{O}(T)$ scaling, crossing below softmax attention latency at approximately 43\,000 tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Variational Linear Attention (VLA), which reframes the linear attention memory update as an online regularized least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 update formula. It provides two propositions: Proposition 1 establishes that the state norm is self-limiting under bounded inputs, and Proposition 2 proves that normalizing the write direction to unit length yields a recurrence Jacobian with spectral norm exactly 1 for all sequence lengths T and head dimensions d_h. Empirically, VLA achieves a 109× reduction in ||S_t||_F relative to standard linear attention at T=1000, near-perfect exact-match accuracy on multi-query associative recall when n_pairs < d_h, substantially higher retrieval performance than DeltaNet and standard linear attention under increasing load, 62% accuracy at the per-head capacity boundary, and a Triton-fused kernel delivering 14× speedup over sequential Python with O(T) scaling that undercuts softmax attention latency around 43,000 tokens.
Significance. If the stability guarantees hold, VLA supplies a theoretically grounded mechanism for preventing progressive memory interference in linear attention, addressing a key limitation that has hindered its use in long-context transformers. The clean application of Sherman-Morrison for efficient online updates and the explicit normalization step that enforces Jacobian spectral norm 1 constitute genuine strengths, offering falsifiable predictions about state boundedness that could be tested in broader settings. Combined with the practical kernel implementation, the work has clear potential to improve scaling of associative memory in sequence models without quadratic cost.
major comments (2)
- [§3, Propositions 1 and 2] Both stability claims rest on the assumptions that inputs remain bounded and that the adaptive penalty matrix P_t remains positive definite and well-conditioned for arbitrary T and d_h without further tuning. No explicit bound on cond(P_t) is derived, and the manuscript reports no empirical monitoring of the condition number under repeated rank-1 updates (especially when new keys align with the prior span), which risks the effective Jacobian spectral norm deviating from exactly 1 and introducing numerical instability in the inverse update. (A minimal monitoring sketch follows these major comments.)
- [§5 (Empirical results)] The reported 109× reduction in ||S_t||_F at T=1000 and the accuracy figures (near-perfect exact-match, 62% at capacity boundary) are presented without error analysis, number of runs, variance estimates, or statistical tests, which weakens the strength of the cross-baseline performance claims relative to DeltaNet and standard linear attention.
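A minimal monitoring sketch along the lines requested, assuming the penalty recurrence P_0 = λI plus unit-norm rank-1 terms as described in this review; the low-rank key schedule in the second half is an assumed stress case for keys that align with the prior span.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, T, lam = 64, 5_000, 1.0
P = lam * np.eye(d_k)                     # adaptive penalty, P_0 = lam * I
P_inv = np.eye(d_k) / lam                 # inverse maintained via Sherman-Morrison

worst_cond, worst_drift = 0.0, 0.0
basis = rng.standard_normal((d_k, 8))     # low-rank span to force aligned keys late on
for t in range(T):
    k = basis @ rng.standard_normal(8) if t > T // 2 else rng.standard_normal(d_k)
    k_hat = k / np.linalg.norm(k)
    P += np.outer(k_hat, k_hat)           # rank-1 growth of the penalty matrix
    Pk = P_inv @ k_hat                    # Sherman-Morrison update of its inverse
    P_inv -= np.outer(Pk, Pk) / (1.0 + k_hat @ Pk)
    worst_cond = max(worst_cond, np.linalg.cond(P))
    worst_drift = max(worst_drift, np.linalg.norm(P @ P_inv - np.eye(d_k)))

print(f"max cond(P_t) = {worst_cond:.3e}, max ||P P^-1 - I|| = {worst_drift:.3e}")
```

Tracking both cond(P_t) and the drift of the maintained inverse from the true inverse covers the two failure modes the comment raises.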
minor comments (2)
- [Abstract] The claim of 'near-perfect exact-match accuracy' and 'substantially higher retrieval performance' would benefit from a brief parenthetical note on the precise metric and the range of T and d_h tested.
- [§3] Notation: The definition of the adaptive penalty matrix P_t and its initialization should be stated explicitly in the main text before the propositions, as the recurrence relies on it.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the theoretical and practical contributions of Variational Linear Attention. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [§3, Propositions 1 and 2] Both stability claims rest on the assumptions that inputs remain bounded and that the adaptive penalty matrix P_t remains positive definite and well-conditioned for arbitrary T and d_h without further tuning. No explicit bound on cond(P_t) is derived, and the manuscript reports no empirical monitoring of the condition number under repeated rank-1 updates (especially when new keys align with the prior span), which risks the effective Jacobian spectral norm deviating from exactly 1 and introducing numerical instability in the inverse update.
Authors: We thank the referee for this observation on the assumptions. Proposition 1 explicitly assumes bounded inputs, consistent with standard analyses of recurrent dynamics. Proposition 2 proves that unit-length normalization of the write vector yields Jacobian spectral norm exactly 1, provided the Sherman-Morrison update remains defined, which requires P_t to stay positive definite. Positive definiteness holds by construction: P_0 = λI with λ > 0 and each rank-1 update preserves it. We do not claim an explicit analytic bound on cond(P_t), as deriving a tight worst-case bound under arbitrary alignments would require additional analysis beyond the current scope. However, the normalization step ensures the spectral-norm claim holds independently of conditioning whenever the inverse exists. To address the concern directly, the revised manuscript will (i) explicitly restate the positive-definiteness assumption in §3 and (ii) add empirical monitoring of cond(P_t) and ||S_t||_F over long sequences (T up to 10k) in §5, including cases where keys align with prior spans. These additions will confirm practical well-conditioning and mitigate any perceived risk of deviation from the theoretical Jacobian bound. (Revision: partial)
- Referee: [§5 (Empirical results)] The reported 109× reduction in ||S_t||_F at T=1000 and the accuracy figures (near-perfect exact-match, 62% at capacity boundary) are presented without error analysis, number of runs, variance estimates, or statistical tests, which weakens the strength of the cross-baseline performance claims relative to DeltaNet and standard linear attention.
Authors: We agree that the empirical section would be strengthened by statistical reporting. The current results were obtained from single runs for brevity, but we recognize this limits the strength of comparisons. In the revised manuscript we will: rerun all associative-recall and state-norm experiments over at least five independent random seeds; report means and standard deviations for the 109× reduction, exact-match accuracies, and load-sweep curves; add error bars to the relevant figures; and include paired statistical tests (e.g., t-tests) against DeltaNet and standard linear attention. These changes will be incorporated into §5 and the associated figures. (Revision: yes)
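For what the committed revision would look like in practice, a small sketch of per-seed reporting with a paired test; the accuracy arrays are random placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Placeholder per-seed exact-match accuracies (illustrative only, not the paper's numbers)
vla_acc      = rng.uniform(0.90, 1.00, size=5)
deltanet_acc = rng.uniform(0.80, 0.95, size=5)

print(f"VLA:      {vla_acc.mean():.3f} ± {vla_acc.std(ddof=1):.3f}")
print(f"DeltaNet: {deltanet_acc.mean():.3f} ± {deltanet_acc.std(ddof=1):.3f}")

t_stat, p_value = ttest_rel(vla_acc, deltanet_acc)   # paired across matched seeds
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```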
Circularity Check
No significant circularity; propositions follow from explicit normalization and Sherman-Morrison updates
full rationale
The paper's central stability claims (Propositions 1 and 2) are derived directly from the reframing as regularized least-squares with unit-length normalization of the write direction and the rank-1 Sherman-Morrison update for the penalty matrix. These steps produce the claimed Jacobian spectral norm of exactly 1 and the self-limiting state norm as algebraic consequences of the chosen update rule under the stated bounded-input assumption, without resting on fitted hyperparameters, self-citations, or tautological redefinitions. No performance numbers are forced by construction, and the derivation is self-contained, relying only on standard linear-algebra facts.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Sherman-Morrison rank-1 update formula for the matrix inverse
Reference graph
Works this paper leans on
- [1] S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, and C. Ré. Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2023.
- [2] G. E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, 1990.
- [3] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, Ł. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021.
- [4] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [5]
- [6] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), 2020.
- [7] H. Ramsauer, B. Schäfl, et al. Hopfield networks is all you need. In International Conference on Learning Representations, 2021.
- [8]
- [9] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
- [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [11] S. Yang et al. DeltaNet: Conditional state-space models. In International Conference on Machine Learning, 2024.