pith. sign in

arxiv: 2606.01294 · v1 · pith:5TSPYAH4new · submitted 2026-05-31 · 💻 cs.CL · cs.LG

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

Pith reviewed 2026-06-28 17:21 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords linear attentioncurvature-conditioned querykey covariancelong-context retrievalsoftmax log-partitionrecurrent statequery contractionin-context learning
0
0 comments X

The pith

Curvature from key covariance contracts the query to improve linear attention on long contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear attention maintains a recurrent state for efficiency but reads every past key additively, diluting useful targets amid the bulk of stored vectors. The paper derives a read-time contraction from a second-order Taylor expansion of the softmax log-partition around the isotropic-attention point. This expansion produces a local quadratic model whose curvature equals the running key covariance, a quantity trackable with the same recurrent mechanism as the linear-attention state. The resulting linear operator contracts the query along high-density directions of memory before it reads the state. The mechanism, called Curvature-Conditioned Query, modifies only the read step and composes with existing linear-attention backbones such as GLA and Gated DeltaNet.

Core claim

The paper establishes that the second-order Taylor expansion of the softmax log-partition at the isotropic-attention point yields a local quadratic model whose curvature coincides with the running key covariance, and that the associated linear operator contracts the query along the high-density directions of memory before reading the recurrent state, supplying a cheap read-time filter that improves in-context retrieval and long-context performance.

What carries the argument

Curvature-Conditioned Query (CCQ): the linear operator obtained from the Hessian of the softmax log-partition at the isotropic point, which equals the key covariance and contracts the query along high-density memory directions before interaction with the recurrent state.

If this is right

  • Attaching CCQ to GLA or Gated DeltaNet improves perplexity on long sequences.
  • CCQ raises zero-shot downstream accuracy and S-NIAH retrieval performance both inside and beyond the training context.
  • Length-extrapolation perplexity improves when context grows from 4K to 20K.
  • LongBench accuracy increases while adding only modest computational cost.
  • The read-side contraction works with any linear-attention write path.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The covariance-tracking approach could supply analogous contractions for attention variants whose geometry differs from softmax.
  • Read-time conditioning may turn out to be more efficient than further write-side gating when scaling context length.
  • Measuring how contraction strength correlates with retrieval accuracy on controlled synthetic tasks would test the mechanism directly.

Load-bearing premise

The second-order Taylor expansion of the softmax log-partition at the isotropic point supplies a curvature that usefully identifies high-density key directions for query contraction.

What would settle it

If attaching CCQ to GLA or Gated DeltaNet produces no gain or a loss in S-NIAH retrieval accuracy on sequences longer than the training context, the utility of the curvature-derived contraction would be refuted.

Figures

Figures reproduced from arXiv: 2606.01294 by Anh Tuan Luu, Cong-Duy Nguyen, Dong Le, Thong Nguyen.

Figure 1
Figure 1. Figure 1: Layer-by-layer geometric separation, aggre [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Length-extrapolation PPL at 500M over six [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Length-extrapolation PPL at 1.3B on the same [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-(layer, head) ∆µ heatmap for Qwen3-4B-Instruct (36 layers × 8 key heads). Each cell is ∆µ = µunique-needle − µdistractor on top-16 alignment, pooled across all prompts. Diverging colormap centred at zero; blue = negative (matches the CCQ premise), red = positive. Annotated value is ∆µ for that cell. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-(layer, head) ∆µ heatmap for Gated DeltaNet 500M (21 layers × 6 heads). Same colormap and pooling convention as [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
read the original abstract

Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax's geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Curvature-Conditioned Query (CCQ), a read-time modification for linear attention. A second-order Taylor expansion of the softmax log-partition at the isotropic point yields a local quadratic whose Hessian equals the key covariance; this covariance is maintained recurrently and used to define a linear contraction operator applied to the incoming query before it reads the fast-weight state. CCQ is presented as composable with any linear-attention backbone and is evaluated on GLA and Gated DeltaNet, reporting gains in perplexity, zero-shot accuracy, S-NIAH retrieval (in- and out-of-context), length extrapolation (4K to 20K), and LongBench at modest extra cost.

Significance. If the local Taylor-derived contraction remains effective for learned queries, the approach supplies a lightweight, geometry-motivated fix for the additive dilution problem in linear attention without altering the write path. The composability claim and the reuse of an already-maintained covariance statistic are attractive; successful validation would be a modest but concrete advance for efficient long-context models.

major comments (1)
  1. [§3.2] §3.2 (derivation of the CCQ operator): the second-order Taylor expansion of log Z(q) is performed at the isotropic point q_iso; the resulting Hessian is identified with the key covariance Σ. No bound on the remainder term or estimate of typical ||q − q_iso|| for the learned query projections is supplied. Because the contraction operator is applied directly to arbitrary queries, the local nature of the approximation is load-bearing for the central claim that the operator preferentially suppresses high-density directions.
minor comments (2)
  1. [§4] The experimental section should report standard deviations across seeds and explicit ablations that isolate the contraction operator from other implementation choices (e.g., gating or chunking).
  2. [§3.1] Notation for the isotropic point and the precise definition of the contraction matrix (e.g., whether it is Σ^{-1/2} or a regularized variant) should be stated once in a single equation block for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the local nature of the Taylor approximation in §3.2. The observation correctly identifies that the validity of the curvature-based contraction rests on the quality of the second-order expansion for the queries encountered in practice. We address this directly below and will strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (derivation of the CCQ operator): the second-order Taylor expansion of log Z(q) is performed at the isotropic point q_iso; the resulting Hessian is identified with the key covariance Σ. No bound on the remainder term or estimate of typical ||q − q_iso|| for the learned query projections is supplied. Because the contraction operator is applied directly to arbitrary queries, the local nature of the approximation is load-bearing for the central claim that the operator preferentially suppresses high-density directions.

    Authors: We acknowledge that the manuscript does not supply an analytic bound on the Lagrange remainder or a quantitative estimate of ||q − q_iso||. Deriving a uniform bound that holds for arbitrary learned query projections is non-trivial because the remainder depends on the third- and higher-order derivatives of log Z, which are themselves data-dependent. In the revised version we will add an empirical section that reports the distribution of ||q − q_iso|| (in the appropriate norm) measured on held-out sequences from the training distribution for both GLA and Gated DeltaNet backbones. We will also include a short discussion of how the observed distances relate to the scale at which the quadratic term dominates, thereby providing concrete support for the claim that the contraction preferentially damps high-density directions. This addition does not alter the core derivation but directly addresses the load-bearing assumption. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external Taylor expansion

full rationale

The paper motivates its Curvature-Conditioned Query by applying a standard second-order Taylor expansion to the softmax log-partition at an isotropic point, identifying the resulting Hessian with the empirical key covariance that is already maintained by the linear-attention state. This covariance is computed directly from stored keys rather than fitted to any target metric, and the contraction operator is obtained from the external approximation rather than defined in terms of the method's own outputs. No self-citations, fitted-input predictions, or self-definitional steps appear in the provided derivation chain; the construction therefore remains independent of its own results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; the ledger is therefore limited to the explicit modeling choice stated in the abstract.

axioms (1)
  • domain assumption The second-order Taylor expansion of the softmax log-partition at the isotropic-attention point supplies a usable local quadratic model for query contraction.
    Invoked to justify that curvature coincides with key covariance and yields a beneficial read-time operator.

pith-pipeline@v0.9.1-grok · 5772 in / 1252 out tokens · 20122 ms · 2026-06-28T17:21:14.480731+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 2 canonical work pages

  1. [1]

    Advances in Neural Information Processing Systems , volume =

    Attention is All You Need , author =. Advances in Neural Information Processing Systems , volume =

  2. [2]

    Transformers are

    Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran. Transformers are. Proceedings of the 37th International Conference on Machine Learning , series =. 2020 , url =

  3. [3]

    Neural Computation , volume =

    Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks , author =. Neural Computation , volume =. 1992 , doi =

  4. [4]

    Proceedings of the 38th International Conference on Machine Learning , series =

    Linear Transformers Are Secretly Fast Weight Programmers , author =. Proceedings of the 38th International Conference on Machine Learning , series =

  5. [5]

    Advances in Neural Information Processing Systems , year =

    Parallelizing Linear Transformers with the Delta Rule over Sequence Length , author =. Advances in Neural Information Processing Systems , year =

  6. [6]

    The Thirteenth International Conference on Learning Representations , year =

    Gated Delta Networks: Improving Mamba2 with Delta Rule , author =. The Thirteenth International Conference on Learning Representations , year =

  7. [7]

    and Lajoie, Guillaume and Frenkel, Charlotte and Pascanu, Razvan and Ag

    von Oswald, Johannes and Scherrer, Nino and Kobayashi, Seijin and Versari, Luca and Yang, Songlin and Schlegel, Maximilian and Maile, Kaitlin and Schimpf, Yanick and Sieberling, Oliver and Meulemans, Alexander and Saurous, Rif A. and Lajoie, Guillaume and Frenkel, Charlotte and Pascanu, Razvan and Ag. The Fourteenth International Conference on Learning Re...

  8. [8]

    Proceedings of the 41st International Conference on Machine Learning , series =

    Gated Linear Attention Transformers with Hardware-Efficient Training , author =. Proceedings of the 41st International Conference on Machine Learning , series =

  9. [9]

    arXiv preprint arXiv:2307.08621 , year =

    Retentive Network: A Successor to Transformer for Large Language Models , author =. arXiv preprint arXiv:2307.08621 , year =

  10. [10]

    and Wozniak, Stanislaw and Zhang, Ruichong and Zhang, Zhenyuan and Zhao, Qihang and Zhou, Peng and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie , booktitle =

    Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Lin, Jiaju and Kazienko, Przemyslaw and Kocon, Jan and Kong, Jiaming and Koptyra, Bartlomiej and Lau, Hayden and Mantri, ...

  11. [11]

    First Conference on Language Modeling , year =

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces , author =. First Conference on Language Modeling , year =

  12. [12]

    Transformers are

    Dao, Tri and Gu, Albert , booktitle =. Transformers are. 2024 , url =

  13. [13]

    International Conference on Learning Representations , year =

    Rethinking Attention with Performers , author =. International Conference on Learning Representations , year =

  14. [14]

    International Conference on Learning Representations , year =

    Random Feature Attention , author =. International Conference on Learning Representations , year =

  15. [15]

    , booktitle =

    Kasai, Jungo and Peng, Hao and Zhang, Yizhe and Yogatama, Dani and Ilharco, Gabriel and Pappas, Nikolaos and Mao, Yi and Chen, Weizhu and Smith, Noah A. , booktitle =. Finetuning Pretrained Transformers into. 2021 , doi =

  16. [16]

    The Twelfth International Conference on Learning Representations , year =

    The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry , author =. The Twelfth International Conference on Learning Representations , year =

  17. [17]

    International Conference on Learning Representations , year =

    Zoology: Measuring and Improving Recall in Efficient Language Models , author =. International Conference on Learning Representations , year =

  18. [18]

    Proceedings of the 41st International Conference on Machine Learning , series =

    Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff , author =. Proceedings of the 41st International Conference on Machine Learning , series =

  19. [19]

    2024 , url =

    Hsieh, Cheng-Ping and Sun, Simeng and Kriman, Samuel and Acharya, Shantanu and Rekesh, Dima and Jia, Fei and Ginsburg, Boris , booktitle =. 2024 , url =

  20. [20]

    LongBench:

    Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi. L ong B ench: A Bilingual, Multitask Benchmark for Long Context Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computation...

  21. [21]

    2021 , publisher =

    Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

  22. [22]

    Penedo, Guilherme and Kydl. The. Advances in Neural Information Processing Systems , year =

  23. [23]

    Learning to (

    Sun, Yu and Li, Xinhao and Dalal, Karan and Xu, Jiarui and Vikram, Arjun and Zhang, Genghan and Dubois, Yann and Chen, Xinlei and Wang, Xiaolong and Koyejo, Sanmi and Hashimoto, Tatsunori and Guestrin, Carlos , booktitle =. Learning to (. 2025 , url =

  24. [24]

    Advances in Neural Information Processing Systems , volume=

    Titans: Learning to memorize at test time , author=. Advances in Neural Information Processing Systems , volume=

  25. [25]

    Advances in Neural Information Processing Systems , year =

    Beck, Maximilian and P. Advances in Neural Information Processing Systems , year =

  26. [26]

    International Conference on Learning Representations , year =

    Hopfield Networks is All You Need , author =. International Conference on Learning Representations , year =

  27. [27]

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =

    Transformer Feed-Forward Layers Are Key-Value Memories , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =. 2021 , doi =

  28. [28]

    Numerical Optimization , author =

  29. [29]

    Convex Optimization , author =

  30. [30]

    Foundations and Trends in Machine Learning , volume =

    Graphical Models, Exponential Families, and Variational Inference , author =. Foundations and Trends in Machine Learning , volume =. 2008 , doi =