pith. machine review for the scientific record.

arxiv: 2605.09778 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Nectar: Neural Estimation of Cached-Token Attention via Regression

João Monteiro, Marco Cuturi, Michal Klein, Pierre Ablin

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords attention approximation · long-context transformers · neural regression · KV cache efficiency · inference optimization · softmax attention · context scaling

The pith

A compact neural network can approximate full attention over long cached contexts, replacing the linear scan with a fixed-cost forward pass while keeping generated text semantically equivalent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that, for a fixed long context, the attention output is a deterministic function of the query, and shows that this function can be learned by a small neural network. By training separate networks per layer and key-value head to predict both the attention output and its log-normalizer, the method avoids scanning the entire cache for each new token. Experiments on models up to 8 billion parameters across long-context datasets show that the approximation error closely tracks the drop in next-token prediction accuracy relative to full attention. Non-uniform allocation of network capacity across layers further reduces this gap. Generations from models using the approximation produce text whose meaning matches that of full-cache attention.
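In the notation of standard scaled dot-product attention (a sketch; the paper's exact parameterization may differ), a frozen cache of $n$ key-value pairs $(k_i, v_i)$ with head dimension $d$ sends each query $q$ to

  $Z(q) = \sum_{i=1}^{n} \exp\!\left(q^\top k_i / \sqrt{d}\right)$,  $o(q) = \dfrac{1}{Z(q)} \sum_{i=1}^{n} \exp\!\left(q^\top k_i / \sqrt{d}\right) v_i$.

Nectar fits a target network $f_\theta(q) \approx o(q)$ and a score network $g_\phi(q) \approx \log Z(q)$, so the cost of evaluating the pair does not depend on $n$.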

Core claim

Nectar fits two compact neural networks per transformer layer and key-value head: one that regresses the attention output directly and another that predicts the log-normalizer of the softmax. At inference these replace the standard attention computation over the cached keys and values, incurring a cost independent of context length. The approach is trained on queries sampled from a task-relevant distribution and evaluated on actual generation trajectories.

What carries the argument

Nectar module: a target network predicting attention outputs paired with a score network predicting the log-normalizer, inserted into masked self-attention to bypass cache scanning.
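A minimal sketch of what such a module could look like for a single layer and KV-head follows. The widths, depth, and activation are placeholder assumptions, not the paper's architecture; the exact_attention helper is the reference computation the module is trained to imitate.

  # Minimal sketch of a Nectar-style module for one layer / KV-head (assumed sizes).
  import torch
  import torch.nn as nn

  class NectarHead(nn.Module):
      def __init__(self, d_head: int, hidden: int = 512):
          super().__init__()
          # target network: query -> approximate attention output (d_head values)
          self.target = nn.Sequential(
              nn.Linear(d_head, hidden), nn.GELU(), nn.Linear(hidden, d_head))
          # score network: query -> approximate log-normalizer (one scalar)
          self.score = nn.Sequential(
              nn.Linear(d_head, hidden), nn.GELU(), nn.Linear(hidden, 1))

      def forward(self, q: torch.Tensor):
          # a fixed-cost forward pass, independent of how many tokens are cached
          return self.target(q), self.score(q).squeeze(-1)

  def exact_attention(q, K, V):
      # the O(n) cache scan being replaced: softmax attention and its log-normalizer
      scores = (q @ K.transpose(-1, -2)) / K.shape[-1] ** 0.5
      return torch.softmax(scores, dim=-1) @ V, torch.logsumexp(scores, dim=-1)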

If this is right

  • The approximation error of the networks tracks the next-token accuracy gap to full attention.
  • Allocating capacity non-uniformly across layers reduces the accuracy gap more than uniform allocation.
  • Text generations from Nectar-equipped models match the semantic content of full-cache generations on question prompts.
  • The parameter count per module stays much smaller than the KV-cache footprint it replaces (a rough size comparison is sketched below).
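The size claim in the last bullet can be made concrete with a back-of-envelope count. The context length, head dimension, and hidden width below are hypothetical choices for illustration, not numbers reported in the paper.

  # Hypothetical per-KV-head size comparison (illustrative numbers only).
  n, d, h = 128_000, 128, 512                  # cached tokens, head dim, MLP hidden width
  kv_cache_values = 2 * n * d                  # keys + values held in the cache
  target_params = (d * h + h) + (h * d + d)    # Linear(d, h) + Linear(h, d), with biases
  score_params = (d * h + h) + (h * 1 + 1)     # Linear(d, h) + Linear(h, 1), with biases
  print(kv_cache_values, target_params + score_params)
  # ~32.8M cached values versus ~0.2M module parameters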

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This substitution could allow context lengths to grow without a matching rise in per-token compute.
  • The regression approach might extend to approximating other expensive deterministic operations inside large models.
  • Adaptive per-layer sizing of the networks could further tighten the accuracy-compute trade-off beyond the tested ablations.

Load-bearing premise

A compact neural network trained on task-relevant queries can closely approximate the exact attention function and will continue to do so for the distribution of queries encountered during actual text generation.
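A minimal sketch of the supervised fit this premise describes, reusing the NectarHead and exact_attention sketches above; the batch size, loss weighting, and optimizer settings are assumptions rather than the paper's training recipe.

  # Regress one module onto exact attention outputs and log-normalizers computed
  # from the frozen cache, for queries drawn from a task-relevant distribution.
  import torch
  import torch.nn.functional as F

  def fit_nectar_head(module, queries, K, V, steps=10_000, lr=1e-3, batch=256):
      opt = torch.optim.Adam(module.parameters(), lr=lr)
      for _ in range(steps):
          q = queries[torch.randint(0, queries.shape[0], (batch,))]
          with torch.no_grad():
              out_true, logz_true = exact_attention(q, K, V)   # supervision targets
          out_pred, logz_pred = module(q)
          loss = F.mse_loss(out_pred, out_true) + F.mse_loss(logz_pred, logz_true)
          opt.zero_grad()
          loss.backward()
          opt.step()
      return module

Whether the fitted module stays accurate then hinges on whether the queries seen during generation fall inside the distribution these training queries cover.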

What would settle it

A measurable divergence in the semantic content of text generated by the same model when using Nectar versus full attention on the same long-context question prompts would falsify the claim.
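One way such a divergence could be measured, assuming paired generations from the two settings are available: embed each pair and flag low-similarity pairs. The embedding model and threshold are arbitrary placeholder choices, not the paper's evaluation protocol.

  # Compare paired generations from Nectar vs. full-cache attention by embedding similarity.
  from sentence_transformers import SentenceTransformer, util

  def semantic_divergence(nectar_texts, full_cache_texts, threshold=0.85):
      encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder encoder choice
      a = encoder.encode(nectar_texts, convert_to_tensor=True)
      b = encoder.encode(full_cache_texts, convert_to_tensor=True)
      sims = util.cos_sim(a, b).diagonal()                # similarity of each matched pair
      return sims, int((sims < threshold).sum())          # number of pairs that diverge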

read the original abstract

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|\theta|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Nectar, which fits compact per-layer, per-KV-head neural networks (a target network for the attention output vector and a score network for the log-normalizer) via supervised regression to the deterministic softmax attention function, using queries sampled from a task-relevant distribution. At inference this replaces the O(n) cache attention with a fixed-cost forward pass whose size is independent of context length n. Experiments on 1.7B–8B models across five long-context datasets are said to show that approximation error tracks the next-token accuracy gap to full attention, that non-uniform capacity allocation across layers reduces the gap, and that generations remain semantically matched to full-cache outputs.

Significance. If the generalization from training queries to inference-time hidden states holds, the method offers a practical route to constant-time attention for long contexts, with the non-uniform allocation ablation providing a useful design insight. The regression-to-deterministic-function framing is clean and the semantic-equivalence check is a relevant practical metric. However, the current evidence base is too thin to establish the result as load-bearing for the field.

major comments (3)
  1. [Abstract] Abstract and experimental results: the central claim that 'approximation error tracks the next-token accuracy gap' is stated without any quantitative tables, correlation coefficients, error bars, or baseline comparisons, so the strength of the tracking cannot be evaluated.
  2. [Experiments] Experiments section: no explicit validation is provided that the distribution of queries used for fitting matches the statistics of hidden states arising during autoregressive generation on the target tasks; if a shift exists, the reported error tracking and semantic equivalence may not hold at inference.
  3. [Ablation] Ablation on capacity allocation: the claim that non-uniform allocation reduces the accuracy gap is presented without the actual per-layer error or accuracy numbers, making it impossible to judge whether the improvement is material or merely an artifact of the otherwise limited quantitative support.
minor comments (2)
  1. [Abstract] The abstract should include at least one concrete table or figure reference showing the reported error–accuracy correlation.
  2. [Method] Training details (optimizer, learning rate, number of samples per head, exact architecture of the target/score networks) are missing and should be supplied for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the quantitative evidence and add necessary validations.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the central claim that 'approximation error tracks the next-token accuracy gap' is stated without any quantitative tables, correlation coefficients, error bars, or baseline comparisons, so the strength of the tracking cannot be evaluated.

    Authors: We agree that the abstract claim requires quantitative support to be fully evaluable. In the revised manuscript we will add a dedicated table in the Experiments section reporting per-model and per-dataset approximation error (MSE and cosine similarity to full attention) alongside the next-token accuracy gap. We will include Pearson correlation coefficients between these quantities, error bars computed over multiple random seeds, and a uniform-capacity baseline for comparison. This will allow direct assessment of how closely the tracking holds. revision: yes

  2. Referee: [Experiments] Experiments section: no explicit validation is provided that the distribution of queries used for fitting matches the statistics of hidden states arising during autoregressive generation on the target tasks; if a shift exists, the reported error tracking and semantic equivalence may not hold at inference.

    Authors: This is a fair and important point. Our query sampling procedure drew from hidden-state activations collected while running the base model on task-relevant long-context sequences (books, manuals, legal text) during data preparation. To address the concern explicitly, the revision will include a new subsection that compares first- and second-order statistics (means, variances, and selected quantiles) of the regression training queries against queries extracted from full autoregressive rollouts on the five evaluation datasets (a minimal sketch of such a comparison follows this response list). Any observed shifts will be quantified and discussed with respect to their potential impact on the reported metrics. revision: yes

  3. Referee: [Ablation] Ablation on capacity allocation: the claim that non-uniform allocation reduces the accuracy gap is presented without the actual per-layer error or accuracy numbers, making it impossible to judge whether the improvement is material or merely an artifact of the otherwise limited quantitative support.

    Authors: We acknowledge that the ablation description is insufficient without the underlying numbers. The revised manuscript will expand the ablation subsection with a table (or supplementary figure) that lists per-layer approximation error and next-token accuracy gap for both the uniform and non-uniform capacity allocations. This will make clear the magnitude of the improvement and identify which layers benefit most from the non-uniform budget. revision: yes
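As a rough illustration of the statistics comparison promised in response 2, a check along these lines could be run per layer and head; the quantile grid and summary numbers are assumptions, not the authors' planned analysis.

  # Compare first/second-order statistics and selected quantiles of training-time
  # queries against queries extracted from full autoregressive rollouts.
  import numpy as np

  def query_shift_report(train_q, rollout_q, qs=(0.05, 0.25, 0.5, 0.75, 0.95)):
      mean_shift = np.abs(train_q.mean(axis=0) - rollout_q.mean(axis=0))
      var_ratio = train_q.var(axis=0) / (rollout_q.var(axis=0) + 1e-8)
      q_gap = np.abs(np.quantile(train_q, qs, axis=0) - np.quantile(rollout_q, qs, axis=0))
      return {"max_mean_shift": float(mean_shift.max()),
              "median_var_ratio": float(np.median(var_ratio)),
              "max_quantile_gap": float(q_gap.max())}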

Circularity Check

0 steps flagged

No circularity: explicit supervised regression to computed attention outputs with empirical validation of generalization

full rationale

The paper trains compact target and score networks via supervised regression directly on attention outputs and log-normalizers computed from the full KV cache for queries sampled from a task-relevant distribution. The reported claims—that approximation error correlates with next-token accuracy gaps, that non-uniform capacity allocation reduces the gap, and that generated text matches full-cache semantics—are presented as empirical observations measured on held-out data and generation tasks rather than as algebraic identities or tautologies following from the fitting equations themselves. No load-bearing premise reduces by the paper's own definitions or self-citations to a renamed input; the method is self-contained against the external benchmark of exact attention computation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the determinism of attention for fixed KV cache and standard assumptions of neural network regression; no new physical entities or ad-hoc constants beyond the learned network weights are introduced.

free parameters (1)
  • network parameters |θ|
    Weights of the target and score networks are fitted to match true attention outputs on sampled queries.
axioms (1)
  • domain assumption: Attention output is a deterministic function of the query given a fixed KV cache.
    Explicitly stated as the basis for fitting a network to the function.

pith-pipeline@v0.9.0 · 5542 in / 1293 out tokens · 32336 ms · 2026-05-12T02:35:31.779639+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
