pith. machine review for the scientific record.

arxiv: 2605.09778 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Nectar: Neural Estimation of Cached-Token Attention via Regression

João Monteiro, Marco Cuturi, Michal Klein, Pierre Ablin

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords attention approximation · long-context transformers · neural regression · KV cache efficiency · inference optimization · softmax attention · context scaling

The pith

A compact neural network can approximate full attention over long cached contexts, replacing the linear scan with a fixed-cost forward pass while keeping generated text semantically equivalent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that, for a fixed long context, the attention output is a deterministic function of the query, and shows that this function can be learned by a small neural network. By training separate networks per layer and key-value head to predict both the attention output and its log-normalizer, the method avoids scanning the entire cache for each new token. Experiments on models up to 8 billion parameters across long-context datasets show that the approximation error closely tracks the drop in next-token prediction accuracy relative to full attention. Non-uniform allocation of network capacity across layers further reduces this gap. Generations from models using the approximation produce text whose meaning matches that of full-cache attention.
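In the notation of standard scaled dot-product attention (a sketch; the paper's exact parameterization may differ), a frozen cache of $n$ key-value pairs $(k_i, v_i)$ with head dimension $d$ sends each query $q$ to

  $Z(q) = \sum_{i=1}^{n} \exp\!\left(q^\top k_i / \sqrt{d}\right)$,  $o(q) = \dfrac{1}{Z(q)} \sum_{i=1}^{n} \exp\!\left(q^\top k_i / \sqrt{d}\right) v_i$.

Nectar fits a target network $f_\theta(q) \approx o(q)$ and a score network $g_\phi(q) \approx \log Z(q)$, so the cost of evaluating the pair does not depend on $n$.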

Core claim

Nectar fits two compact neural networks per transformer layer and key-value head: one that regresses the attention output directly and another that predicts the log-normalizer of the softmax. At inference these replace the standard attention computation over the cached keys and values, incurring a cost independent of context length. The approach is trained on queries sampled from a task-relevant distribution and evaluated on actual generation trajectories.

What carries the argument

Nectar module: a target network predicting attention outputs paired with a score network predicting the log-normalizer, inserted into masked self-attention to bypass cache scanning.
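A minimal sketch of what such a module could look like for a single layer and KV-head follows. The widths, depth, and activation are placeholder assumptions, not the paper's architecture; the exact_attention helper is the reference computation the module is trained to imitate.

  # Minimal sketch of a Nectar-style module for one layer / KV-head (assumed sizes).
  import torch
  import torch.nn as nn

  class NectarHead(nn.Module):
      def __init__(self, d_head: int, hidden: int = 512):
          super().__init__()
          # target network: query -> approximate attention output (d_head values)
          self.target = nn.Sequential(
              nn.Linear(d_head, hidden), nn.GELU(), nn.Linear(hidden, d_head))
          # score network: query -> approximate log-normalizer (one scalar)
          self.score = nn.Sequential(
              nn.Linear(d_head, hidden), nn.GELU(), nn.Linear(hidden, 1))

      def forward(self, q: torch.Tensor):
          # a fixed-cost forward pass, independent of how many tokens are cached
          return self.target(q), self.score(q).squeeze(-1)

  def exact_attention(q, K, V):
      # the O(n) cache scan being replaced: softmax attention and its log-normalizer
      scores = (q @ K.transpose(-1, -2)) / K.shape[-1] ** 0.5
      return torch.softmax(scores, dim=-1) @ V, torch.logsumexp(scores, dim=-1)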

If this is right

  • The approximation error of the networks tracks the next-token accuracy gap to full attention.
  • Allocating capacity non-uniformly across layers reduces the accuracy gap more than uniform allocation.
  • Text generations from Nectar-equipped models match the semantic content of full-cache generations on question prompts.
  • The parameter count per module stays much smaller than the KV-cache footprint it replaces (a rough size comparison is sketched below).
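The size claim in the last bullet can be made concrete with a back-of-envelope count. The context length, head dimension, and hidden width below are hypothetical choices for illustration, not numbers reported in the paper.

  # Hypothetical per-KV-head size comparison (illustrative numbers only).
  n, d, h = 128_000, 128, 512                  # cached tokens, head dim, MLP hidden width
  kv_cache_values = 2 * n * d                  # keys + values held in the cache
  target_params = (d * h + h) + (h * d + d)    # Linear(d, h) + Linear(h, d), with biases
  score_params = (d * h + h) + (h * 1 + 1)     # Linear(d, h) + Linear(h, 1), with biases
  print(kv_cache_values, target_params + score_params)
  # ~32.8M cached values versus ~0.2M module parameters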

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This substitution could allow context lengths to grow without a matching rise in per-token compute.
  • The regression approach might extend to approximating other expensive deterministic operations inside large models.
  • Adaptive per-layer sizing of the networks could further tighten the accuracy-compute trade-off beyond the tested ablations.

Load-bearing premise

A compact neural network trained on task-relevant queries can closely approximate the exact attention function and will continue to do so for the distribution of queries encountered during actual text generation.
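A minimal sketch of the supervised fit this premise describes, reusing the NectarHead and exact_attention sketches above; the batch size, loss weighting, and optimizer settings are assumptions rather than the paper's training recipe.

  # Regress one module onto exact attention outputs and log-normalizers computed
  # from the frozen cache, for queries drawn from a task-relevant distribution.
  import torch
  import torch.nn.functional as F

  def fit_nectar_head(module, queries, K, V, steps=10_000, lr=1e-3, batch=256):
      opt = torch.optim.Adam(module.parameters(), lr=lr)
      for _ in range(steps):
          q = queries[torch.randint(0, queries.shape[0], (batch,))]
          with torch.no_grad():
              out_true, logz_true = exact_attention(q, K, V)   # supervision targets
          out_pred, logz_pred = module(q)
          loss = F.mse_loss(out_pred, out_true) + F.mse_loss(logz_pred, logz_true)
          opt.zero_grad()
          loss.backward()
          opt.step()
      return module

Whether the fitted module stays accurate then hinges on whether the queries seen during generation fall inside the distribution these training queries cover.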

What would settle it

A measurable divergence in the semantic content of text generated by the same model when using Nectar versus full attention on the same long-context question prompts would falsify the claim.
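One way such a divergence could be measured, assuming paired generations from the two settings are available: embed each pair and flag low-similarity pairs. The embedding model and threshold are arbitrary placeholder choices, not the paper's evaluation protocol.

  # Compare paired generations from Nectar vs. full-cache attention by embedding similarity.
  from sentence_transformers import SentenceTransformer, util

  def semantic_divergence(nectar_texts, full_cache_texts, threshold=0.85):
      encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder encoder choice
      a = encoder.encode(nectar_texts, convert_to_tensor=True)
      b = encoder.encode(full_cache_texts, convert_to_tensor=True)
      sims = util.cos_sim(a, b).diagonal()                # similarity of each matched pair
      return sims, int((sims < threshold).sum())          # number of pairs that diverge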

read the original abstract

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|\theta|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Nectar, which fits compact per-layer, per-KV-head neural networks (a target network for the attention output vector and a score network for the log-normalizer) via supervised regression to the deterministic softmax attention function, using queries sampled from a task-relevant distribution. At inference this replaces the O(n) cache attention with a fixed-cost forward pass whose size is independent of context length n. Experiments on 1.7B–8B models across five long-context datasets are said to show that approximation error tracks the next-token accuracy gap to full attention, that non-uniform capacity allocation across layers reduces the gap, and that generations remain semantically matched to full-cache outputs.

Significance. If the generalization from training queries to inference-time hidden states holds, the method offers a practical route to constant-time attention for long contexts, with the non-uniform allocation ablation providing a useful design insight. The regression-to-deterministic-function framing is clean and the semantic-equivalence check is a relevant practical metric. However, the current evidence base is too thin to establish the result as load-bearing for the field.

major comments (3)
  1. [Abstract] Abstract and experimental results: the central claim that 'approximation error tracks the next-token accuracy gap' is stated without any quantitative tables, correlation coefficients, error bars, or baseline comparisons, so the strength of the tracking cannot be evaluated.
  2. [Experiments] Experiments section: no explicit validation is provided that the distribution of queries used for fitting matches the statistics of hidden states arising during autoregressive generation on the target tasks; if a shift exists, the reported error tracking and semantic equivalence may not hold at inference.
  3. [Ablation] Ablation on capacity allocation: the claim that non-uniform allocation reduces the accuracy gap is presented without the actual per-layer error or accuracy numbers, making it impossible to judge whether the improvement is material or merely an artifact of the otherwise limited quantitative support.
minor comments (2)
  1. [Abstract] The abstract should include at least one concrete table or figure reference showing the reported error–accuracy correlation.
  2. [Method] Training details (optimizer, learning rate, number of samples per head, exact architecture of the target/score networks) are missing and should be supplied for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the quantitative evidence and add necessary validations.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the central claim that 'approximation error tracks the next-token accuracy gap' is stated without any quantitative tables, correlation coefficients, error bars, or baseline comparisons, so the strength of the tracking cannot be evaluated.

    Authors: We agree that the abstract claim requires quantitative support to be fully evaluable. In the revised manuscript we will add a dedicated table in the Experiments section reporting per-model and per-dataset approximation error (MSE and cosine similarity to full attention) alongside the next-token accuracy gap. We will include Pearson correlation coefficients between these quantities, error bars computed over multiple random seeds, and a uniform-capacity baseline for comparison. This will allow direct assessment of how closely the tracking holds. revision: yes

  2. Referee: [Experiments] Experiments section: no explicit validation is provided that the distribution of queries used for fitting matches the statistics of hidden states arising during autoregressive generation on the target tasks; if a shift exists, the reported error tracking and semantic equivalence may not hold at inference.

    Authors: This is a fair and important point. Our query sampling procedure drew from hidden-state activations collected while running the base model on task-relevant long-context sequences (books, manuals, legal text) during data preparation. To address the concern explicitly, the revision will include a new subsection that compares first- and second-order statistics (means, variances, and selected quantiles) of the regression training queries against queries extracted from full autoregressive rollouts on the five evaluation datasets (a minimal sketch of such a comparison follows this response list). Any observed shifts will be quantified and discussed with respect to their potential impact on the reported metrics. revision: yes

  3. Referee: [Ablation] Ablation on capacity allocation: the claim that non-uniform allocation reduces the accuracy gap is presented without the actual per-layer error or accuracy numbers, making it impossible to judge whether the improvement is material or merely an artifact of the otherwise limited quantitative support.

    Authors: We acknowledge that the ablation description is insufficient without the underlying numbers. The revised manuscript will expand the ablation subsection with a table (or supplementary figure) that lists per-layer approximation error and next-token accuracy gap for both the uniform and non-uniform capacity allocations. This will make clear the magnitude of the improvement and identify which layers benefit most from the non-uniform budget. revision: yes
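As a rough illustration of the statistics comparison promised in response 2, a check along these lines could be run per layer and head; the quantile grid and summary numbers are assumptions, not the authors' planned analysis.

  # Compare first/second-order statistics and selected quantiles of training-time
  # queries against queries extracted from full autoregressive rollouts.
  import numpy as np

  def query_shift_report(train_q, rollout_q, qs=(0.05, 0.25, 0.5, 0.75, 0.95)):
      mean_shift = np.abs(train_q.mean(axis=0) - rollout_q.mean(axis=0))
      var_ratio = train_q.var(axis=0) / (rollout_q.var(axis=0) + 1e-8)
      q_gap = np.abs(np.quantile(train_q, qs, axis=0) - np.quantile(rollout_q, qs, axis=0))
      return {"max_mean_shift": float(mean_shift.max()),
              "median_var_ratio": float(np.median(var_ratio)),
              "max_quantile_gap": float(q_gap.max())}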

Circularity Check

0 steps flagged

No circularity: explicit supervised regression to computed attention outputs with empirical validation of generalization

full rationale

The paper trains compact target and score networks via supervised regression directly on attention outputs and log-normalizers computed from the full KV cache for queries sampled from a task-relevant distribution. The reported claims—that approximation error correlates with next-token accuracy gaps, that non-uniform capacity allocation reduces the gap, and that generated text matches full-cache semantics—are presented as empirical observations measured on held-out data and generation tasks rather than as algebraic identities or tautologies following from the fitting equations themselves. No load-bearing premise reduces by the paper's own definitions or self-citations to a renamed input; the method is self-contained against the external benchmark of exact attention computation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the determinism of attention for fixed KV cache and standard assumptions of neural network regression; no new physical entities or ad-hoc constants beyond the learned network weights are introduced.

free parameters (1)
  • network parameters |θ|
    Weights of the target and score networks are fitted to match true attention outputs on sampled queries.
axioms (1)
  • domain assumption: Attention output is a deterministic function of the query given a fixed KV cache.
    Explicitly stated as the basis for fitting a network to the function.

pith-pipeline@v0.9.0 · 5542 in / 1293 out tokens · 32336 ms · 2026-05-12T02:35:31.779639+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
