RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
Pith reviewed 2026-05-15 20:56 UTC · model grok-4.3
The pith
RAT+ lets a single densely pretrained model switch to dilated sparse attention at inference after only a short adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D=16 and drops by about 2-3 points at D=64 on commonsense reasoning and LongBench tasks. Scaled further to 2.6B and 7.6B parameters, the paper reports even stronger results (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size).
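To make the efficiency knob concrete, here is a minimal sketch of inference-time dilated attention over a KV cache. It is not the authors' code: the function name, the (T, heads, dim) cache layout, and the convention of counting the stride backwards from the current token are all assumptions. The point it illustrates is that the same weights serve any D; only the index set changes.

```python
# Hedged sketch: one decoding step of dilated attention over a standard KV cache.
# Assumptions (not from the paper): cache layout (T, heads, dim); the stride is
# counted back from the current token so the newest key is always retained.
import torch
import torch.nn.functional as F

def dilated_attention_step(q, k_cache, v_cache, D=1):
    """q: (heads, dim); k_cache, v_cache: (T, heads, dim)."""
    T = k_cache.shape[0]
    idx = torch.arange(T - 1, -1, -D).flip(0)        # every D-th position, causally aligned
    k = k_cache[idx].transpose(0, 1)                 # (heads, ceil(T/D), dim)
    v = v_cache[idx].transpose(0, 1)
    scores = (q.unsqueeze(1) @ k.transpose(1, 2)) / q.shape[-1] ** 0.5
    return (F.softmax(scores, dim=-1) @ v).squeeze(1)  # (heads, dim)

# D=1 recovers dense attention; D=16 or D=64 cuts attention FLOPs and the
# touched KV entries by roughly that factor, with no change to the weights.
```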
What carries the argument
Recurrence Augmented Attention (RAT+), which augments standard attention with full-sequence recurrence and active recurrence learning during dense pretraining so that the learned representations transfer to dilated sparse patterns after brief adaptation.
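The review does not reproduce the exact recurrence. As one hedged illustration of the idea only, the sketch below pairs a (possibly dilated) attention output with a gated full-sequence summary, so that dense context reaches every token even when attention is sparsified. The exponential-moving-average form, the sigmoid gate, and all names are assumptions, not the authors' design.

```python
# Hedged sketch of the recurrence-augmentation idea (not the authors' design):
# every token receives a full-sequence recurrent summary alongside attention.
import torch

def recurrence_augmented_block(x, attn_fn, W_g, decay=0.99):
    """x: (T, d); attn_fn: any attention over x (dense or dilated); W_g: (d, d)."""
    h = torch.zeros_like(x[0])
    summaries = []
    for t in range(x.shape[0]):              # assumed EMA-style recurrence over tokens
        h = decay * h + (1.0 - decay) * x[t]
        summaries.append(h)
    r = torch.stack(summaries)               # (T, d): dense context for every position
    gate = torch.sigmoid(x @ W_g)            # assumed learned gate
    return attn_fn(x) + gate * r             # attention output + recurrent summary
```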
If this is right
- Attention FLOPs and KV cache size shrink by the dilation factor D at inference (see the back-of-envelope sketch after this list)
- Accuracy stays within 1 point of dense at D=16 across tested scales
- At D=64 the accuracy drop stays around 2-3 points on reasoning and long-context tasks for the 1.5B model
- Larger models (up to 7.6B) show even smaller relative drops at high dilation
- Hybrid layer- and head-level dilation patterns are supported without separate pretraining
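A back-of-envelope check of the first bullet. The model dimensions below are illustrative assumptions, not the paper's configurations; only the scaling by D is the claim under test.

```python
# Illustrative numbers only; the paper's model configs are not reproduced here.
seq_len, n_layers, n_kv_heads, head_dim, bytes_fp16 = 131_072, 32, 8, 128, 2
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16  # keys + values
for D in (1, 16, 64):
    print(f"D={D:>2}: KV cache ~ {kv_bytes / D / 2**30:.2f} GiB")
# D= 1: KV cache ~ 16.00 GiB
# D=16: KV cache ~ 1.00 GiB
# D=64: KV cache ~ 0.25 GiB
```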
Where Pith is reading between the lines
- Runtime systems could select dilation on the fly according to current compute budget without reloading weights
- The same recurrence pretraining might ease transfer to other sparse patterns such as local windows or block sparsity
- Training for adaptability rather than a fixed sparsity pattern could reduce the total compute spent on model families
- If adaptation length can be shortened further, zero-shot switching between dense and sparse modes may become feasible
Load-bearing premise
Representations built from dense full-sequence recurrence remain effective when attention is later sparsified by dilation after only short adaptation.
What would settle it
Measuring whether a 7.6B RAT+ model loses more than 4 accuracy points on LongBench after switching to D=64 dilation with only the stated 1B-token adaptation.
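Stated as a decision rule (only the 4-point threshold comes from the text above; the function and its inputs are illustrative assumptions):

```python
# Hypothetical pass/fail check for the settling experiment.
def claim_survives(dense_avg: float, dilated_avg: float, budget: float = 4.0) -> bool:
    """True if the 7.6B model at D=64, after the stated 1B-token adaptation,
    loses no more than `budget` LongBench points versus dense."""
    return (dense_avg - dilated_avg) <= budget
```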
Original abstract
Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D=16, and drops by about 2--3 points at D=64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single model is pretrained densely once on 100B tokens and can then be switched at inference to dilated attention patterns (with optional local windows or hybrid layer/head compositions) after only a short 1B-token resolution adaptation, rather than retraining separate sparse models. At 1.5B parameters the method closely matches dense accuracy at D=16 and drops by 2-3 points at D=64 on commonsense reasoning and LongBench; scaling results are also reported for 2.6B and 7.6B models with a claimed 1-point average loss at 64x reduction in attention FLOPs and KV cache.
Significance. If the central empirical claim holds under controlled comparisons, the work provides a practical route to inference-time flexibility in attention sparsity without the cost of training multiple dilated models from scratch. The scaling results to 7.6B parameters and the concrete accuracy numbers at multiple dilation factors indicate potential utility for efficient long-context deployment. The public code release supports reproducibility.
major comments (2)
- [Abstract] The claim that 'directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation' is contrasted with RAT+'s short adaptation, but the abstract does not state whether the direct-sparsification baseline also received the same 1B-token resolution adaptation. This comparison is load-bearing for the central claim that recurrence-augmented pretraining (rather than adaptation itself) enables limited loss and flexible reuse.
- [Experiments] The reported accuracy numbers at 1.5B parameters (close match at D=16, 2-3 point drop at D=64) and the scaling claims at 2.6B/7.6B are presented without detailed ablations isolating the recurrence component or confirming that all baselines received identical adaptation. This leaves the weakest assumption (that recurrence creates representations which transfer to dilated patterns after only short adaptation) only moderately supported.
minor comments (2)
- The description of 'active recurrence learning' during pretraining would benefit from an explicit equation or pseudocode block showing how the recurrence loss is combined with the language-modeling objective; a speculative sketch of one possible form follows this list.
- Tables reporting accuracy across dilation factors should explicitly label the adaptation procedure (or lack thereof) for every compared method to avoid ambiguity.
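Taking up the first minor comment, one possible shape for such a pseudocode block is sketched below. The recurrence term here (an L2 match between the recurrent state and a detached attention summary) and the weight lambda_rec are stand-ins; the paper's actual "active recurrence learning" objective is not specified in this review.

```python
# Speculative sketch only: how an auxiliary recurrence loss *could* be combined
# with the language-modeling objective. The rec_loss form and lambda_rec are
# assumptions, not the paper's formulation.
import torch.nn.functional as F

def combined_objective(logits, targets, rec_state, attn_summary, lambda_rec=0.1):
    """logits: (B, T, V); targets: (B, T) next-token ids, already shifted;
    rec_state, attn_summary: (B, T, d)."""
    lm_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    rec_loss = F.mse_loss(rec_state, attn_summary.detach())  # stand-in recurrence term
    return lm_loss + lambda_rec * rec_loss
```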
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments. We address each major comment below and have made revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that 'directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation' is contrasted with RAT+'s short adaptation, but the abstract does not state whether the direct-sparsification baseline also received the same 1B-token resolution adaptation. This comparison is load-bearing for the central claim that recurrence-augmented pretraining (rather than adaptation itself) enables limited loss and flexible reuse.
Authors: We agree this clarification is necessary to support the central claim. In the experiments, the direct-sparsification baseline (a standard dense pretrained model without recurrence) received the identical 1B-token resolution adaptation before dilated evaluation. Severe degradation persists even after adaptation, while RAT+ shows limited loss. We will revise the abstract to state explicitly that both conditions use the same adaptation protocol. revision: yes
- Referee: [Experiments] The reported accuracy numbers at 1.5B parameters (close match at D=16, 2-3 point drop at D=64) and the scaling claims at 2.6B/7.6B are presented without detailed ablations isolating the recurrence component or confirming that all baselines received identical adaptation. This leaves the weakest assumption (that recurrence creates representations which transfer to dilated patterns after only short adaptation) only moderately supported.
Authors: We acknowledge the value of stronger isolation of the recurrence component. We will add a dedicated ablation subsection comparing RAT+ against a non-recurrent dense model under identical adaptation and dilation settings. We will also state explicitly that all reported baselines (including direct sparsification) used the same 1B-token adaptation, better supporting the transfer claim. revision: yes
Circularity Check
No circularity: empirical results on benchmarks with no self-referential derivations
Full rationale
The paper presents RAT+ as an architectural modification (recurrence-augmented attention) pretrained densely, followed by short adaptation for sparse inference. All reported outcomes are direct empirical measurements of accuracy on commonsense reasoning and LongBench tasks at various dilation factors. No equations, uniqueness theorems, or fitted parameters are shown that reduce the claimed accuracy retention to quantities defined by the inputs themselves. The 1B-token adaptation is an explicit training step whose effect is measured rather than assumed. No self-citation chains are invoked to justify core premises. The claims therefore rest on external benchmark measurements rather than on any self-referential derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- dilation factor D
invented entities (1)
- RAT+ recurrence augmentation (no independent evidence)
Reference graph
Works this paper leans on
- [1] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895-4901, 2023.
- [2] Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
- [3] Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [4] Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- [5] Cohere, T., Ahmadian, A., Ahmed, M., Alammar, J., Alizadeh, M., Alnumay, Y., Althammer, S., Arkhangorodsky, A., Aryabumi, V., Aumiller, D., et al. Command A: An enterprise-ready large language model. arXiv preprint arXiv:2504.00698, 2025.
- [6] Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
- [7] Dao, T. and Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
- [8] De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
- [9] Ding, J., Ma, S., Dong, L., Zhang, X., Huang, S., Wang, W., Zheng, N., and Wei, F. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.
- [10] Dong, J., Feng, B., Guessous, D., Liang, Y., and He, H. Flex Attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024.
- [12] Fu, T., Huang, H., Ning, X., Zhang, G., Chen, B., Wu, T., Wang, H., Huang, Z., Li, S., Yan, S., et al. MoA: Mixture of sparse attention for automatic large language model compression. arXiv preprint arXiv:2406.14909, 2024.
- [14] Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-Infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3991-4008, 2024.
- [15] Hassani, A. and Shi, H. Dilated neighborhood attention transformer. arXiv preprint arXiv:2209.15001, 2022.
- [16] Haviv, A., Ram, O., Press, O., Izsak, P., and Levy, O. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 1382-1390, 2022.
- [17] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [18] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.
- [19] Lai, X., Lu, J., Luo, Y., Ma, Y., and Zhou, X. FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766, 2025.
- [20] Lin, C., Tang, J., Yang, S., Wang, H., Tang, T., Tian, B., Stoica, I., Han, S., and Gao, M. Twilight: Adaptive attention sparsity with hierarchical top-p pruning. arXiv preprint arXiv:2502.02770, 2025.
- [21] Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [22] Lu, E., Jiang, Z., Liu, J., Du, Y., Jiang, T., Hong, C., Liu, S., He, W., Yuan, E., Wang, Y., et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025.
- [23] Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
- [24] Oren, M., Hassid, M., Yarden, N., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs. arXiv preprint arXiv:2401.06104, 2024.
- [25] Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023; Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:230..., 2023.
- [26] Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
- [27] Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to Transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- [28] Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774, 2024.
- [29] Wei, X., Yadav, A., Pascanu, R., and Gulcehre, C. RAT: Bridging RNN efficiency and attention accuracy via chunk-based sequence modeling. arXiv preprint arXiv:2507.04416, 2025.
- [30] Wu, W., Wang, Y., Xiao, G., Peng, H., and Fu, Y. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574, 2024.
- [31] Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., and Sun, M. InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638-119661, 2024; Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks.
- [32] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024; Yang, B., Venkitesh, B., Talupuru, D., Lin, H., Cairuz, D., Blunsom, P., and Locatelli, A. RoPE to NoPE and back again: A new hybrid attention strategy.
- [33] Yang, S., Kautz, J., and Hatamizadeh, A. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.