pith. machine review for the scientific record.

arxiv: 2605.10414 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Remember to Forget: Gated Adaptive Positional Encoding

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords positional encoding · rotary embeddings · attention mechanisms · long-context modeling · transformer architectures · gated networks · sequence modeling · context extension

The pith

GAPE adds content-aware gates to rotary encodings so important distant tokens stay accessible while irrelevant ones lose attention mass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rotary positional encodings lose reliability once sequences exceed training lengths, producing diffuse attention and weak retrieval. GAPE inserts a content-aware bias into attention logits that leaves the underlying rotary geometry unchanged. A query-dependent gate contracts irrelevant context while a key-dependent gate shields salient distant tokens. The approach fits inside standard scaled dot-product attention and is shown to produce sharper attention maps and stronger long-context results on retrieval and benchmark tasks.

Core claim

GAPE augments positional encodings by adding a content-aware bias directly into the attention logits while preserving the rotary geometry. It decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. Protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. GAPE can be implemented within standard scaled dot-product attention.
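
To make the mechanism concrete, a minimal sketch follows. This page does not give GAPE's exact equations, so the functional form below (sigmoid query and key gates, a per-head amplitude, and a penalty that grows linearly with query–key distance and is switched off for protected keys) is an assumption pieced together from the figure captions (Figures 1, 5, 6, and 12), not the paper's notation; the names `gape_bias`, `w_g`, `w_l`, and `gamma` are hypothetical.

```python
import torch

def gape_bias(q, k, w_g, w_l, gamma):
    """Hypothetical content-aware positional bias in the spirit of GAPE.

    q, k     : (seq, d) query / key vectors for one head
    w_g, w_l : (d,) learned projections for the query gate g_i and the
               key "landmark" gate l_j (names assumed, not from the paper)
    gamma    : scalar per-head amplitude

    Returns a (seq, seq) additive bias M with M[i, j] <= 0 for j <= i.
    """
    seq = q.shape[0]
    g = torch.sigmoid(q @ w_g)           # query gate g_i in (0, 1): forgetting rate
    l = torch.sigmoid(k @ w_l)           # key gate l_j in (0, 1): landmark protection
    i = torch.arange(seq).unsqueeze(1)   # query positions
    j = torch.arange(seq).unsqueeze(0)   # key positions
    dist = (i - j).clamp(min=0).float()  # relative distance, 0 for non-causal pairs
    # Penalty grows with distance, scaled by the query gate and suppressed
    # for protected keys; this exact form is an illustrative assumption.
    return -gamma * g.unsqueeze(1) * dist * (1.0 - l.unsqueeze(0))
```

Because the bias is purely additive, the rotary dot product itself is never modified; only the logits it feeds into are reshaped.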

What carries the argument

Query-dependent and key-dependent gates that apply a content-aware bias to attention logits, preserving rotary geometry while modulating distance effects according to token salience.

If this is right

  • Salient tokens at arbitrary distances remain reachable without spurious alignments.
  • Unprotected distant tokens receive attention mass that decays with the query gate value.
  • The method requires only local changes inside existing scaled dot-product attention layers (see the sketch after this list).
  • Empirical results indicate consistently sharper attention and higher scores on long-context and synthetic retrieval benchmarks.
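
A minimal sketch of that locality, reusing the hypothetical `gape_bias` helper above: the only change to a vanilla attention layer is computing the bias and passing it as an additive float mask to PyTorch's `scaled_dot_product_attention`; rotary encoding, softmax, and value mixing are untouched. This illustrates the drop-in claim and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gape_attention(q_rot, k_rot, v, w_g, w_l, gamma):
    """Drop-in use of the gated bias inside standard scaled dot-product attention.

    q_rot, k_rot : (seq, d) rotary-encoded queries / keys (geometry untouched)
    v            : (seq, d) values
    """
    seq = q_rot.shape[0]
    causal = torch.tril(torch.ones(seq, seq)).bool()
    bias = gape_bias(q_rot, k_rot, w_g, w_l, gamma)   # sketch from the core-claim section
    mask = bias.masked_fill(~causal, float("-inf"))   # keep causality
    # A float attn_mask is added to the attention logits before softmax,
    # so the rotary dot product itself is left unchanged.
    out = F.scaled_dot_product_attention(
        q_rot.unsqueeze(0), k_rot.unsqueeze(0), v.unsqueeze(0),
        attn_mask=mask.unsqueeze(0),
    )
    return out.squeeze(0)
```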

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gate mechanism could be combined with other positional schemes to stabilize performance when context lengths grow further.
  • Dynamic adjustment of the gates at inference time might allow models to focus compute on relevant spans without fixed context windows.
  • Similar content-dependent modulation might reduce wasted attention on irrelevant tokens in other transformer components.

Load-bearing premise

The content-aware gates can be trained to distinguish important from unimportant tokens stably and without introducing new biases or instabilities.

What would settle it

If long-context retrieval accuracy or attention sharpness on sequences beyond training length shows no improvement over standard rotary baselines, or if protected tokens lose accessibility in the attention computation, the central claims would fail.

Figures

Figures reproduced from arXiv: 2605.10414 by Alessio Borgi, Christopher Irwin, Mario Severino, Pietro Liò, Riccardo Ali.

Figure 1. Gated Adaptive Positional Encoding. GAPE augments rotary attention with a mask that separates how strongly the context is contracted from which tokens are allowed to survive. For a query at position t_i, the query gate g_i controls the forgetting rate and therefore the effective positional horizon: larger g_i yields sharper suppression of unprotected distant tokens. The key landmark l_j identifies tokens that… view at source ↗
Figure 2. NIAH retrieval under context extrapolation. Models are trained at 2048 tokens and evaluated at 1×, 2×, and 4× context lengths, with the target needle placed close to the query (top row) or far from it (bottom row). Columns compare NoPE (left), RoPE (middle), and p-RoPE (right) against their GAPE-augmented variants, with ALiBi included as a fixed recency-bias baseline. GAPE improves length extrapolation acr… view at source ↗
Figure 3. Mechanistic behavior of the GAPE gates in the NIAH task. Left: landmark gate l_j at layer 2, head 1. The gate produces sparse peaks at the needle positions, marking them as protected and accessible despite surrounding filler and distractor tokens. Right: average attention entropy over layers and heads in the Needle-Far setting. Lower entropy indicates sharper and more concentrated attention. GAPE generally … view at source ↗
Figure 4. OOD perplexity under context extension (mean ± std). We train models at a fixed context length and evaluate them on progressively longer sequences. Across all document-length regimes, perplexity increases once the evaluation context exceeds the training horizon, but GAPE grows more slowly than RoPE and p-RoPE, indicating that the learned structural gate improves length extrapolation… view at source ↗
Figure 5. Right: GAPE learns non-uniform head-wise masks: several heads develop strong biases, while others remain weakly biased. Since M_{i,j} is added directly to the attention logits, this confirms that the mask actively reshapes attention. The full dynamics further show sparse, layer-dependent landmark activations and bounded amplitudes, suggesting that contraction arises from learned query–key interaction rather … view at source ↗
Figure 6. Visualisation of the magnitude-envelope bound. (a) The lower bound on the logit gap a_{i,i} − a_{i,j} grows linearly with the distance Δ = i − j. At short distances, bounded phase fluctuations can still produce spurious alignments; as Δ increases, the structural term dominates and unprotected tokens are progressively suppressed. (b) The minimum gate activation g_min^(k) required to suppress an unprotected token … view at source ↗
Figure 7. Visualisation of the GAPE NIAH retrieval threshold. (a) For an unprotected token at relative distance Δ = i − k, the retrievable region (where semantic similarity can still overcome the penalty) shrinks as Δ grows. Beyond the elimination horizon Δ_elim, the unprotected token is permanently suppressed regardless of its semantic score, and retrieval requires the landmark gate to be activated. (b) Once the seq… view at source ↗
Figure 8. Query-gate behavior in the NIAH setting. Average query-gate value g_i at the final query token for the two attention heads, reported separately for the NEEDLE-FAR and NEEDLE-CLOSE regimes. When the target needle is close to the query, both heads learn substantially larger g_i values, corresponding to stronger contraction of unprotected context. When the target is far, g_i remains near zero, indicating that th… view at source ↗
Figure 9. Comparison with FoX [12] in the synthetic NIAH task. We report retrieval accuracy under context extrapolation for NEEDLE-CLOSE (left) and NEEDLE-FAR (right). NoPE+GAPE remains near-perfect across context lengths, while FoX degrades for close retrieval and collapses near chance when the needle is far. ALiBi shows the expected fixed-recency pattern, succeeding only when the target is close. These results sho… view at source ↗
Figure 10. Comparison with YaRN [12] in the synthetic NIAH task. We report retrieval accuracy under context extrapolation for NEEDLE-CLOSE (left) and NEEDLE-FAR (right). While YaRN reports competitive performance across context lengths, we observe that combining it with GAPE yields further improvements. This suggests that GAPE is highly compatible with rotary interpolation methods. NoPE+GAPE preserves near-perfect r… view at source ↗
Figure 11. Query/key norm distribution by frequency. We report the average norm value across rotary frequency channels for GAPE, p-RoPE, and RoPE. RoPE and p-RoPE develop a pronounced spike around a narrow frequency band, while GAPE produces a more distributed norm profile. Error bars indicate variation across the measured samples… view at source ↗
Figure 12. Evolution of GAPE routing variables during training. We track the mean structural mask M̄_h, landmark activation l̄_h, query-dependent forgetting gate ḡ_h, and head amplitude Γ_h across layers and heads. GAPE learns a heterogeneous routing structure: several heads develop high mask values and act as contractive filters, while a smaller subset preserves non-trivial landmark activations, most prominently in … view at source ↗
Figure 13. Layer 5 attention maps across heads. We compare the realised attention maps of GAPE, p-RoPE, and RoPE across the eight heads of the final layer. RoPE and p-RoPE exhibit comparatively smoother causal patterns, whereas GAPE develops more heterogeneous head-wise structures, with some heads showing sharper contraction and others retaining broader or banded access. This provides a qualitative view of the learn… view at source ↗
read the original abstract

Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious long-range alignments, diffuse attention, and degraded retrieval. Existing remedies only partially address these failures, as they often trade local positional resolution for long-context stability. We propose GAPE (Gated Adaptive Positional Encoding), a drop-in augmentation for positional encodings that introduces a content-aware bias directly into the attention logits while preserving the rotary geometry. GAPE decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. We prove that protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. We further show that GAPE can be implemented within standard scaled dot-product attention. We validate these properties empirically, finding that GAPE consistently yields sharper attention and improved long-context robustness over rotary baselines across both synthetic retrieval and long-context benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Gated Adaptive Positional Encoding (GAPE) as a drop-in augmentation to Rotary Positional Encoding (RoPE) for large language models. GAPE adds content-aware biases to attention logits via a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens, while preserving the underlying rotary geometry. The authors claim to prove that protected tokens remain accessible and that attention mass assigned to unprotected distant tokens decays as a function of the query gate; they further show that GAPE integrates into standard scaled dot-product attention. Empirical results indicate consistently sharper attention maps and improved robustness on synthetic retrieval tasks and long-context benchmarks relative to rotary baselines.

Significance. If the central proof is correct and the reported empirical gains are reproducible, the work offers a conceptually clean way to mitigate RoPE's out-of-distribution phase issues in long sequences without the usual local-resolution trade-offs. The explicit separation of distance-based suppression from token importance, together with the drop-in implementation, could be practically useful for extending context windows in transformer-based models.

minor comments (3)
  1. [Abstract] The specific long-context benchmarks and synthetic retrieval tasks are not named; adding the exact dataset names and sequence lengths would improve clarity for readers scanning the abstract.
  2. [§3] The exact placement of the gates relative to the rotary embedding (before or after the phase rotation) should be stated explicitly in the attention formula to confirm the claimed compatibility with standard scaled dot-product attention.
  3. [Table 2 / Figure 3] The reported attention-sharpness metric lacks a precise definition or formula; a short equation or reference to the computation would make the 'sharper attention' claim easier to interpret (one common definition is sketched below).
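
Regarding the third comment: the page itself only states that lower attention entropy indicates sharper attention (Figure 3 caption). One common definition, sketched here as an assumption rather than the paper's exact metric, is the mean row-wise Shannon entropy of the post-softmax attention weights.

```python
import torch

def attention_entropy(attn):
    """One common 'sharpness' measure: mean Shannon entropy of attention rows.

    attn : (heads, seq, seq) post-softmax attention weights for one layer.
    Lower values mean more concentrated (sharper) attention. Whether this is
    the exact metric the paper reports is an assumption.
    """
    eps = 1e-12
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, seq)
    return row_entropy.mean()
```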

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work on Gated Adaptive Positional Encoding (GAPE) as a drop-in augmentation to RoPE. The recommendation for minor revision is noted, and we appreciate the recognition of the conceptual separation of distance-based suppression from token importance as well as the empirical improvements on long-context tasks. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces GAPE as an additive content-aware bias to rotary attention logits, with an explicit claimed proof that protected tokens remain accessible while unprotected distant mass decays as a function of the query gate, plus a demonstration that the mechanism fits inside standard scaled dot-product attention. These elements are presented as independent mathematical and implementation contributions rather than reductions of fitted parameters or self-citations. No self-definitional equations, predictions that are statistically forced by construction, or load-bearing self-citations appear in the abstract or strongest claims; the empirical validation on synthetic retrieval and long-context benchmarks is treated as separate corroboration. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard transformer attention mechanics plus newly introduced gated components whose parameters are learned from data.

free parameters (1)
  • gate parameters
    Query- and key-dependent gates are parameterized and fitted during training to control the content-aware bias.
axioms (1)
  • standard math: standard scaled dot-product attention framework
    The method assumes the usual attention logit computation and rotary geometry can be augmented without breaking core properties.
invented entities (1)
  • GAPE gates (no independent evidence)
    purpose: Introduce content-aware bias into attention logits to decouple distance and importance
    New components proposed in the paper with no independent evidence outside the work itself.

pith-pipeline@v0.9.0 · 5491 in / 1376 out tokens · 52065 ms · 2026-05-12T04:40:59.864391+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 9 internal anchors

  1. [1] Federico Barbero et al. “Round and round we go! What makes rotary positional encodings useful?” In: arXiv preprint arXiv:2410.06205 (2024).

  2. [2] Shouyuan Chen et al. “Extending context window of large language models via positional interpolation”. In: arXiv preprint arXiv:2306.15595 (2023).

  3. [3] Ta-Chung Chi et al. “Kerple: Kernelized relative positional embedding for length extrapolation”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 8386–8399.

  4. [4] Zihang Dai et al. “Transformer-XL: Attentive language models beyond a fixed-length context”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 2978–2988.

  5. [5] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.

  6. [6] arXiv:2307.08691 [cs.LG].

  7. [7] Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

  8. [8] arXiv:2205.14135 [cs.LG].

  9. [9] Jacob Devlin et al. “BERT: Pre-training of deep bidirectional transformers for language understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 4171–4186.

  10. [10] Yiran Ding et al. “LongRoPE: Extending LLM context window beyond 2 million tokens”. In: arXiv preprint arXiv:2402.13753 (2024).

  11. [11] Aaron Grattafiori et al. The Llama 3 Herd of Models. 2024. arXiv:2407.21783 [cs.AI].

  12. [12] Cheng-Ping Hsieh et al. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: arXiv preprint arXiv:2404.06654 (2024).

  13. [13] Shanda Li et al. “Functional interpolation for relative positions improves long context transformers”. In: arXiv preprint arXiv:2310.04418 (2023).

  14. [14] Zhixuan Lin et al. Forgetting Transformer: Softmax Attention with a Forget Gate. 2025. arXiv:2503.02130 [cs.LG].

  15. [15] Xiaoran Liu et al. “Scaling laws of RoPE-based extrapolation”. In: arXiv preprint arXiv:2310.05209 (2023).

  16. [16] Xin Men et al. “Base of RoPE bounds context length”. In: arXiv preprint arXiv:2405.14591 (2024).

  17. [17] Yui Oka et al. “Frequency Bands in RoPE: Base Frequency and Context Length Shape the Interpolation–Extrapolation Trade-off”. In: The Fourteenth International Conference on Learning Representations.

  18. [18] Yui Oka et al. “Probing Rotary Position Embeddings through Frequency Entropy”. In: The Fourteenth International Conference on Learning Representations.

  19. [19] Team Olmo et al. Olmo 3. 2025. arXiv:2512.13961 [cs.CL].

  20. [20] Guilherme Penedo et al. “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale”. In: The Thirty-eighth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track. 2024.

  21. [21] Bowen Peng et al. “YaRN: Efficient context window extension of large language models”. In: arXiv preprint arXiv:2309.00071 (2023).

  22. [22] Ofir Press, Noah A. Smith, and Mike Lewis. “Train short, test long: Attention with linear biases enables input length extrapolation”. In: arXiv preprint arXiv:2108.12409 (2021).

  23. [23] Colin Raffel et al. “Exploring the limits of transfer learning with a unified text-to-text transformer”. In: Journal of Machine Learning Research 21.140 (2020), pp. 1–67.

  24. [24] Jay Shah et al. “FlashAttention-3: Fast and accurate attention with asynchrony and low-precision”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 68658–68685.

  25. [25] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. “Self-attention with relative position representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018, pp. 464–468.

  26. [26] Jianlin Su et al. “RoFormer: Enhanced transformer with rotary position embedding”. In: Neurocomputing 568 (2024), p. 127063.

  27. [27] Gemma Team et al. Gemma: Open Models Based on Gemini Research and Technology. 2024. arXiv:2403.08295 [cs.CL].

  28. [28] Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing Systems 30 (2017).

  29. [29] Petar Veličković et al. Softmax is not Enough (for Sharp Size Generalisation). 2025. arXiv:2410.01104 [cs.LG].

  30. [30] Davis Wertheimer et al. “Frayed RoPE and Long Inputs: A Geometric Perspective”. In: arXiv preprint arXiv:2603.18017 (2026).

  31. [31] Mingyu Xu et al. “Base of RoPE bounds context length”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 87386–87410.

  32. [32] Ted Zadouri et al. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. 2026. arXiv:2603.05451 [cs.CL].