arXiv: 2512.07805 · v6 · submitted 2025-12-08 · cs.LG · cs.AI · cs.CL

Group Representational Position Encoding

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Pith reviewed 2026-05-17 00:00 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: positional encoding · group actions · RoPE · ALiBi · transformer · long-context models · matrix exponential · unified framework

The pith

GRAPE models positions as group actions on features, recovering RoPE and ALiBi exactly while adding low-cost extensions for cross-feature coupling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GRAPE as a framework that treats positions as elements acting on representation vectors through operations drawn from matrix groups. Multiplicative GRAPE uses the exponential of a scaled skew-symmetric generator to produce relative, norm-preserving rotations that compose cleanly. Additive GRAPE uses low-rank unipotent elements to produce additive logit biases that obey an exact relative law and remain cacheable. Both families recover standard methods as exact special cases: RoPE when the generators act on canonical planes with a log-uniform spectrum, and ALiBi when the unipotent actions are rank-1. The design opens controlled ways to add learned subspaces that couple features across planes at linear or near-linear cost.

Core claim

GRAPE derives positional encodings from group actions. In the multiplicative case, a position n acts via G(n) = exp(n ω L), where L is a rank-2 skew-symmetric generator, yielding a relative, compositional, norm-preserving map in SO(d). RoPE is recovered exactly when d/2 such generators act on the canonical coordinate pairs with a log-uniform spectrum of frequencies. In the additive case, rank-1 or low-rank unipotent actions in GL produce logit biases that recover ALiBi and FoX exactly while preserving the relative law and streaming cacheability. Learned commuting subspaces and compact non-commuting mixtures extend the geometry to capture cross-subspace coupling at O(d) and O(r d) cost per head, respectively.
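A minimal sketch of the multiplicative case, assuming the rank-2 generator has the form L = a bᵀ − b aᵀ and using the Rodrigues-type closed form exp(tL) = I + (sin(ts)/s) L + ((1 − cos(ts))/s²) L² with s² = |a|²|b|² − (a·b)². The function names and the numeric values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def apply_L(a, b, x):
    """Apply L = a b^T - b a^T to x in O(d): L x = a (b.x) - b (a.x)."""
    bx, ax = dot(b, x), dot(a, x)
    return [ai * bx - bi * ax for ai, bi in zip(a, b)]

def grape_rotate(a, b, x, n, omega=1.0):
    """Apply G(n) = exp(n * omega * L) to x using the Rodrigues closed form."""
    # L generates a rotation on span{a, b} with rate s, where
    # s^2 = |a|^2 |b|^2 - (a.b)^2 (requires a, b linearly independent).
    s = math.sqrt(dot(a, a) * dot(b, b) - dot(a, b) ** 2)
    theta = n * omega * s
    Lx = apply_L(a, b, x)
    LLx = apply_L(a, b, Lx)
    c1 = math.sin(theta) / s
    c2 = (1.0 - math.cos(theta)) / (s * s)
    return [xi + c1 * li + c2 * lli for xi, li, lli in zip(x, Lx, LLx)]

# Illustrative values (assumptions, not taken from the paper).
a = [1.0, 2.0, 0.0, 1.0]
b = [0.0, 1.0, 1.0, -1.0]
q = [0.3, -1.0, 2.0, 0.5]
k = [1.5, 0.2, -0.7, 1.0]
omega = 0.1

# Norm preservation: |G(n) q| equals |q|.
rotated = grape_rotate(a, b, q, 7, omega)
norm_gap = abs(math.sqrt(dot(rotated, rotated)) - math.sqrt(dot(q, q)))

# Relative law: <G(m) q, G(n) k> depends only on the offset n - m.
score_abs = dot(grape_rotate(a, b, q, 5, omega), grape_rotate(a, b, k, 12, omega))
score_rel = dot(q, grape_rotate(a, b, k, 12 - 5, omega))
```

Because the generator is applied through two dot products, each call costs O(d) and never forms a d × d matrix.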

What carries the argument

The group action map G(n), realized either as the matrix exponential exp(n ω L) of a rank-2 skew-symmetric generator L, which lies in SO(d), or as a low-rank unipotent element of GL that adds a bias to the logits.
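The additive half can be sketched in the same spirit. The example below assumes a rank-1 nilpotent generator A = −β e_{d+2} e_{d+1}ᵀ acting on vectors lifted to d + 2 dimensions, so that G(m) = I + mA is unipotent. The particular lifts of the query and key (appending `[0, 1]` and `[1, 0]`) are assumptions made for this illustration and may differ from the paper's construction.

```python
def unipotent(m, beta, d):
    """G(m) = I + m * A with A = -beta * e_{d+2} e_{d+1}^T, so A @ A = 0."""
    G = [[1.0 if i == j else 0.0 for j in range(d + 2)] for i in range(d + 2)]
    G[d + 1][d] = -beta * m  # 0-based index of row d+2, column d+1
    return G

def matvec(G, v):
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def additive_logit(q, k, offset, beta):
    """Compute q_aug^T G(offset) k_aug, which equals q.k - beta * offset."""
    q_aug = q + [0.0, 1.0]  # assumed lift of the query into R^{d+2}
    k_aug = k + [1.0, 0.0]  # assumed lift of the key into R^{d+2}
    Gk = matvec(unipotent(offset, beta, len(q)), k_aug)
    return sum(x * y for x, y in zip(q_aug, Gk))

# Illustrative values (assumptions, not taken from the paper).
q, k, beta = [1.0, 2.0], [0.5, -1.0], 0.25

# ALiBi-style bias: the content score q.k = -1.5 is shifted by -beta * 3.
logit = additive_logit(q, k, offset=3, beta=beta)

# Exact relative law: G(2) G(5) = G(7), because A squares to zero.
composed = matmul(unipotent(2, beta, 2), unipotent(5, beta, 2))
direct = unipotent(7, beta, 2)
```

The group law holds without approximation here because the series for exp(mA) terminates after the linear term when A² = 0.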

If this is right

Any choice of group generator produces an encoding that is exactly relative and compositional.
RoPE is recovered precisely when the generators act on fixed coordinate planes with log-uniform spectrum.
ALiBi and FoX arise exactly from rank-1 unipotent actions that add relative logit biases.
Learned commuting subspaces extend the geometry at O(d) cost per head while preserving closed-form evaluation.
Compact non-commuting mixtures allow richer cross-subspace coupling at O(r d) cost per head.
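The RoPE recovery stated above can be illustrated with a short sketch: on canonical coordinate pairs, each rank-2 generator exponentiates to a 2 × 2 planar rotation, and the full map is their block-diagonal product. The base of 10000 is the common RoPE convention and is an assumption here, since this page does not state the value used in the paper.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def rope_frequencies(d, base=10000.0):
    """Log-uniform spectrum: theta_i = base ** (-2 i / d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def rope(x, n, base=10000.0):
    """Rotate each canonical pair (x[2i], x[2i+1]) by the angle n * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(x), base)):
        c, s = math.cos(n * theta), math.sin(n * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out += [c * x0 - s * x1, s * x0 + c * x1]
    return out

# Illustrative query and key vectors (assumptions, not taken from the paper).
q = [0.3, -1.0, 2.0, 0.5, 1.1, -0.4, 0.0, 0.9]
k = [1.5, 0.2, -0.7, 1.0, -0.3, 0.8, 0.6, -1.2]

# Same offset (7) at two different absolute positions gives the same score.
near = dot(rope(q, 3), rope(k, 10))
far = dot(rope(q, 103), rope(k, 110))
```

Each position costs d/2 planar rotations, which is the O(d) per-head figure quoted for commuting subspaces.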

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same group-action lens could be applied to other sequence models that currently use hand-designed positional biases.
Closed-form matrix exponentials for low-rank generators may simplify hardware kernels for very long contexts.
Differences in empirical behavior between RoPE and ALiBi may trace directly to the algebraic properties of their underlying groups.
The framework suggests testing whether optimizing the generator spectrum alone, without learned subspaces, already improves length generalization.

Load-bearing premise

The learned commuting subspaces and non-commuting mixtures will produce useful feature coupling in practice without creating new optimization difficulties.

What would settle it

Train identical long-context language models that differ only in replacing standard RoPE with a GRAPE version using learned non-commuting mixtures, then measure whether perplexity on held-out long sequences improves, stays flat, or degrades.

Figures

Figures reproduced from arXiv: 2512.07805 by Andrew Chi-Chih Yao, Huizhuo Yuan, Kangping Xu, Quanquan Gu, Yang Yuan, Yifan Zhang, Yifeng Liu, Zhen Qin, Zixiang Chen.

**Figure 1.** Overview of the GRAPE Framework. We unify positional encodings via group actions G(n) = exp(nωL). Left: Multiplicative GRAPE recovers RoPE via rank-2 skew generators in SO(d). Right: Additive GRAPE recovers ALiBi and FoX via low-rank nilpotent generators in the unipotent subgroup of GL(d + k) (k = 1 or 2). linear-in-offset logit biases (including content-gated and path-integral forms). This perspective re…

**Figure 2.** The training and validation loss of medium-size models (355M), with different positional …

**Figure 3.** The training and validation loss of large-size models (770M), with different positional …

Original abstract

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n \, \omega \, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GRAPE, a group-theoretic framework for positional encodings that unifies multiplicative rotations in SO(d) (recovering RoPE exactly via rank-2 skew-symmetric generators and matrix exponentials) with additive logit biases from unipotent actions in GL (recovering ALiBi and FoX). It proposes extensions via learned commuting subspaces at O(d) cost and compact non-commuting mixtures at O(r d) cost to capture cross-subspace feature coupling while preserving closed-form exponentials, relative compositionality, and norm preservation.

Significance. If the algebraic recoveries and preservation properties hold, GRAPE supplies a principled Lie-group design space for long-context positional geometry that subsumes existing methods as exact special cases rather than approximations. The exact algebraic identities for RoPE/ALiBi and the emphasis on relative, compositional, streaming-cacheable maps are strengths; however, the practical value of the learned extensions hinges on whether they deliver measurable gains without new optimization issues.

major comments (2)

[Abstract / Multiplicative GRAPE extensions] The abstract asserts that learned non-commuting mixtures 'strictly extend this geometry', and the framing implies that they retain the relative, compositional, norm-preserving map with a 'closed-form matrix exponential'. The manuscript lacks an explicit derivation showing that the effective generator remains skew-symmetric (or that the map stays exactly norm-preserving and compositional) when the subspaces fail to commute. Without this, the unification risks being a reparametrization whose extra degrees of freedom do not guarantee the claimed properties.
[Section on learned commuting subspaces and non-commuting mixtures] The O(d) and O(r d) cost claims, and the statement that these extensions 'capture cross-subspace feature coupling', are presented without optimization analysis or ablation results showing that jointly optimizing the generators L with the model weights causes no performance regressions or added non-convexity.

minor comments (2)

[Multiplicative GRAPE definition] Clarify the precise construction of the rank-2 skew-symmetric generator L and the log-uniform spectrum choice that recovers RoPE as an exact special case (include the relevant equation).
[Additive GRAPE] Add a short remark on how the unipotent actions in Additive GRAPE ensure streaming cacheability is preserved exactly when recovering ALiBi.
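The cacheability point in the second minor comment can be illustrated with a prefix-sum sketch. It assumes the FoX-style bias D_ij = Σ_{l=j+1..i} log f_l for per-step forget gates f_l in (0, 1]; the helper names are hypothetical. With one cached scalar per position, the bias for any query-key pair is a difference of two cached values, and a constant gate f = e^{−β} gives the ALiBi bias −β (i − j).

```python
import math

def prefix_log_gates(gates):
    """Return c with c[0] = 0 and c[t] = log f_1 + ... + log f_t."""
    c = [0.0]
    for f in gates:
        c.append(c[-1] + math.log(f))  # streaming update: one scalar per new token
    return c

def bias(c, i, j):
    """Additive logit bias D_ij = c[i] - c[j] for key position j <= query position i."""
    return c[i] - c[j]

# Constant gate f = exp(-beta): the bias reduces to -beta * (i - j), as in ALiBi.
beta = 0.5
c_const = prefix_log_gates([math.exp(-beta)] * 8)
alibi_like = bias(c_const, 6, 2)

# Data-dependent gates (illustrative values, not from the paper).
c_var = prefix_log_gates([0.9, 0.8, 0.95, 0.7, 0.99, 0.85])
whole = bias(c_var, 5, 1)
pieces = bias(c_var, 5, 3) + bias(c_var, 3, 1)
```

The equality of `whole` and `pieces` is the additive counterpart of the relative composition law: biases accumulate along the path from key to query.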

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the GRAPE framework. We address each major comment below with clarifications grounded in the manuscript's algebraic constructions and indicate the revisions we will make.

Point-by-point responses

Referee: [Abstract / Multiplicative GRAPE extensions] The abstract asserts that learned non-commuting mixtures 'strictly extend this geometry', and the framing implies that they retain the relative, compositional, norm-preserving map with a 'closed-form matrix exponential'. The manuscript lacks an explicit derivation showing that the effective generator remains skew-symmetric (or that the map stays exactly norm-preserving and compositional) when the subspaces fail to commute. Without this, the unification risks being a reparametrization whose extra degrees of freedom do not guarantee the claimed properties.

Authors: We thank the referee for this observation. In the non-commuting mixture construction, each subspace carries its own skew-symmetric generator L_k supported on a low-dimensional block, and the effective generator is the ordinary matrix sum L = sum_k L_k (not a direct sum, since the blocks may overlap). Because skew-symmetric matrices are closed under addition, L remains skew-symmetric even when the individual L_k fail to commute. Consequently exp(n ω L) is orthogonal for all n, which guarantees exact norm preservation. The relative law holds because exp((n+m) ω L) = exp(n ω L) exp(m ω L) for any fixed matrix L (scalar multiples of the same matrix always commute). A closed form remains available because L is still a single fixed skew-symmetric matrix, although exp(n ω L) no longer factors into the product of the individual exp(n ω L_k). We will insert an explicit derivation of these three properties, including the verification that the Lie-algebra closure and the exponential homomorphism are preserved, in the revised manuscript.
revision: yes
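The three properties argued in this response can be checked numerically on a small case. The sketch below uses two rank-2 generators that share an axis and therefore do not commute, and a truncated Taylor series for the matrix exponential; both choices are assumptions made for illustration and are not the paper's construction or evaluation method.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(A, B)]

def scale(c, A):
    return [[c * x for x in r] for r in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def eye(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def expm(A, terms=30):
    """Truncated Taylor series for exp(A); adequate for the small norms used here."""
    out, term = eye(len(A)), eye(len(A))
    for k in range(1, terms):
        term = scale(1.0 / k, matmul(term, A))
        out = add(out, term)
    return out

def rank2_skew(a, b):
    """L = a b^T - b a^T."""
    return [[ai * bj - bi * aj for aj, bj in zip(a, b)] for ai, bi in zip(a, b)]

def max_abs_diff(A, B):
    return max(abs(x - y) for r, s in zip(A, B) for x, y in zip(r, s))

e0, e1, e2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
L1 = rank2_skew(e0, e1)  # generator on the (e0, e1) plane
L2 = rank2_skew(e1, e2)  # generator on the (e1, e2) plane, overlapping L1 on e1
L = add(L1, L2)

commutator_size = max_abs_diff(matmul(L1, L2), matmul(L2, L1))  # nonzero
skew_gap = max_abs_diff(L, scale(-1.0, transpose(L)))           # L = -L^T
G = expm(scale(0.3, L))
orth_gap = max_abs_diff(matmul(transpose(G), G), eye(3))        # G^T G = I
law_gap = max_abs_diff(matmul(expm(scale(0.3, L)), expm(scale(0.2, L))),
                       expm(scale(0.5, L)))                     # one-parameter law
split_gap = max_abs_diff(matmul(expm(L1), expm(L2)), expm(L))   # does not factor
```

The last quantity is the caveat: the sum generator keeps orthogonality and the relative law, but its exponential is not the product of the per-plane rotations.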
Referee: [Section on learned commuting subspaces and non-commuting mixtures] The O(d) and O(r d) cost claims, and the statement that these extensions 'capture cross-subspace feature coupling', are presented without optimization analysis or ablation results showing that jointly optimizing the generators L with the model weights causes no performance regressions or added non-convexity.

Authors: The stated complexities follow directly from the constructions. Commuting subspaces admit simultaneous block-diagonalization, which reduces the per-head work to d/2 independent 2-by-2 rotations, for O(d) total cost. Non-commuting mixtures are represented by a collection of r rank-2 generators whose exponential can be evaluated with matrix-vector operations costing O(r d) each. These are asymptotic operation counts for the forward pass, not training-time claims. The manuscript is primarily a theoretical unification; we therefore did not include joint-optimization ablations. The additional parameters in L are structured and low-dimensional, so their inclusion introduces no non-convexity beyond what is already present in standard transformer training. We will expand the cost derivations with explicit operation counts and add a short discussion of optimization considerations, while noting that comprehensive empirical ablation of training dynamics lies outside the current scope.
revision: partial
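To make the O(r d) operation count concrete, the sketch below applies a sum of r rank-2 generators to a vector without forming any d × d matrix, then approximates exp(tL)x with a truncated Taylor series built from those matrix-vector products. The Taylor truncation is an assumption chosen for simplicity; the paper is described as using a closed form, which this sketch does not reproduce.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def apply_sum(gens, x):
    """y = sum_k (a_k b_k^T - b_k a_k^T) x, in O(r d) time for r = len(gens)."""
    y = [0.0] * len(x)
    for a, b in gens:
        bx, ax = dot(b, x), dot(a, x)
        y = [yi + ai * bx - bi * ax for yi, ai, bi in zip(y, a, b)]
    return y

def expm_apply(gens, x, t, terms=30):
    """Approximate exp(t L) x with a truncated Taylor series of matvecs only."""
    out, term = list(x), list(x)
    for k in range(1, terms):
        term = [t / k * v for v in apply_sum(gens, term)]  # term = (tL)^k x / k!
        out = [o + v for o, v in zip(out, term)]
    return out

# Two overlapping planes in d = 4 (illustrative values, not from the paper).
gens = [([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),
        ([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0])]
x = [0.3, -1.0, 2.0, 0.5]

y = expm_apply(gens, x, 0.7)
norm_gap = abs(math.sqrt(dot(y, y)) - math.sqrt(dot(x, x)))
two_step = expm_apply(gens, expm_apply(gens, x, 0.3), 0.4)
step_gap = max(abs(u - v) for u, v in zip(two_step, y))
```

The total cost of this approximation is O(terms · r · d), so it matches the quoted O(r d) only up to the number of series terms, which grows with the size of t times the norm of L.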

Circularity Check

0 steps flagged

GRAPE derivation is algebraically self-contained with exact recoveries of RoPE and ALiBi as special cases.

full rationale

The paper constructs positional encodings directly from Lie-group actions: G(n) = exp(n ω L) with a skew-symmetric generator L, so that G(n) lies in SO(d) (Multiplicative GRAPE), and rank-1 unipotent actions in GL (Additive GRAPE). RoPE is recovered exactly when the d/2 planes are canonical coordinate pairs with a log-uniform spectrum; ALiBi and FoX are recovered as exact rank-1 unipotent special cases. These identities are algebraic, not data-driven fits. Learned commuting subspaces and non-commuting mixtures are introduced as mathematical extensions that preserve closed-form exponentials and relative compositionality by construction. No load-bearing step reduces to a fitted parameter, a self-citation chain, or an ansatz smuggled in from prior work; the framework is self-contained given standard Lie-group mathematics.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on standard properties of matrix exponentials and group actions; the only free choices are the specific generator L and the choice of subspaces, which are presented as design decisions rather than fitted constants.

free parameters (1)

rank-2 skew-symmetric generator L
Defines the rotation planes and spectrum in Multiplicative GRAPE; chosen to recover RoPE when planes are canonical pairs with log-uniform spectrum.

axioms (2)

standard math: The matrix exponential of a skew-symmetric generator yields an element of SO(d).
Invoked to guarantee norm preservation and closed-form relative positional maps.
standard math: Unipotent actions in GL produce additive logit biases.
Used to recover ALiBi and FoX as exact special cases while preserving relative and streaming properties.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

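$\mathbf{G}(n) = \exp(n\,\omega\,\mathbf{L})$ with $\mathbf{L} = \mathbf{a}\mathbf{b}^\top - \mathbf{b}\mathbf{a}^\top \in \mathfrak{so}(d)$; Rodrigues formula $\exp(\mathbf{L}) = \mathbf{I} + \frac{\sin s}{s}\,\mathbf{L} + \frac{1 - \cos s}{s^2}\,\mathbf{L}^2$; recovers RoPE on canonical planes with a log-uniform spectrum.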
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

Multi-subspace commuting sum $\mathbf{L}_{\mathrm{RoPE}} = \sum_i \theta_i \mathbf{L}_i$, giving a block-diagonal product of planar rotations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
cs.LG 2026-05 conditional novelty 7.0

Jordan-RoPE realizes a non-semisimple relative positional operator that produces coupled oscillatory-polynomial features such as d e^{i omega d} for causal query-key lags.

Venue note from the extracted paper text: Published as a conference paper at ICLR 2026.