Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
Pith reviewed 2026-05-08 17:06 UTC · model grok-4.3
The pith
Non-semisimple Jordan blocks realize a distance-modulated phase basis for relative positional encoding
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The non-semisimple Jordan block structure for the relative translation operator produces oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$ for causal lag $d$. When the block is defective, this exactly realizes the distance-modulated phase basis $d e^{i\omega d}$, and the authors give the exact one-parameter representation, its real block form, and the required contragredient query map.
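The core claim can be checked directly. Below is a minimal NumPy sketch (my own illustration, not code from the paper): raising a defective 2x2 complex Jordan block to the power d gives $\lambda^d$ on the diagonal and $d\,\lambda^{d-1}$ off the diagonal, whose real and imaginary parts are exactly the four oscillatory-polynomial features above.

```python
import numpy as np

# Illustrative parameters (not from the paper): eigenvalue lam = e^{-gamma + i*omega}.
gamma, omega, d = 0.05, 0.3, 7
lam = np.exp(-gamma + 1j * omega)
J = np.array([[lam, 1.0], [0.0, lam]])  # defective: a single eigenvector, nilpotent part N != 0

Jd = np.linalg.matrix_power(J, d)  # relative translation by causal lag d

# Diagonal entry: the pure rotary/decay feature e^{-gamma d} e^{i omega d}.
assert np.allclose(Jd[0, 0], lam**d)

# Off-diagonal entry: d * lam^(d-1), i.e. the distance-modulated phase basis
# d e^{i omega d}, up to the decay factor e^{-gamma d} and the constant 1/lam.
assert np.allclose(Jd[0, 1], d * lam**(d - 1))
feat = d * np.exp(-gamma * d) * np.exp(1j * omega * d) / lam
assert np.allclose(Jd[0, 1], feat)
```

Taking real and imaginary parts of the two entries recovers $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$ and their $d$-multiplied counterparts.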
What carries the argument
The defective complex Jordan block that unites the rotary eigenvalue with the nilpotent response inside the generator of the relative positional operator
Load-bearing premise
The non-semisimple Jordan structure supplies an inductive bias for coupled phase-distance interactions that cannot be obtained by simply combining existing rotary and additive channels.
What would settle it
Whether a direct-sum model that applies RoPE for phase and an additive bias for distance can match Jordan-RoPE on the synthetic task built around distance-modulated phase interactions. If it can, the load-bearing premise fails.
original abstract
Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Jordan-RoPE, a relative positional encoding obtained from non-semisimple representations of complex Jordan blocks. It derives an exact one-parameter operator whose real block form produces coupled oscillatory-polynomial features (e.g., d e^{-γd} cos(ωd), d e^{-γd} sin(ωd)) for causal lag d, formulates the required contragredient query action, and distinguishes this from stabilized variants that break the exact group law. Kernel diagnostics and a synthetic language-modeling task demonstrate utility when the target contains distance-modulated phase interactions; on a small WikiText-103 byte-level model a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The authors frame the contribution as structural rather than a broad performance claim.
Significance. If the non-semisimple Jordan structure supplies an inductive bias for distance-modulated phases that cannot be recovered by direct-sum combinations of existing channels, the construction supplies a principled, parameter-light extension of rotary encodings. The algebraic derivation of the exact operator and its real block form is cleanly executed from the defective Jordan block, and the synthetic task provides direct structural support for the coupled-feature claim. These elements constitute the primary strengths of the work.
major comments (2)
- [§3] §3 (exact Jordan-RoPE construction): the central claim that the defective block realizes a distance-modulated phase basis d e^{iωd} 'rather than merely adding a separate distance channel' is algebraically correct, yet the manuscript provides no expressivity comparison or ablation against a parameterized linear combination of RoPE and an independent distance bias with matching degrees of freedom; the synthetic task in §4.1 matches the Jordan form by construction and therefore does not rule out equivalent logit functions from other parameterizations.
- [§4.2] §4.2 (WikiText-103 experiment): results are reported exclusively for the scaled-exact variant, which §3.3 explicitly states breaks the exact group law; no performance numbers are given for the pure exact Jordan-RoPE operator despite its role as the load-bearing theoretical object, weakening the link between the algebraic construction and the empirical evidence.
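The first major comment can be made concrete with a small sketch (assumptions mine, not part of the manuscript): over a range of causal lags, the coupled Jordan feature $d\cos(\omega d)$ is not a linear combination of the separate phase and distance features that a direct-sum RoPE-plus-bias parameterization exposes, which is why a matched-degrees-of-freedom ablation is the right test.

```python
import numpy as np

# Can d*cos(omega*d) be written as a linear combination of the direct-sum
# features {1, d, cos(omega*d), sin(omega*d)}?  Illustrative omega and lag range.
omega = 0.3
d = np.arange(0, 64, dtype=float)
target = d * np.cos(omega * d)  # coupled Jordan feature

basis = np.stack([np.ones_like(d), d,
                  np.cos(omega * d), np.sin(omega * d)], axis=1)
coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
residual = np.linalg.norm(basis @ coef - target)

# A large residual means the coupled feature lies outside the span of the
# separate channels as *fixed* features; it does not settle what a trained
# direct-sum model can express at the logit level, which is the referee's point.
print(residual)
```

This only shows the features are linearly independent; whether the coupling yields a useful inductive bias is exactly what the requested ablation would measure.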
minor comments (3)
- [§2] §2 (related work): the discussion of group-theoretic views of translation-invariant encodings would benefit from an explicit pointer to the specific representation-theoretic references that motivate the non-semisimple case.
- [Figure 2] Figure 2 (kernel diagnostics): the plotted attention logits would be clearer if the x-axis were labeled with explicit lag values and if the curves for different (ω, γ) pairs were distinguished by line style as well as color.
- [Notation] Notation throughout: the symbol E_p for the positional embedding matrix is introduced without an immediate reminder of its dimension and the precise action of the contragredient map on the query side; a one-line definition or small matrix example would remove ambiguity.
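On the notation comment, the contragredient action can be pinned down in a few lines (my own construction with hypothetical parameter values, offered as the kind of small example the referee requests): applying $P(j)$ to keys and the inverse-transpose $P(i)^{-\top}$ to queries makes the bilinear logit depend only on the relative lag, even when $P$ is non-orthogonal.

```python
import numpy as np

# A non-orthogonal one-parameter positional map P(n) = J^n from a defective block.
lam = np.exp(-0.05 + 0.3j)  # illustrative eigenvalue
J = np.array([[lam, 1.0], [0.0, lam]])
P = lambda n: np.linalg.matrix_power(J, n)  # negative n uses the inverse

rng = np.random.default_rng(0)
q = rng.standard_normal(2) + 1j * rng.standard_normal(2)
k = rng.standard_normal(2) + 1j * rng.standard_normal(2)

i, j = 9, 4                       # positions; causal lag d = i - j = 5
qi = np.linalg.inv(P(i)).T @ q    # contragredient query map P(i)^{-T}
kj = P(j) @ k                     # direct key map
logit = qi @ kj                   # plain bilinear form (no conjugation)

# Relative invariance: (P(i)^{-T} q)^T (P(j) k) = q^T P(i)^{-1} P(j) k
#                                               = q^T P(j - i) k.
assert np.allclose(logit, q @ (P(j - i) @ k))
```

For orthogonal maps such as plain RoPE, $P^{-\top} = P$, which is why the contragredient action is invisible until the positional map becomes non-orthogonal.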
Simulated Author's Rebuttal
We thank the referee for the thorough review and for highlighting both the algebraic strengths of the Jordan-RoPE construction and the areas where additional evidence would strengthen the manuscript. We address each major comment below and describe the revisions we will implement.
point-by-point responses
Referee: [§3] §3 (exact Jordan-RoPE construction): the central claim that the defective block realizes a distance-modulated phase basis d e^{iωd} 'rather than merely adding a separate distance channel' is algebraically correct, yet the manuscript provides no expressivity comparison or ablation against a parameterized linear combination of RoPE and an independent distance bias with matching degrees of freedom; the synthetic task in §4.1 matches the Jordan form by construction and therefore does not rule out equivalent logit functions from other parameterizations.
Authors: We agree that an explicit comparison would better isolate the contribution of the coupled structure. The defective Jordan block produces a specific non-semisimple coupling (the nilpotent term multiplies the oscillatory factor inside the same representation) that is not equivalent to an arbitrary linear combination of independent RoPE and distance channels; the latter would require separate parameters and would not satisfy the same group law. Nevertheless, we acknowledge that the synthetic task is constructed around the Jordan features and therefore cannot alone rule out other parameterizations. In the revision we will add an ablation that compares Jordan-RoPE against a parameterized linear combination of RoPE plus an independent distance bias with matched degrees of freedom, both on the synthetic task and on kernel diagnostics. This will clarify whether the coupled basis supplies an inductive bias beyond what separate channels can achieve. revision: partial
Referee: [§4.2] §4.2 (WikiText-103 experiment): results are reported exclusively for the scaled-exact variant, which §3.3 explicitly states breaks the exact group law; no performance numbers are given for the pure exact Jordan-RoPE operator despite its role as the load-bearing theoretical object, weakening the link between the algebraic construction and the empirical evidence.
Authors: We accept this criticism. The scaled-exact variant was used in the main WikiText-103 runs for numerical stability, since the exact operator’s nilpotent component produces unbounded shear for large lags. To tighten the connection between the theoretical object and the experiments, we will add performance numbers for the pure exact Jordan-RoPE on the synthetic language-modeling task (where distances are controlled) and, where feasible, on the byte-level WikiText-103 model (with appropriate truncation or reduced scale). We will also expand the discussion in §3.3 and §4.2 to explain the stability trade-off and to state explicitly that the scaled variant is a practical approximation rather than a replacement for the exact construction. revision: yes
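The stability trade-off the authors invoke can be seen numerically (a sketch under my own parameter choices, isolating the pure rotary case with $\gamma = 0$): the nilpotent component makes the exact operator's shear grow linearly in the lag, so its norm is unbounded over long contexts, which motivates a bounded scaled variant.

```python
import numpy as np

# Pure rotary defective block: unit-modulus eigenvalue, no decay to damp the shear.
omega = 0.3
lam = np.exp(1j * omega)
J = np.array([[lam, 1.0], [0.0, lam]])

# Off-diagonal magnitude of J^d is |d * lam^(d-1)| = d, since |lam| = 1:
# the shear grows without bound as the lag increases.
shears = [abs(np.linalg.matrix_power(J, d)[0, 1]) for d in (1, 10, 100)]
print(shears)  # grows like d: ~1, ~10, ~100
```

With $\gamma > 0$ the factor $e^{-\gamma d}$ eventually dominates, but for small $\gamma$ the transient growth can still be large at the context lengths used in training.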
Circularity Check
No circularity: Jordan-RoPE is a direct algebraic construction from non-semisimple blocks
full rationale
The paper defines Exact Jordan-RoPE explicitly as the action of a defective complex Jordan block containing both a rotary eigenvalue and a nilpotent component. The resulting features (d e^{-γd} cos(ωd), d e^{-γd} sin(ωd), etc.) are obtained by direct matrix exponentiation or linear action on the lag d, without any parameter fitting, data-dependent calibration, or reduction to a prior result via self-citation. The group-theoretic motivation is external; the central operator is not defined in terms of its own outputs, nor is any uniqueness theorem imported from the authors' previous work. The WikiText and synthetic experiments are downstream evaluations, not part of the derivation chain. This is a self-contained mathematical construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- complex eigenvalue and nilpotent coefficient
axioms (1)
- (standard math) Existence of a non-semisimple one-parameter representation of the additive group of integers
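The single axiom is elementary to verify mechanically (a sketch with illustrative parameters of my own): powers of the defective block define a representation of $(\mathbb{Z}, +)$, i.e. $\rho(a+b) = \rho(a)\rho(b)$, and the representation is non-semisimple because the block is not diagonalizable.

```python
import numpy as np

# rho(n) = J^n for a defective Jordan block J; illustrative eigenvalue.
lam = np.exp(-0.05 + 0.3j)
J = np.array([[lam, 1.0], [0.0, lam]])
rho = lambda n: np.linalg.matrix_power(J, n)

# Group law of (Z, +): rho(a + b) = rho(a) @ rho(b), including negative lags.
assert np.allclose(rho(3 + 4), rho(3) @ rho(4))
assert np.allclose(rho(5 - 2), rho(5) @ rho(-2))

# Non-semisimplicity: J has a repeated eigenvalue but rank(J - lam*I) = 1,
# so there is only one eigenvector and J is not diagonalizable.
assert np.linalg.matrix_rank(J - lam * np.eye(2)) == 1
```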
Reference graph
Works this paper leans on
- [1] Attention Is All You Need. Advances in Neural Information Processing Systems.
- [2] Self-Attention with Relative Position Representations. Proceedings of NAACL-HLT.
- [3] Dai, Zihang; Yang, Zhilin; Yang, Yiming; Carbonell, Jaime; Le, Quoc V.; Salakhutdinov, Ruslan. Transformer-XL.
- [4] RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- [5] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. International Conference on Learning Representations.
- [6] A Length-Extrapolatable Transformer. arXiv:2212.10554.
- [7] Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595.
- [8] Peng, Bowen; Quesnelle, Jeffrey; Fan, Honglu; Shippole, Enrico.
- [9] Ding, Yiran; Zhang, Li Lyna; Zhang, Chengruidong; Xu, Yuanyuan; Shang, Ning; Xu, Jiahang; Yang, Fan; Yang, Mao. LongRoPE.
- [10] Algebraic Positional Encodings. Advances in Neural Information Processing Systems.
- [11] Ostmeier, Sophie; Axelrod, Brian; Varma, Maya; Moseley, Michael; Chaudhari, Akshay S.; Langlotz, Curtis.
- [12] Using Group Theory to Explore the Space of Positional Encodings for Attention. 2026.
- [13] Group Representational Position Encoding. arXiv:2512.07805.
- [14] Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations.
- [15] Diagonal State Spaces are as Effective as Structured State Spaces. Advances in Neural Information Processing Systems.
- [16] Hyena Hierarchy: Towards Larger Convolutional Language Models. International Conference on Machine Learning.
- [17] Pointer Sentinel Mixture Models. International Conference on Learning Representations.
discussion (0)