Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

Tan Lai Ngoc; Tan M. Nguyen; Van-Hoan Trinh; Viet-Hoang Tran; Vinh Khanh Bui

arxiv: 2606.17830 · v1 · pith:R262CX4Bnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 01:35 UTC · model grok-4.3

keywords: functional equivalence, positional encodings, rotary positional encoding, transformers, attention mechanisms, linear mode connectivity, symmetry group, expressivity

The pith

Rotary positional encodings reduce the symmetry group of attention compared to sinusoidal ones, increasing expressivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how positional encodings change functional equivalence among parameters in attention layers. Sinusoidal encodings leave the set of equivalent configurations unchanged from vanilla attention, but rotary encodings break many of those equivalences. The resulting smaller symmetry group means the same number of parameters can realize more distinct functions. This supplies a formal account of why rotary encodings have become increasingly prominent in practice. The work also shows that the degree of linear mode connectivity between trained models depends on which encoding is used.

Core claim

Focusing on the two most widely used variants—sinusoidal and rotary positional encodings—we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.

What carries the argument

The symmetry group of functional equivalence in multi-head attention under a given positional encoding; rotary encodings shrink this group relative to the sinusoidal or vanilla case.
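To make the idea of a shrinking symmetry group concrete, here is a minimal sketch for a single head with a 2-dimensional head size. It is a toy illustration, not the paper's construction: the weights, inputs, positions, and helper names (`score`, `reparameterize`, `gap`) are all assumptions made for the example. Vanilla pre-softmax scores are unchanged when (W_Q, W_K) is replaced by (W_Q M, W_K M^{-T}) for any invertible M. Additive sinusoidal encodings only alter the inputs, so the same cancellation applies. With a rotary rotation between query and key, the score survives only if M commutes with the rotation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def score(x_i, x_j, w_q, w_k, pos_i=0, pos_j=0, theta=0.0):
    """Pre-softmax attention score; theta=0 disables the rotary rotation."""
    q = matmul(matmul([x_i], w_q), rotation(pos_i * theta))[0]
    k = matmul(matmul([x_j], w_k), rotation(pos_j * theta))[0]
    return sum(a * b for a, b in zip(q, k))


def reparameterize(w_q, w_k, m):
    """Map (W_Q, W_K) to (W_Q M, W_K M^{-T})."""
    return matmul(w_q, m), matmul(w_k, transpose(inverse_2x2(m)))


# Hypothetical toy values, chosen only for illustration.
X_I, X_J = [1.0, 2.0], [0.5, -1.0]
W_Q = [[1.0, 0.5], [-0.3, 2.0]]
W_K = [[0.7, -1.2], [1.1, 0.4]]
GENERIC_M = [[2.0, 1.0], [0.0, 1.0]]      # invertible, does not commute with rotations
COMMUTING_M = [[1.5, -0.5], [0.5, 1.5]]   # scaled rotation, commutes with rotations
ROPE = dict(pos_i=3, pos_j=1, theta=0.5)


def gap(m, **rope):
    """Absolute change in the score caused by reparameterizing with m."""
    new_q, new_k = reparameterize(W_Q, W_K, m)
    return abs(score(X_I, X_J, new_q, new_k, **rope)
               - score(X_I, X_J, W_Q, W_K, **rope))


vanilla_gap = gap(GENERIC_M)                  # numerically zero
rope_gap_generic = gap(GENERIC_M, **ROPE)     # clearly nonzero
rope_gap_commuting = gap(COMMUTING_M, **ROPE) # numerically zero
print(vanilla_gap, rope_gap_generic, rope_gap_commuting)
```

In this 2-dimensional toy, the transformations that survive the rotary rotation are the scaled rotations, a much smaller family than all invertible matrices. The paper's actual characterization for multi-head attention should be taken from its theorems rather than from this sketch.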

If this is right

Models using rotary encodings realize a larger set of distinct functions for any fixed parameter budget.
Linear interpolation between independently trained models is more likely to produce high-loss points when rotary encodings are used.
The choice of positional encoding directly controls how many parameter settings map to the same input-output behavior.
Alignment procedures for finding connected modes must account for the encoding type to succeed.
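The interpolation and alignment points above can be sketched on a toy problem. The example below assumes one common definition of the loss barrier (the largest excess of the loss along the straight path over the linear interpolation of the endpoint losses). The paper may use a different one. The loss `toy_loss` is hypothetical, and the brute-force `align_by_permutation` is a stand-in for the paper's alignment algorithm, not a reproduction of it.

```python
from itertools import permutations


def toy_loss(theta):
    """Hypothetical loss with two minima, (1, 3) and (3, 1), related by a swap."""
    a, b = theta
    return ((a - 1) ** 2 + (b - 3) ** 2) * ((a - 3) ** 2 + (b - 1) ** 2)


def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1 - alpha) * p + alpha * q for p, q in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    end_a, end_b = loss_fn(theta_a), loss_fn(theta_b)
    gaps = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        gaps.append(path_loss - ((1 - alpha) * end_a + alpha * end_b))
    return max(gaps)


def align_by_permutation(theta_a, theta_b):
    """Pick the reordering of theta_b that is closest to theta_a."""
    def distance(candidate):
        return sum((x - y) ** 2 for x, y in zip(theta_a, candidate))
    return list(min(permutations(theta_b), key=distance))


model_1 = [1.0, 3.0]
model_2 = [3.0, 1.0]  # same function as model_1 under the swap symmetry

naive_barrier = loss_barrier(toy_loss, model_1, model_2)
matched_barrier = loss_barrier(
    toy_loss, model_1, align_by_permutation(model_1, model_2))
print(naive_barrier, matched_barrier)
```

Naive interpolation passes through the high-loss point (2, 2), while the aligned pair sits in the same minimum and shows no barrier. Which symmetries an alignment step may exploit depends on the positional encoding, which is the dependence the paper studies.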

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The symmetry reduction may flatten or reshape the loss landscape in ways that improve optimization trajectories.
The same analysis could be applied to other positional schemes such as ALiBi to predict their symmetry properties before empirical testing.
Reduced equivalence might also affect the sample complexity needed to reach a given performance level.

Load-bearing premise

The formal analysis of equivalence is limited to the interaction between attention and the two chosen positional encodings and treats vanilla attention symmetries as the relevant baseline.

What would settle it

An explicit enumeration or group-order calculation that finds the same number of functionally distinct parameter configurations under rotary encodings as under sinusoidal encodings would refute the claimed reduction in symmetry group size.

Figures

Figures reproduced from arXiv: 2606.17830 by Tan Lai Ngoc, Tan M. Nguyen, Van-Hoan Trinh, Viet-Hoang Tran, Vinh Khanh Bui.

**Figure 1.** Illustration of Linear Mode Connectivity.

(Linear) Mode Connectivity. One influential perspective on this phenomenon is offered by the concept of mode connectivity (MC) (Frankle, 2020; Keskar et al., 2017; Sagun et al., 2018; Venturi et al., 2019; Neyshabur et al., 2020; Tatro et al., 2020; Yunis et al., 2022; Zhou et al., 2023), which reveals that solutions discovered through independent optimization traj…

**Figure 2.** LMC interpolation plots for ViT on ImageNet-1K (subplots 3 and 4) and GPT-2 on WikiText103 (subplots 1 and 2), with APE and RoPE under first attention layer re-initialization.

Theorem 4.2. Given two MHA-RoPE maps with $h$ and $\bar{h}$ heads, parameterized by $\theta = (W^Q_i, W^K_i, W^V_i, W^O_i)_{i=1}^{h} \in G_{\mathrm{Att}}(d_h, h)$ and $\bar{\theta} = (\bar{W}^Q_i, \bar{W}^K_i, \bar{W}^V_i, \bar{W}^O_i)_{i=1}^{\bar{h}} \in G_{\mathrm{Att}}(d_h, \bar{h})$, respectively. Define $A^0_i := \mathrm{sym}$ …

**Figure 3.** Performance degradation in ViT-Base on ImageNet due to attention reinitialization at different layers.

**Figure 4.** Effect of attention reinitialization on GPT-2 perplexity across layers on WikiText103.

I. Experimental Details and Hyperparameters

Our experiments assess Linear Mode Connectivity (LMC) across a broad spectrum of benchmarks in both vision and natural language processing. The vision suite covers MNIST, CIFAR-10, CIFAR-100, ImageNet-1k, and transfer from ImageNet-21k to smaller classification datasets. For la…

**Figure 5.** Linear Mode Connectivity for ViT on MNIST with 1 layer.

**Figure 6.** Linear Mode Connectivity for ViT on MNIST with 2 layers.

**Figure 7.** Linear Mode Connectivity for ViT on CIFAR-10 with 2 layers.

**Figure 8.** Linear Mode Connectivity for ViT on CIFAR-10 with 4 layers.

**Figure 9.** Linear Mode Connectivity for ViT on CIFAR-10 with 6 layers.

**Figure 10.** Linear Mode Connectivity for ViT on CIFAR-100 with 6 layers.

**Figure 11.** Linear Mode Connectivity for ViT on ImageNet21k → CIFAR-10/100 with 12 layers and 6 heads.

**Figure 12.** Linear Mode Connectivity for ViT on ImageNet with 12 layers.

**Figure 13.** Linear Mode Connectivity for BERT on AGnews with 2 layers.

**Figure 14.** Linear Mode Connectivity for BERT on AGnews with 6 layers.

**Figure 15.** Linear Mode Connectivity for BERT on IMDBreview with 2 layers.

**Figure 16.** Linear Mode Connectivity for BERT on IMDBreview with 6 layers.

**Figure 17.** Linear Mode Connectivity for BERT on DBPedia with 2 layers.

**Figure 18.** Linear Mode Connectivity for BERT on DBPedia with 6 layers.

**Figure 19.** Linear Mode Connectivity for GPT2 on Enwik8 with 12 layers.

**Figure 20.** Linear Mode Connectivity for GPT2 on Wikitext103 with 12 layers.

**Figure 21.** Linear Mode Connectivity for GPT2 on One Billion Words with 12 layers.

**Figure 22.** Linear Mode Connectivity for ViT-RoPE on MNIST with 1 layer.

**Figure 23.** Linear Mode Connectivity for ViT-RoPE on MNIST with 2 layers.

**Figure 24.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 2 layers.

**Figure 25.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 4 layers.

**Figure 26.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 6 layers.

**Figure 27.** Linear Mode Connectivity for ViT-RoPE on CIFAR-100 with 6 layers.

**Figure 28.** Linear Mode Connectivity for ViT-RoPE on ImageNet21k → CIFAR-10/100 with 12 layers and 6 heads.

**Figure 29.** Linear Mode Connectivity for ViT-RoPE on ImageNet with 12 layers.

**Figure 30.** Linear Mode Connectivity for BERT-RoPE on AGnews with 2 layers.

**Figure 31.** Linear Mode Connectivity for BERT-RoPE on AGnews with 6 layers.

**Figure 32.** Linear Mode Connectivity for BERT-RoPE on IMDBreview with 2 layers.

**Figure 33.** Linear Mode Connectivity for BERT-RoPE on IMDBreview with 6 layers.

**Figure 34.** Linear Mode Connectivity for BERT-RoPE on DBPedia with 2 layers.

**Figure 35.** Linear Mode Connectivity for BERT-RoPE on DBPedia with 6 layers.

**Figure 36.** Linear Mode Connectivity for GPT2-RoPE on Enwik8 with 12 layers.

**Figure 37.** Linear Mode Connectivity for Llama on Enwik8 with 12 layers.

**Figure 38.** Linear Mode Connectivity for GPT2-RoPE on Wikitext103 with 12 layers.

**Figure 39.** Linear Mode Connectivity for Llama on Wikitext103 with 12 layers.

**Figure 40.** Linear Mode Connectivity for GPT2-RoPE on OneBillionWord with 12 layers.

J.2. Linear Mode Connectivity for Attention at All Layers

**Figure 41.** Linear Mode Connectivity for ViT on MNIST with 2 layers.

**Figure 42.** Linear Mode Connectivity for ViT on CIFAR-10 with 2 layers.

**Figure 43.** Linear Mode Connectivity for ViT on CIFAR-10 with 4 layers.

**Figure 44.** Linear Mode Connectivity for ViT on CIFAR-10 with 6 layers.

**Figure 45.** Linear Mode Connectivity for ViT on CIFAR-100 with 6 layers.

**Figure 46.** Linear Mode Connectivity for ViT on ImageNet21k → CIFAR-10/100 with 12 layers and 6 heads.

**Figure 47.** Linear Mode Connectivity for ViT with APE and RoPE on ImageNet-1k with 12 layers.

**Figure 48.** Linear Mode Connectivity for BERT on AGnews with 2 layers.

**Figure 49.** Linear Mode Connectivity for BERT on AGnews with 6 layers.

**Figure 50.** Linear Mode Connectivity for BERT on IMDBreview with 2 layers.

**Figure 51.** Linear Mode Connectivity for BERT on IMDBreview with 6 layers.

**Figure 52.** Linear Mode Connectivity for BERT on DBPedia with 2 layers.

**Figure 53.** Linear Mode Connectivity for BERT on DBPedia with 6 layers.

**Figure 54.** Linear Mode Connectivity for GPT2 with APE and RoPE on Enwik8 with 12 layers and 8 heads.

**Figure 55.** Linear Mode Connectivity for GPT2 with APE and RoPE on Wikitext103 with 12 layers and 3 heads.

**Figure 56.** Linear Mode Connectivity for ViT-RoPE on MNIST with 2 layers.

**Figure 57.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 2 layers.

**Figure 58.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 4 layers.

**Figure 59.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 6 layers.

**Figure 60.** Linear Mode Connectivity for ViT-RoPE on CIFAR-100 with 6 layers.

**Figure 61.** Linear Mode Connectivity for ViT-RoPE on ImageNet21k → CIFAR-10/100 with 12 layers and 6 heads.

**Figure 62.** Linear Mode Connectivity for BERT-RoPE on AGnews with 2 layers.

**Figure 63.** Linear Mode Connectivity for BERT-RoPE on AGnews with 6 layers.

**Figure 64.** Linear Mode Connectivity for BERT-RoPE on IMDBreview with 2 layers.

**Figure 65.** Linear Mode Connectivity for BERT-RoPE on IMDBreview with 6 layers.

**Figure 66.** Linear Mode Connectivity for BERT-RoPE on DBPedia with 2 layers.

**Figure 67.** Linear Mode Connectivity for BERT-RoPE on DBPedia with 6 layers.

**Figure 68.** Linear Mode Connectivity for GPT2 on OneBillionWord with 12 layers.

**Figure 69.** Linear Mode Connectivity for Llama on Wikitext103 with 12 layers.

**Figure 70.** Linear Mode Connectivity for ViT with APE and RoPE on CIFAR-10 with 6 layers and 8 heads.

**Figure 71.** Linear Mode Connectivity for ViT with APE and RoPE on CIFAR-100 with 6 layers and 8 heads.

**Figure 72.** Linear Mode Connectivity for ViT with APE and RoPE on ImageNet-1k with 12 layers.

**Figure 73.** Linear Mode Connectivity for BERT with APE and RoPE on AGNews with 6 layers and 8 heads.

**Figure 74.** Linear Mode Connectivity for BERT with APE and RoPE on DBPedia with 6 layers and 8 heads.

**Figure 75.** Linear Mode Connectivity for GPT2 with APE and RoPE on Enwik8 with 12 layers.

**Figure 76.** Linear Mode Connectivity for GPT2 with APE and RoPE on Wikitext103 with 12 layers.

**Figure 77.** Linear Mode Connectivity for GPT2 with APE and RoPE on OneBillionWord with 12 layers.

J.4. Linear Mode Connectivity for Full Model

**Figure 78.** LMC interpolation plots for ViT on ImageNet-1K (subplots 3 and 4) and GPT-2 on WikiText103 (subplots 1 and 2), with APE and RoPE under full Transformer re-initialization.

**Figure 79.** Figure 79: Linear Mode Connectivity for ViT with APE and RoPE on CIFAR-10 with 6 layers and 8 heads 72 [PITH_FULL_IMAGE:figures/full_fig_p072_79.png] view at source ↗

**Figure 80.** Figure 80: Linear Mode Connectivity for ViT with APE and RoPE on CIFAR-100 with 6 layers and 8 heads Model 1 Model 2 2 4 6 8 10 Test Loss Naive Match Model 1 Model 3 0 10 20 30 40 Model 2 Model 3 0 10 20 30 Model 1 Model 2 0.2 0.4 0.6 0.8 Test Accuracy Model 1 Model 3 0.2 0.4 0.6 0.8 Model 2 Model 3 0.2 0.4 0.6 0.8 (a) APE Model 1 Model 2 1.0 1.5 2.0 2.5 Test Loss Naive Match Model 1 Model 3 1.0 1.5 2.0 2.5 3.0 Mode… view at source ↗

**Figure 81.** Figure 81: Linear Mode Connectivity for BERT with APE and RoPE on AGNews with 6 layers and 8 heads Model 1 Model 2 1.0 1.5 2.0 2.5 3.0 Test Loss Naive Match Model 1 Model 3 1 2 3 Model 2 Model 3 1.0 1.5 2.0 2.5 3.0 Model 1 Model 2 0.4 0.5 0.6 0.7 0.8 Test Accuracy Model 1 Model 3 0.4 0.5 0.6 0.7 0.8 Model 2 Model 3 0.4 0.5 0.6 0.7 0.8 (a) APE Model 1 Model 2 0.75 1.00 1.25 1.50 1.75 2.00 Test Loss Naive Match Model … view at source ↗

**Figure 82.** Figure 82: Linear Mode Connectivity for BERT with APE and RoPE on DBPedia with 6 layers and 8 heads Model 1 Model 2 2 4 6 8 Validation Loss Naive WM Model 1 Model 3 2 4 6 8 Model 2 Model 3 2 4 6 Model 1 Model 2 0 20 40 60 80 Validation Accuracy (%) Model 1 Model 3 0 20 40 60 80 Model 2 Model 3 0 20 40 60 80 (a) APE Model 1 Model 2 2 4 6 Validation Loss Naive WM Model 1 Model 3 1 2 3 4 5 Model 2 Model 3 2 4 6 8 Model… view at source ↗

**Figure 83.** Figure 83: Linear Mode Connectivity for ViT with APE and RoPE on ImageNet-1k with 12 layers 73 [PITH_FULL_IMAGE:figures/full_fig_p073_83.png] view at source ↗

**Figure 84.** Linear Mode Connectivity for GPT2 with APE and RoPE on Wikitext103 with 12 layers.

J.5. Ablation study on Head Permutation

We plot 24 head permutations, including the one selected by Stage 1 of our method, with Stage 2 applied after reordering for every permutation. For the 4-head case, this encompasses all possible permutations (4! = 24). For the 8-head case, it includes 23 randomly sampled permutations along w…
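The permutation set described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `candidate_permutations`, the `seed` argument, and the uniform sampling scheme are hypothetical and are not taken from the paper. The sketch enumerates every ordering when there are at most 24 of them and otherwise samples random orderings until, together with the selected one, 24 distinct permutations are collected.

```python
import itertools
import math
import random


def candidate_permutations(num_heads, selected, total=24, seed=0):
    """Return `total` distinct head orderings that always include `selected`.

    `selected` stands in for the ordering chosen by Stage 1 (hypothetical input).
    """
    selected = tuple(selected)
    if math.factorial(num_heads) <= total:
        # Few heads: enumerate every ordering (4 heads give 4! = 24).
        return list(itertools.permutations(range(num_heads)))
    rng = random.Random(seed)
    chosen = {selected}
    while len(chosen) < total:
        perm = list(range(num_heads))
        rng.shuffle(perm)
        chosen.add(tuple(perm))  # the set silently drops repeated samples
    return sorted(chosen)


four_heads = candidate_permutations(4, (2, 0, 3, 1))
eight_heads = candidate_permutations(8, (7, 6, 5, 4, 3, 2, 1, 0))
```

Each returned tuple would then be applied to the second model's heads before the refinement stage, so that the selected ordering can be compared against the others on equal footing.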

**Figure 85.** Linear Mode Connectivity for ViT on CIFAR-10 with 2 layers (all head permutations).

**Figure 86.** Linear Mode Connectivity for ViT on CIFAR-10 with 6 layers (all head permutations).

**Figure 87.** Linear Mode Connectivity for ViT on CIFAR-100 with 6 layers (all head permutations).

**Figure 88.** Linear Mode Connectivity for BERT on IMDBreview with 2 layers (all head permutations).

**Figure 89.** Linear Mode Connectivity for BERT on IMDBreview with 6 layers (all head permutations).

**Figure 90.** Linear Mode Connectivity for BERT on DBPedia with 2 layers (all head permutations).

**Figure 91.** Linear Mode Connectivity for BERT on DBPedia with 6 layers (all head permutations).

**Figure 92.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 2 layers (all head permutations).

**Figure 93.** Linear Mode Connectivity for ViT-RoPE on CIFAR-10 with 6 layers (all head permutations).

**Figure 94.** Linear Mode Connectivity for ViT-RoPE on CIFAR-100 with 6 layers (all head permutations).

**Figure 95.** Linear Mode Connectivity for BERT-RoPE on IMDBreview with 2 layers (all head permutations).

**Figure 96.** Linear Mode Connectivity for BERT-RoPE on IMDBreview with 6 layers (all head permutations).

**Figure 97.** Linear Mode Connectivity for BERT-RoPE on DBPedia with 2 layers (all head permutations).

**Figure 98.** Linear Mode Connectivity for BERT-RoPE on DBPedia with 6 layers (all head permutations).

K. Ablation of Distance Metrics for Head Matching

As described in the previous sections, attention-head matching between two transformer models is formulated as a bipartite assignment problem and solved using the Hungarian algorithm. Specifically, a cost matrix is constructed by computing pairwise distances between hea…
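A minimal sketch of the cost-matrix formulation follows, under several assumptions that are not from the paper: each head is summarized by a single small matrix, the distance is the squared Frobenius distance, and the helper names (`frobenius_sq`, `cost_matrix`, `match_heads`) are hypothetical. To stay dependency-free, the assignment is solved by exhaustive search, which returns the same minimum-cost matching the Hungarian algorithm would find but only scales to a handful of heads; in practice a routine such as `scipy.optimize.linear_sum_assignment` would take its place.

```python
import itertools


def frobenius_sq(a, b):
    """Squared Frobenius distance between two equally shaped matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def cost_matrix(heads_a, heads_b, dist=frobenius_sq):
    """cost[i][j] = distance between head i of model A and head j of model B."""
    return [[dist(a, b) for b in heads_b] for a in heads_a]


def match_heads(cost):
    """Exact minimum-cost assignment by exhaustive search (small h only)."""
    n = len(cost)
    best = min(
        itertools.permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)  # best[i] is the head of B matched to head i of A


# Toy data: model B is a shuffled copy of model A's three heads.
heads_a = [[[1.0, 0.0]], [[0.0, 1.0]], [[2.0, 2.0]]]
heads_b = [heads_a[2], heads_a[0], heads_a[1]]
assignment = match_heads(cost_matrix(heads_a, heads_b))
```

Swapping the `dist` argument for another metric is how a distance-metric ablation of this kind could be organized, since the assignment step itself does not change.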

Original abstract

Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants, sinusoidal and rotary positional encodings (RoPE), we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.
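The contrast drawn in the abstract can be illustrated with a toy numerical check. The sketch below is not the paper's construction: it assumes a single 2-dimensional head, a single rotary frequency, a row-vector convention in which the RoPE logit is `(x W_Q) R (y W_K)^T` for a relative rotation `R`, and hypothetical weights and inputs. It applies the reparameterization `(W_Q, W_K) -> (W_Q M, W_K M^{-T})`, which leaves the product `W_Q W_K^T` and hence the vanilla logit unchanged for any invertible `M`, and checks whether the RoPE logit also survives.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def score(x, y, wq, wk, rel=None):
    """Attention logit (x W_Q) R (y W_K)^T; R is the identity without RoPE."""
    r = rel if rel is not None else [[1.0, 0.0], [0.0, 1.0]]
    q, k = matmul([x], wq), matmul([y], wk)
    return matmul(matmul(q, r), transpose(k))[0][0]


def transform(wq, wk, m):
    """(W_Q, W_K) -> (W_Q M, W_K M^{-T}); keeps W_Q W_K^T fixed."""
    return matmul(wq, m), matmul(wk, transpose(inverse2(m)))


# Hypothetical weights, inputs, and relative rotation.
W_Q, W_K = [[1.0, 2.0], [0.5, -1.0]], [[0.3, 1.0], [-2.0, 0.7]]
x, y, R = [1.0, -1.0], [0.5, 2.0], rotation(0.7)

sheared = transform(W_Q, W_K, [[1.0, 1.0], [0.0, 1.0]])  # generic invertible M
rotated = transform(W_Q, W_K, rotation(1.1))             # M commutes with R

vanilla_gap = abs(score(x, y, W_Q, W_K) - score(x, y, *sheared))
rope_gap_shear = abs(score(x, y, W_Q, W_K, R) - score(x, y, *sheared, R))
rope_gap_rot = abs(score(x, y, W_Q, W_K, R) - score(x, y, *rotated, R))
```

In this toy setting the shear preserves the vanilla logit but changes the RoPE logit, while a rotation (which commutes with `R`) preserves both. That is one concrete way in which a rotary encoding can shrink the set of function-preserving reparameterizations; the paper's theorems are stated for the general multi-head case and are not reproduced here.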

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Paper extends attention symmetry analysis to positional encodings and ties RoPE's smaller symmetry group to expressivity and mode connectivity, but isolates the module from the rest of the Transformer.

The main new element is the direct comparison of sinusoidal versus rotary positional encodings on the functional equivalence group of attention, plus the empirical check via alignment algorithm showing that linear mode connectivity varies with the encoding choice. Sinusoidal keeps the vanilla attention symmetries while RoPE shrinks them, which the authors present as a reason for its practical edge.

This is a straightforward extension of existing symmetry work on multi-head attention. The specific contrast between the two encodings is timely and the link to mode connectivity gives it an applied angle that prior papers in this line did not have. The alignment procedure looks like a usable tool for testing connectivity claims.

The soft spot is scope. Everything formal stays inside the attention block; the stress-test note is correct that layer norms, residuals, and FFNs can add or remove identifications, so the claimed symmetry reduction for RoPE may not survive in the full architecture used for the connectivity experiments. Without the actual equations it is also difficult to gauge how tight the equivalence proofs are, though the abstract indicates they exist. The experiments appear to test full models, but the theory-experiment gap on interactions is real.

The paper is aimed at people already working on Transformer symmetries or linear mode connectivity. A reader in that niche would get a clear, testable comparison and a new empirical angle. It shows honest engagement with the literature and no circular definitions or invented quantities.

Recommendation: send to peer review. The claims are narrow enough to be checked and the empirical piece is worth referee scrutiny even if the full-model implications need more work.

Referee Report

2 major / 2 minor

Summary. The paper provides a formal analysis of functional equivalence (symmetries) in multi-head attention, claiming that sinusoidal positional encodings preserve the equivalence structure of vanilla attention while rotary encodings (RoPE) reduce the size of the symmetry group and thereby increase expressivity. It further studies the consequences for linear mode connectivity (LMC) across Transformer variants and introduces an alignment algorithm to empirically demonstrate that connectivity patterns depend on the choice of positional encoding.

Significance. If the central claims hold, the work supplies a symmetry-based explanation for the empirical preference for RoPE and directly links architectural symmetries to the practical phenomenon of LMC. The combination of a formal symmetry analysis with an alignment-based empirical study of connectivity is a strength; the paper also supplies reproducible code for the alignment procedure.

major comments (2)

[Formal analysis of attention with positional encodings] Formal analysis section (attention module only): the equivalence derivations and symmetry-group comparison are performed on the isolated attention block. The LMC experiments, however, are run on complete Transformer stacks that include layer norms, residual connections, and position-wise FFNs. No argument or ablation is given showing that the claimed reduction in symmetry group for RoPE survives these additional components; if the reduction is sensitive to the couplings, the headline claim that RoPE enhances expressivity via a smaller symmetry group does not necessarily transfer to the models studied in the connectivity experiments.
[Linear mode connectivity experiments] LMC experiments and alignment algorithm: the paper asserts that connectivity 'crucially depend[s] on the positional encoding,' yet the reported results compare only sinusoidal versus RoPE without a controlled ablation that isolates the symmetry reduction from other architectural differences (e.g., different initialization or training schedules). A direct test—e.g., measuring connectivity after explicitly breaking or restoring the identified symmetries—would be needed to establish the causal link.
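For readers unfamiliar with how connectivity is measured, the following toy sketch shows one common way to compute a loss barrier along a linear interpolation path, and why aligning symmetric parameters first matters. Everything here is an assumption for illustration: the two-unit ReLU network, the dataset, the barrier definition (maximum gap between the interpolated loss and the linear interpolation of the endpoint losses), and the function names are not taken from the paper.

```python
def relu(z):
    return max(z, 0.0)


def predict(params, x):
    w, v = params
    return sum(vj * relu(wj * x) for wj, vj in zip(w, v))


def loss(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def interpolate(p0, p1, t):
    """Parameter-wise (1 - t) * p0 + t * p1."""
    return tuple([(1 - t) * a + t * b for a, b in zip(u0, u1)]
                 for u0, u1 in zip(p0, p1))


def barrier(p0, p1, data, steps=11):
    """Max over t of loss(interp) minus the interpolated endpoint losses."""
    l0, l1 = loss(p0, data), loss(p1, data)
    gaps = []
    for i in range(steps):
        t = i / (steps - 1)
        gaps.append(loss(interpolate(p0, p1, t), data) - ((1 - t) * l0 + t * l1))
    return max(gaps)


def permute_units(params, perm):
    w, v = params
    return ([w[j] for j in perm], [v[j] for j in perm])


data = [(x, abs(x)) for x in (-2.0, -1.0, 0.5, 1.5)]
model_a = ([1.0, -1.0], [1.0, 1.0])   # computes |x|
model_b = ([-1.0, 1.0], [1.0, 1.0])   # same function, hidden units swapped

naive = barrier(model_a, model_b, data)
matched = barrier(model_a, permute_units(model_b, [1, 0]), data)
```

Here the two models realize the same function, yet naive interpolation passes through a poor midpoint, whereas interpolation after undoing the permutation shows no barrier. The interventional test suggested in the comment above would apply the same measurement after deliberately breaking or restoring a symmetry.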

minor comments (2)

Notation for the symmetry group and equivalence relation is introduced without a compact summary table; a single table listing the generators or orbit sizes for vanilla, sinusoidal, and RoPE cases would improve readability.
[Abstract] The abstract states that the study is 'comprehensive,' yet only two positional-encoding families are treated; a brief discussion of why other common variants (e.g., ALiBi, learned embeddings) fall outside the scope would clarify the scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that the formal analysis targets the attention module and that the LMC results rely on comparisons between two positional-encoding families. We address both points below and will revise the manuscript accordingly.

Referee: [Formal analysis of attention with positional encodings] Formal analysis section (attention module only): the equivalence derivations and symmetry-group comparison are performed on the isolated attention block. The LMC experiments, however, are run on complete Transformer stacks that include layer norms, residual connections, and position-wise FFNs. No argument or ablation is given showing that the claimed reduction in symmetry group for RoPE survives these additional components; if the reduction is sensitive to the couplings, the headline claim that RoPE enhances expressivity via a smaller symmetry group does not necessarily transfer to the models studied in the connectivity experiments.

Authors: The symmetry analysis is performed on the attention block because that is the locus where positional encodings modify the functional equivalences between parameter configurations. Layer normalization, residual connections, and position-wise FFNs are equivariant with respect to the same parameter transformations (permutations for sinusoidal encodings, rotations for RoPE) that leave the attention output unchanged; consequently the reduced symmetry group identified for RoPE is expected to carry over to the full stack. We will add a short paragraph and a supporting remark in the revised manuscript making this invariance explicit and noting that an exhaustive ablation of every coupling lies outside the current scope. revision: yes
Referee: [Linear mode connectivity experiments] LMC experiments and alignment algorithm: the paper asserts that connectivity 'crucially depend[s] on the positional encoding,' yet the reported results compare only sinusoidal versus RoPE without a controlled ablation that isolates the symmetry reduction from other architectural differences (e.g., different initialization or training schedules). A direct test—e.g., measuring connectivity after explicitly breaking or restoring the identified symmetries—would be needed to establish the causal link.

Authors: All models in the LMC study share identical initialization distributions, optimizer settings, and training schedules; the sole controlled difference is the positional-encoding mechanism whose symmetry groups were derived in the formal section. The alignment procedure is constructed precisely to factor out the equivalences that remain under each encoding, and the observed connectivity patterns track the predicted group sizes. While an explicit symmetry-breaking intervention would constitute stronger causal evidence, such an experiment is not present in the current manuscript. We will expand the discussion to clarify the controls that were applied and to acknowledge that a direct interventional test remains future work. revision: partial

Circularity Check

0 steps flagged

Formal analysis of positional encoding symmetries is self-contained with no detected circularity

The paper conducts a direct formal study of functional equivalence under sinusoidal and rotary positional encodings, comparing them to the vanilla attention baseline. No equations, fitted parameters, or predictions are shown that reduce by construction to the inputs (e.g., no self-definitional scaling, no fitted-input-called-prediction, no load-bearing self-citation chains). The central claims rest on explicit symmetry analysis rather than renaming or smuggling ansatzes. The empirical linear mode connectivity experiments are presented as separate demonstrations, not as forced outputs of the theoretical part. This matches the default expectation of non-circularity for theoretical architecture studies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The analysis implicitly assumes standard definitions of functional equivalence from prior attention literature.

pith-pipeline@v0.9.1-grok · 5708 in / 1023 out tokens · 29960 ms · 2026-06-27T01:35:18.048880+00:00 · methodology


[21]

Section 1 provides an introduction and related work on Linear Mode Connectivity. Related concepts, such as functional equivalence and alignment methods, are also introduced in connection with prior literature
[22]

Section 2 reviews vanilla attention, including its parameter space, its symmetry group, and a result from the literature (Theorem 2.1) that establishes complete functional equivalence for vanilla attention.
[23]

Section 3 examines how positional encodings may alter the internal structure of attention, thereby rendering the analysis from the vanilla case no longer directly applicable. While absolute PEs of the additive type do not affect the structure, relative PEs (with particular emphasis on Rotary PE) fundamentally change the attention mechanism. The correspond...
[24]

Section 4 focuses primarily on the RoPE case. First, we extend the RoPE setting to a general attention formulation that accommodates all cases of interest. In this formulation, the similarity score between two tokens at their specific positional indices is expressed as a bilinear form or quadratic norm. The result on functional equivalence of this setting...
[25]

Section 5 introduces an alignment method that serves as a tool for examining linear mode connectivity (LMC) in attention-based models. We propose a two-stage alignment algorithm for multi-head attention layers, applicable to both standard MHA and MHA with RoPE. The first stage matches the ordering of attention heads between two models by solving a linear ...
[26]

Section 6 examines LMC under four re-initialization strategies, with emphasis on the first attention layer and full model resets, while intermediate cases are reported in the Appendix. Experiments are conducted across diverse Vision and NLP tasks. Ablation studies confirm the effectiveness of the two-stage matching algorithm in reducing barriers: Ablation...
[27]

Section 7 summarizes our findings, discusses limitations, and outlines future directions.

Appendix. The appendices provide complete proofs of the theoretical results in the main paper, the proposed matching algorithms, as well as additional experimen...
[28]

Appendix B formally defines the attention mechanism and its parameter space, followed by a description of how positional encodings are incorporated into attention
[29]

Appendix C briefly describes the symmetry structures of vanilla attention, attention with absolute PEs, and attention with relative PEs (with emphasis on RoPE)
[30]

Appendix D introduces the general attention formulation. Theorem D.1, which is Theorem 4.1 in the main paper, establishes the functional equivalence of this general setting. The proof can be sketched as follows: starting from the softmax operator, we multiply through the denominators to rewrite the expression as an exponential polynomial, and then apply r...
[31]

Appendix F applies the functional equivalence analysis of the general attention case to the specific setting of RoPE. Theorem F.1, corresponding to Theorem 4.2 in the main paper, provides the full details of this analysis. The proof proceeds as follows: RoPE is first reformulated as a special case of the general attention formulation via reparameterizatio...
[32]

Appendix J.1 reports experiments on re-initializing only the first attention layer, highlighting its dominant role in shaping early representations
[33]

Appendix J.2 investigates re-initialization of all attention layers, showing the cumulative effect of disrupting contextual interactions across the network
[34]

Appendix J.3 studies re-initialization of the first Transformer layer, coupling attention and its adjacent feedforward block to examine early-layer sensitivity
[35]

Appendix J.4 evaluates the most extreme setting where the entire Transformer is re-initialized, quantifying the magnitude of barriers introduced by full resets.
[36]

Appendix J.5 presents ablation studies on head permutation, including the two-stage matching algorithm. Stage 1 demonstrates the necessity of optimal head alignment for preserving linear mode connectivity, while Stage 2 leverages gradient refinement to further reduce interpolation barriers. The experimental findings indicate that linear mode connectivity ...

[37]

Every $d \times d_h$ matrix appearing in $\theta$ and $\bar{\theta}$ has full column rank $d_h$.
[38]

The $h$ matrices $\{W_i^Q (W_i^K)^\top\}_{i=1}^{h}$ are pairwise distinct, and the $\bar{h}$ matrices $\{\bar{W}_i^Q (\bar{W}_i^K)^\top\}_{i=1}^{\bar{h}}$ are pairwise distinct. If the two MHA maps are identical, then $h = \bar{h}$, and there exists $g \in G_{\mathrm{Att}}(d_h, h)$ such that $\bar{\theta} = g\theta$. Remark C.2. While the theorem imposes certain assumptions on the parameters of the MHA maps, it is important to emphasize that these condit...
[39]

(Relative positional encoding assumption.) For all $m, n \geq 1$ and for all shifts $k \geq 0$, we assume $A^{m,n} = A^{m+k,n+k}$ (Equation 33). This corresponds to the natural stationarity condition imposed by relative positional encodings.
[40]

(Diagonal self-similarity terms are symmetric.) For each $m \geq 1$, the matrix $A_i^{m,m}$ parameterizes the function $f$ that computes the similarity score of the $m$-th token with itself at the $i$-th head, namely $x_m A_i^{m,m} x_m^\top$. Since every quadratic form corresponds uniquely to a symmetric matrix, we may, without loss of generality, symmetrize $A_i^{m,m}$: $\operatorname{sym}(A_i^{m,m}) :=$...
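The two assumptions above can be checked numerically on a toy rotary example. This is a sketch under assumptions that are not from the paper: a single 2-dimensional head, one hypothetical rotary frequency `THETA`, hypothetical weights, and a position-dependent matrix modelled as `A(m, n) = W_Q R_m R_n^T W_K^T`. The source's own definition of `sym` is cut off, so the sketch assumes the standard symmetrization `(A + A^T) / 2`.

```python
import math


def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


THETA = 0.3                        # hypothetical rotary frequency
W_Q = [[1.0, 2.0], [0.5, -1.0]]    # hypothetical weights
W_K = [[0.3, 1.0], [-2.0, 0.7]]


def A(m, n):
    """Bilinear-form matrix for positions (m, n): W_Q R_m R_n^T W_K^T."""
    rel = matmul(rotation(m * THETA), transpose(rotation(n * THETA)))
    return matmul(matmul(W_Q, rel), transpose(W_K))


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def sym(a):
    """Standard symmetrization (A + A^T) / 2 (assumed definition)."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)]
            for ra, rb in zip(a, transpose(a))]


def quad(x, a):
    return matmul(matmul([x], a), transpose([x]))[0][0]


# Stationarity: shifting both positions by k leaves the matrix unchanged.
shift_gap = max(max_abs_diff(A(m, n), A(m + k, n + k))
                for m in range(1, 4) for n in range(1, 4) for k in range(4))

# Symmetrizing a diagonal term does not change the quadratic form.
x = [1.5, -0.5]
quad_gap = abs(quad(x, A(2, 2)) - quad(x, sym(A(2, 2))))
```

Both gaps come out at floating-point noise level, consistent with the stationarity condition and with the claim that symmetrizing the diagonal terms loses no generality.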
[41]

Preliminary setup. We first record some initial observations and introduce the necessary notation in preparation for the proof. In particular, we note that it suffices to show that at least one of the coefficients $B_i$ must vanish. Once this is established, symmetry in the construction allows us to conclude that in fact all $B_i$ must be equal to zero, thereby p...
[42]

Structural constraints on the $B_i$. By applying the above linear independence principle, we identify a fundamental structural constraint on the coefficients $B_i$. Specifically, the symmetry conditions imposed by the $A_i^{k,t}$ on admissible permutations force the $B_i$ to satisfy a family of linear relations indexed by $i \in [h]$. These constraints form the core of the a...
[43]

Partition-based refinement. We next examine the equalities that occur within the sets of $h$ elements $\{A_i^{k,t}\}_{i=1}^{h}$. This step is preparatory: it shows that the relations identified in the previous step are not only necessary but also sufficient to deduce that at least one $B_i$ must vanish. The analysis exploits the partition structure $\{U_p\}$, together with the...
[44]

Conclusion. Finally, we combine the above ingredients to conclude the proof. The linear relations obtained in Step 3, when applied to the partition refinement of Step 4, imply that one of the $B_i$'s must equal zero. By the initial reduction in Step 1, this suffices to deduce that in fact all $B_i = 0$. This completes the proof of the theorem. We proceed to present...
[45]

For all $t \in S$, the partition $\{U_p^t\}_{p=1}^{\alpha_t}$, defined in Step 4, coincides with $\{U_p\}_{p=1}^{\alpha}$. In particular, by reindexing the head indices, we may assume $U_1 = \{1, \ldots, m\}$. This guarantees that the structure of the partition is stable across infinitely many $t \in S$, providing us with a consistent reference framework.
[46]

For all $t_i$ with $i \in [\gamma]$, where $\gamma < m$, recall that $V^{t_i} = U^{t_i}(1) \cap \{1, \ldots, m\}$. One can select $\gamma$ head indices $v_i \in V^{t_i}$ such that they are pairwise distinct. This property will be crucial later when we need to ensure that certain representatives can be chosen without overlap. We also recall the main result from Step 3, namely Equation (57): for any (s_1, ....
[47]

If $j = v_i$ for some $i \in T$, then set $s_j = s_{v_i} = t_i$. In other words, the positions corresponding to $T$ are aligned with the distinguished token indices $t_i$.
[48]

If $j \in \{1, \ldots, m\} \setminus \{v_i : i \in T\}$, take $s_j$ to be an arbitrary element of $S$. This ensures consistency with the partition structure while leaving us flexibility in the assignment.
[49]

If $j \in U_p$ for some $2 \le p \le \alpha$, then take $s_j$ to be an arbitrary element of $S$. Again, this choice respects the partitioning of indices into classes $U_p$. For the chosen $(s_1, \ldots, s_h) \in [L]^h$, we analyze which $\sigma \in S_h$ satisfy the condition $A_j^{k,s_j} = A_{\sigma(j)}^{k,s_j}$ for all $j \in [h]$. We make the following observations, case by case:
[50]

For $j \in U_2 \sqcup U_3 \sqcup \cdots \sqcup U_\alpha$, say $j \in U_p$ with $2 \le p \le \alpha$, the condition $A_j^{k,s_j} = A_{\sigma(j)}^{k,s_j}$ implies $\sigma(j) \in U_p$. Hence $\sigma(U_2 \sqcup U_3 \sqcup \cdots \sqcup U_\alpha) = U_2 \sqcup U_3 \sqcup \cdots \sqcup U_\alpha$, (69) and consequently $\sigma(U_1) = U_1$. In particular, if $j \in U_1$, then $\sigma(j) \in U_1$.
[51]

For $j \in \{1, \ldots, m\} \setminus \{v_i : i \in T\}$, if $A_j^{k,s_j} = A_{\sigma(j)}^{k,s_j}$, then necessarily $\sigma(j) \in U_1 = \{1, \ldots, m\}$. Thus the entire set $U_1$ is stable under $\sigma$, but the specific images of these indices may vary within $U_1$.
[52]

For $j = v_i$ with $i \in T$, if $A_j^{k,s_j} = A_{\sigma(j)}^{k,s_j}$, then $\sigma(j) \in U^{s_{v_i}}(1) = U^{t_i}(1)$. From the previous point, we also know $\sigma(j) \in U_1$. Taken together, these conditions imply that $\sigma(j) \in V^{t_i} = U^{t_i}(1) \cap U_1$. In other words, the image of $v_i$ under $\sigma$ is constrained to lie inside the restricted set $V^{t_i}$. Therefore, specifying a $\sigma \in S_h$ that satisfies $A_j^{k,s_j} = A_{\sigma(j)}^{k,s_j}$ for all j∈...
[53]

For each $j = v_i$ with $i \in T$, choosing $\sigma(j) = \sigma(v_i) \in V^{t_i}$,
[54]

For each $j \in \{1, \ldots, m\} \setminus \{v_i : i \in T\}$, choosing $\sigma(j) \in U_1 \setminus \{\sigma(v_i) : i \in T\}$ arbitrarily,
[55]

For each $j \in U_p$ with $2 \le p \le \alpha$, choosing $\sigma(j) \in U_p$. In conclusion, the structure of admissible permutations $\sigma$ in Equation (67) is fully determined by the subset $T \subset [\gamma]$ and the representatives $v_i \in V^{t_i}$ chosen in Step 4. This description clarifies how the constraints arising from the partition classes $U_p$ and the distinguished representatives $v_i$ together restrict the a...
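The case analysis above describes which permutations $\sigma \in S_h$ satisfy $A_j^{k,s_j} = A_{\sigma(j)}^{k,s_j}$ for all $j$. A brute-force enumeration makes the effect of the choice $(s_1, \ldots, s_h)$ concrete. This is a toy sketch, not the paper's construction: heads are zero-indexed, and `labels[t][i]` is a hypothetical hashable stand-in for the matrix $A_i^{k,t}$, with equal labels meaning equal matrices.

```python
from itertools import permutations

def admissible_permutations(labels, s):
    """Return all sigma with labels[s[j]][j] == labels[s[j]][sigma[j]] for every head j."""
    h = len(s)
    result = []
    for sigma in permutations(range(h)):
        if all(labels[s[j]][j] == labels[s[j]][sigma[j]] for j in range(h)):
            result.append(sigma)
    return result

# Made-up data: at position 0, heads 0 and 1 coincide; at position 1, they differ.
labels = {
    0: ["a", "a", "b"],
    1: ["a", "c", "b"],
}

uniform_choice = admissible_permutations(labels, (0, 0, 0))
separating_choice = admissible_permutations(labels, (1, 0, 0))
```

With `s = (0, 0, 0)` the swap of heads 0 and 1 is admissible; choosing `s = (1, 0, 0)` uses a position where the two heads differ and leaves only the identity. The proof exploits this kind of freedom in selecting the positions $s_j$.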
[56]

All matrices $A_i^n$ and $\bar A_i^n$, for feasible $i$ and $n \in \mathbb{Z}$, are nonzero.
[57]

From $\theta$, the $h$ families $\{A_1^n\}_{n \in \mathbb{Z}}, \ldots, \{A_h^n\}_{n \in \mathbb{Z}}$ are pairwise distinct. The same condition holds for $\bar\theta$.
[58]

All matrices $W_i^Q, W_i^K, W_i^V, W_i^O$ and $\bar W_i^Q, \bar W_i^K, \bar W_i^V, \bar W_i^O$, for feasible $i$, are of rank $d_h$. If the two $\mathrm{MHA}_{\mathrm{RoPE}}$ maps are identical, then $h = \bar h$. Moreover, there exists $g \in G_{\mathrm{RoPE}}(d_h, h)$ such that $\bar\theta = g\theta$. Proof. For $i \in [h]$ and $m, n \ge 1$, define $A_i^{m,n} = A_i^{m-n}$ and $B_i := W_i^V (W_i^O)^\top$. Define $\bar A_i^{m,n}$ and $\bar B_i$ analogously. Then, one has MHA x;{{A m,n i }m,n, Bi}h ...
[59]

Compute the scalar constants $\eta_Q, \eta_K$ and the complex constants $\gamma_Q, \gamma_K$.
[60]

Define the objective function $g(x) = x\,\eta_Q + \dfrac{\eta_K}{x} - 4\sqrt{|\gamma_Q|^2 x + \dfrac{|\gamma_K|^2}{x} + 2\,\mathrm{Re}(\gamma_Q \bar\gamma_K)}$.
[61]

Find the minimizer $x^\star = \arg\min_{x>0} g(x)$ using a numerical optimization routine. Here we use Brent's method (Brent, 2013). The optimal solution is computed as $r^\star = \sqrt{x^\star}$, $\theta^\star = -\arg\left(r^\star \gamma_Q + \frac{1}{r^\star}\gamma_K\right)$, and finally $a = r^\star \cos(\theta^\star)$, $b = r^\star \sin(\theta^\star)$. This yields the optimal alignment matrix $U_j$ for each subspace $j$. G.2. Algorithm Description Algorithm 1 Attention Laye...
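The one-dimensional minimization in the steps above can be sketched as follows. Several things here are assumptions: the objective is written as $g(x) = x\eta_Q + \eta_K/x - 4\sqrt{|\gamma_Q|^2 x + |\gamma_K|^2/x + 2\mathrm{Re}(\gamma_Q\bar\gamma_K)}$, as read from the extracted text; the constants are made-up values, since their computation is not shown in this excerpt; and a dependency-free golden-section search over $\log x$ stands in for Brent's method (with SciPy available, `scipy.optimize.minimize_scalar` offers Brent-type routines). The search assumes $g$ is unimodal on the bracket, which holds for the toy constants used.

```python
import cmath
import math

def make_objective(eta_q, eta_k, gamma_q, gamma_k):
    """Build g(x) from the four constants (form assumed from the extracted text)."""
    cross = 2.0 * (gamma_q * gamma_k.conjugate()).real
    def g(x):
        inside = abs(gamma_q) ** 2 * x + abs(gamma_k) ** 2 / x + cross
        return x * eta_q + eta_k / x - 4.0 * math.sqrt(max(inside, 0.0))
    return g

def golden_section_min(f, lo, hi, iters=200):
    """Minimize a unimodal function on [lo, hi]; a stand-in for Brent's method."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    for _ in range(iters):
        if f(c) < f(d):
            hi = d
        else:
            lo = c
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return 0.5 * (lo + hi)

def solve_alignment(eta_q, eta_k, gamma_q, gamma_k):
    """Return (x_star, a, b) with r = sqrt(x_star), theta = -arg(r*gamma_q + gamma_k/r)."""
    g = make_objective(eta_q, eta_k, gamma_q, gamma_k)
    # Search over t = log(x) so that the constraint x > 0 holds automatically.
    t_star = golden_section_min(lambda t: g(math.exp(t)), -10.0, 10.0)
    x_star = math.exp(t_star)
    r = math.sqrt(x_star)
    theta = -cmath.phase(r * gamma_q + gamma_k / r)
    return x_star, r * math.cos(theta), r * math.sin(theta)

# Hypothetical constants, for illustration only.
x_star, a, b = solve_alignment(1.0, 1.0, 0.3 + 0.1j, 0.2 - 0.1j)
```

The pair `(a, b)` satisfies `a**2 + b**2 == x_star` up to rounding, matching the polar parameterization $a = r^\star\cos\theta^\star$, $b = r^\star\sin\theta^\star$.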
[62]

Pretraining is performed using the Adam optimizer with a batch size of 24 and an initial learning rate of $2.5 \cdot 10^{-4}$, following a cosine decay schedule without warmup, for a total of 60000 steps. During fine-tuning, we replace the pretrained attention modules with variants containing 4, 8, or 16 heads, and train for 60000 steps. WikiText103. For the WikiTex...
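For concreteness, a cosine decay schedule without warmup matching the stated pretraining settings could look like the sketch below. The final learning rate is not given in this excerpt, so `min_lr=0.0` is an assumption, and `cosine_lr` is a hypothetical helper rather than the authors' training code.

```python
import math

def cosine_lr(step, total_steps=60000, base_lr=2.5e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps, no warmup."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule_samples = [cosine_lr(s) for s in (0, 30000, 60000)]
```

The schedule starts at the initial rate, passes through half of it at the midpoint, and reaches the assumed floor at the final step.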

[1] [3]

doi: 10.18653/V1/P19-1285. URL https://doi.org/10.18653/v1/p19-1285.
DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024. doi: 10.48550/ARXIV.2405.04434. URL https://doi.org/10.48550/arXiv.2405.04434.
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via rein...

[2] [4]

URL https://openreview.net/forum?id=YicbFdNTTy.
Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. Essentially no barriers in neural network energy landscape. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80...

[3] [5]

URL http://proceedings.mlr.press/v80/draxler18a.html.
Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–

[4] [6]

PMLR, 2019.
Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. The role of permutation invariance in linear mode connectivity of neural networks. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

[5] [7]

URL https://openreview.net/forum?id=dNigytemkL.
Fefferman, C. and Markel, S. Recovering a feed-forward net from its output. In Cowan, J. D., Tesauro, G., and Alspector, J. (eds.), Advances in Neural Information Processing Systems 6, [7th NIPS Conference, Denver, Colorado, USA, 1993], pp. 335–342. Morgan Kaufmann, 1993.
Ferbach, D., Goujaud, B., Gidel, G...

[6] [8]

URL https://proceedings.mlr.press/v238/ferbach24a.html.
Frankle, J. Revisiting "qualitatively characterizing neural network optimization problems". CoRR, abs/2012.06898,

[7] [9]

URL https://arxiv.org/abs/2012.06898.
Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019....

[8] [10]

PMLR, 2017.
Gotmare, A., Keskar, N. S., Xiong, C., and Socher, R. Using mode connectivity for loss landscape analysis. CoRR, abs/1806.06977, 2018. URL http://arxiv.org/abs/1806.06977.
Guerrero-Peña, F. A., Medeiros, H. R., Dubail, T., Aminbeidokhti, M., Granger, E., and Pedersoli, M. Re-basin via implicit sinkhorn differentiation. In IEEE/CVF Conferen...

[9] [11]

URL https://openreview.net/forum?id=UqYNPyotxL.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview...

[10] [12]

URL https://openreview.net/forum?id=cUFIil6hEG.
Kozal, J., Wasilewski, J., Krawczyk, B., and Wozniak, M. Continual learning with weight interpolation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024, pp. 4187–4195. IEEE, 2024. doi: 10.1109/CVPRW63382.2024.00422. URL https://doi...

[11] [13]

doi: 10.1162/NECO.1994.6.3.543. URL https://doi.org/10.1162/neco.1994.6.3.543.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. URL https://doi.org/10.1109/5.726791.
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mende...

[12] [14]

URL https://openreview.net/forum?id=Bylx-TNKvH.
Pittorino, F., Ferraro, A., Perugini, G., Feinauer, C., Baldassi, C., and Zecchina, R. Deep networks on toroids: Removing symmetries reveals the structure of flat regions in the landscape geometry. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Co...

[13] [15]

URL https://proceedings.mlr.press/v162/pittorino22a.html.
Piziak, R. and Odell, P. L. Full rank factorization of matrices. Mathematics magazine, 72(3):193–201, 1999.
Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representatio...

[14] [17]

URL https://jmlr.org/papers/v20/18-674.html.
Vlaar, T. J. and Frankle, J. What can linear interpolation of neural network loss landscapes tell us? In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of...

[15] [18]

URL https://proceedings.mlr.press/v162/vlaar22a.html.
Wen, H., Cheng, H., Qiu, H., Wang, L., Pan, L., and Li, H. Optimizing mode connectivity for class incremental learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Haw...

[16] [19]

URL https://proceedings.mlr.press/v162/wortsman22a.html.
Xiao, T. Z., Liu, W., and Bamler, R. A compact representation for bayesian neural networks by removing permutation symmetry. CoRR, abs/2401.00611, 2024. doi: 10.48550/ARXIV.2401.00611. URL https://doi.org/10.48550/arXiv.2401.00611.
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu,...

[17] [20]

URL https://openreview.net/forum?id=SJgwzCEKwH.
Zheng, C., Gao, Y., Shi, H., Huang, M., Li, J., Xiong, J., Ren, X., Ng, M. K., Jiang, X., Li, Z., and Li, Y. DAPE: data-adaptive positional encoding for length extrapolation. In Globersons, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. M., and Zhang, C. (eds.), Advances in Neural Inform...

[18] [21]

Section 1 provides an introduction and related work on Linear Mode Connectivity. Related concepts, such as functional equivalence and alignment methods, are also introduced in connection with prior literature

[19] [22]

Section 2 reviews vanilla attention, including its parameter space, symmetry group, and a result from the literature (Theorem 2.1) that establishes complete functional equivalence for vanilla attention.

[20] [23]

Section 3 examines how positional encodings may alter the internal structure of attention, thereby rendering the analysis from the vanilla case no longer directly applicable. While absolute PEs of the additive type do not affect the structure, relative PEs (with particular emphasis on Rotary PE) fundamentally change the attention mechanism. The correspond...

[21] [24]

Section 4 focuses primarily on the RoPE case. First, we extend the RoPE setting to a general attention formulation that accommodates all cases of interest. In this formulation, the similarity score between two tokens at their specific positional indices is expressed as a bilinear form or quadratic norm. The result on functional equivalence of this setting...

[22] [25]

Section 5 introduces an alignment method that serves as a tool for examining linear mode connectivity (LMC) in attention-based models. We propose a two-stage alignment algorithm for multi-head attention layers, applicable to both standard MHA and MHA with RoPE. The first stage matches the ordering of attention heads between two models by solving a linear ...
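The first stage, matching the ordering of heads between two models, can be illustrated with a small sketch. The source text is cut off before naming the solver, so this example uses brute-force enumeration of permutations, which is feasible only for a handful of heads. The cost, a squared Frobenius distance between per-head invariant matrices such as $W_i^Q (W_i^K)^\top$, is also an assumption, and `match_heads` is a hypothetical name.

```python
from itertools import permutations

def frob_sq(A, B):
    """Squared Frobenius distance between two equally shaped nested-list matrices."""
    return sum((a - b) ** 2 for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def match_heads(inv_a, inv_b):
    """Return p such that head i of model A is paired with head p[i] of model B."""
    h = len(inv_a)
    best = min(
        permutations(range(h)),
        key=lambda p: sum(frob_sq(inv_a[i], inv_b[p[i]]) for i in range(h)),
    )
    return list(best)

# Toy per-head invariants (made-up 2x2 matrices); model B holds the same heads reordered.
M0 = [[1.0, 0.0], [0.0, 1.0]]
M1 = [[0.0, 2.0], [2.0, 0.0]]
M2 = [[3.0, 1.0], [1.0, -1.0]]
heads_a = [M0, M1, M2]
heads_b = [M2, M0, M1]

recovered = match_heads(heads_a, heads_b)
```

Here `recovered` maps each head of the first model to the position of its copy in the second. Enumerating all `h!` permutations does not scale beyond small `h`, so a dedicated assignment solver would be needed in practice.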

[23] [26]

Section 6 examines LMC under four re-initialization strategies, with emphasis on the first attention layer and full model resets, while intermediate cases are reported in the Appendix. Experiments are conducted across diverse Vision and NLP tasks. Ablation studies confirm the effectiveness of the two-stage matching algorithm in reducing barriers: Ablation...

[24] [27]

Section 7 summarizes our findings, discusses limitations, and outlines future directions. Appendix. The appendices provide complete proofs of the theoretical results in the main paper, the proposed matching algorithms, as well as additional experimen...

[25] [28]

Appendix B formally defines the attention mechanism and its parameter space, followed by a description of how positional encodings are incorporated into attention

[26] [29]

Appendix C briefly describes the symmetry structures of vanilla attention, attention with absolute PEs, and attention with relative PEs (with emphasis on RoPE)

[27] [30]

Appendix D introduces the general attention formulation. Theorem D.1, which is Theorem 4.1 in the main paper, establishes the functional equivalence of this general setting. The proof can be sketched as follows: starting from the softmax operator, we multiply through the denominators to rewrite the expression as an exponential polynomial, and then apply r...

[28] [31]

Appendix F applies the functional equivalence analysis of the general attention case to the specific setting of RoPE. Theorem F.1, corresponding to Theorem 4.2 in the main paper, provides the full details of this analysis. The proof proceeds as follows: RoPE is first reformulated as a special case of the general attention formulation via reparameterizatio...

[29] [32]

Appendix J.1 reports experiments on re-initializing only the first attention layer, highlighting its dominant role in shaping early representations

[30] [33]

Appendix J.2 investigates re-initialization of all attention layers, showing the cumulative effect of disrupting contextual interactions across the network

[31] [34]

Appendix J.3 studies re-initialization of the first Transformer layer, coupling attention and its adjacent feedforward block to examine early-layer sensitivity
