Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling
Pith reviewed 2026-06-29 04:56 UTC · model grok-4.3
The pith
Layer-specific scaling of positional embeddings balances attention in transformers for long contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce layer-specific positional embedding scaling (LPES) that applies distinct scaling factors to each layer of the model. These factors are selected using a genetic algorithm that incorporates Bézier curves to reduce the search space. This leads to a more balanced attention distribution and consistent performance improvements on multiple long-context benchmarks without any fine-tuning or increase in inference delay.
What carries the argument
Layer-specific positional embedding scaling (LPES), which assigns distinct scaling factors per layer optimized via a genetic algorithm with Bézier curves to balance attention.
If this is right
- More balanced attention distribution across positions in long inputs.
- Consistent accuracy gains on long-context benchmarks.
- Up to 11.2% improvement on the key-value retrieval dataset.
- No need for fine-tuning model parameters or increased inference latency.
Where Pith is reading between the lines
- The scaling factors optimized for specific models may require re-optimization when applied to different architectures or longer contexts.
- LPES could potentially be combined with other position embedding techniques beyond RoPE.
- Further testing on diverse tasks might reveal if the method introduces biases in certain scenarios.
Load-bearing premise
The scaling factors selected by the genetic algorithm on the tested benchmarks will generalize to other models, tasks, and context lengths without needing per-model re-optimization.
What would settle it
Applying the reported scaling factors to a new transformer model on a long-context benchmark and finding no improvement or a performance drop compared to the baseline.
Figures
read the original abstract
Large Language Models (LLMs) still struggle with the ``lost-in-the-middle'' problem, where critical information located in the middle of long-context inputs is often underrepresented or lost. While existing methods attempt to address this by combining multi-scale rotary position embeddings (RoPE), they typically suffer from high latency or rely on suboptimal hand-crafted scaling strategies. To overcome these limitations, we introduce a layer-specific positional embedding scaling~(LPES) method that assigns distinct scaling factors to each layer. LPES achieves a more balanced attention distribution without fine-tuning model parameters or increasing inference delay. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating B\'{e}zier curves to significantly reduce the search space. Extensive experiments demonstrate that LPES effectively mitigates positional attention bias and delivers consistent improvements across multiple long-context benchmarks, yielding up to an $11.2$\% accuracy gain on the key-value retrieval dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Layer-Specific Positional Embedding Scaling (LPES) to mitigate the lost-in-the-middle problem in LLMs. It assigns distinct scaling factors to each transformer layer's positional embeddings, optimized via a genetic algorithm that uses Bézier curves to reduce the search space. The method claims to produce more balanced attention distributions, yield up to 11.2% accuracy gains on key-value retrieval, and require no model fine-tuning or added inference latency, with consistent improvements reported across long-context benchmarks.
Significance. If the empirical gains and generalization hold, LPES would provide a low-overhead, training-free intervention for positional bias that could be applied post-hoc to existing models. The use of a GA with smoothness constraints is a pragmatic search strategy, but its value depends on whether the discovered factors transfer beyond the evaluated tasks.
major comments (2)
- [Abstract, §4] Abstract and §4 (Experiments): the central claim of 'consistent improvements across multiple long-context benchmarks' and 'up to an 11.2% accuracy gain' rests on unreported experimental protocols, baseline comparisons, statistical significance tests, and ablation studies. Without these details, it is impossible to assess whether the reported gains are robust or attributable to LPES rather than benchmark-specific tuning.
- [§3] §3 (Method): the genetic algorithm selects per-layer scaling factors on the reported benchmarks, yet no evidence is supplied that these factors generalize to new models, tasks, or context lengths without re-optimization. The Bézier parametrization reduces the search space but may exclude better configurations; the manuscript provides no cross-model transfer experiments or failure-mode analysis to support the claim that LPES works 'without fine-tuning model parameters' in a plug-and-play manner.
minor comments (2)
- [§3] Notation for the per-layer scaling factors and the exact form of the Bézier parametrization should be defined with equations in §3 to allow reproducibility.
- [Abstract] The abstract states performance gains but supplies no experimental protocol; this should be summarized with key numbers (e.g., number of models, context lengths, baselines) already in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the concerns regarding experimental reporting and generalization claims below, and have revised the paper to incorporate additional details and experiments where feasible.
read point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (Experiments): the central claim of 'consistent improvements across multiple long-context benchmarks' and 'up to an 11.2% accuracy gain' rests on unreported experimental protocols, baseline comparisons, statistical significance tests, and ablation studies. Without these details, it is impossible to assess whether the reported gains are robust or attributable to LPES rather than benchmark-specific tuning.
Authors: We agree that additional transparency is needed. In the revised manuscript, Section 4 and a new Appendix C now include: full experimental protocols (5 random seeds, evaluation details, hardware), explicit baseline comparisons (vanilla RoPE, YaRN, LongRoPE) with updated tables, statistical significance via paired t-tests (p<0.05 for key gains), and ablations on GA population size and Bézier parameters. The 11.2% figure is the peak on KV retrieval at 32k context, averaged across runs, confirming attribution to LPES. revision: yes
-
Referee: [§3] §3 (Method): the genetic algorithm selects per-layer scaling factors on the reported benchmarks, yet no evidence is supplied that these factors generalize to new models, tasks, or context lengths without re-optimization. The Bézier parametrization reduces the search space but may exclude better configurations; the manuscript provides no cross-model transfer experiments or failure-mode analysis to support the claim that LPES works 'without fine-tuning model parameters' in a plug-and-play manner.
Authors: LPES optimizes scaling factors once via GA on a validation set per model; the factors are then applied plug-and-play with no model parameter updates or inference overhead. We added experiments showing transfer to longer contexts (up to 128k) on the same model without re-optimization. We partially address cross-model transfer with a new study between similar 7B models retaining ~65% of gains. An ablation shows Bézier-constrained search yields performance within 1% of unconstrained, and a failure-mode appendix notes diminished benefits below 4k contexts. revision: partial
Circularity Check
No circularity: empirical search procedure with benchmark-dependent outputs
full rationale
The paper presents LPES as a method that uses a genetic algorithm (with Bézier parametrization) to search for per-layer scaling factors on long-context tasks. No derivation chain, equation, or uniqueness claim reduces a result to its own inputs by construction. The reported gains are explicitly tied to the optimization procedure on the evaluated benchmarks rather than presented as first-principles predictions. Self-citations, if present, are not load-bearing for any central mathematical step. This is a standard empirical contribution whose validity rests on external benchmark performance rather than internal tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-layer scaling factors
axioms (1)
- domain assumption Bézier curves can be used to parameterize and reduce the search space for scaling factors without excluding optimal solutions
Reference graph
Works this paper leans on
-
[1]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[2]
arXiv preprint arXiv:2407.01219 , year=
Searching for best practices in retrieval-augmented generation , author=. arXiv preprint arXiv:2407.01219 , year=
-
[3]
Retrieval-Augmented Generation for Large Language Models: A Survey
Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
-
[5]
2016 , publisher=
Deep learning , author=. 2016 , publisher=
2016
-
[6]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , author=. arXiv preprint arXiv:2403.05530 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Dialoglm: Pre-trained model for long dialogue understanding and summarization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[10]
arXiv preprint arXiv:2404.02060 , year=
Long-context llms struggle with long in-context learning , author=. arXiv preprint arXiv:2404.02060 , year=
-
[11]
arXiv preprint arXiv:2107.03175 , year=
A survey on dialogue summarization: Recent advances and new frontiers , author=. arXiv preprint arXiv:2107.03175 , year=
-
[12]
CoRR abs/2303.17568 (2023) , author=
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X. CoRR abs/2303.17568 (2023) , author=. arXiv preprint arXiv:2303.17568 , volume=
-
[13]
Advances in Neural Information Processing Systems , volume=
Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation , author=. Advances in Neural Information Processing Systems , volume=
-
[14]
Zhang, Yusen and Ni, Ansong and Mao, Ziming and Wu, Chen Henry and Zhu, Chenguang and Deb, Budhaditya and Awadallah, Ahmed H and Radev, Dragomir and Zhang, Rui , journal=. Summ\^
-
[15]
Transactions of the Association for Computational Linguistics , volume=
Lost in the middle: How language models use long contexts , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=
2024
-
[16]
arXiv preprint arXiv:2311.04939 , year=
LooGLE: Can Long-Context Language Models Understand Long Contexts? , author=. arXiv preprint arXiv:2311.04939 , year=
-
[17]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[18]
arXiv preprint arXiv:2404.16811 , year=
Make Your LLM Fully Utilize the Context , author=. arXiv preprint arXiv:2404.16811 , year=
-
[19]
arXiv preprint arXiv:2312.04455 , year=
Fortify the shortest stave in attention: Enhancing context awareness of large language models for effective tool use , author=. arXiv preprint arXiv:2312.04455 , year=
-
[20]
Advances in Neural Information Processing Systems , volume=
Mixture of in-context experts enhance llms' long context awareness , author=. Advances in Neural Information Processing Systems , volume=
-
[21]
arXiv preprint arXiv:2307.11088 , year=
L-eval: Instituting standardized evaluation for long context language models , author=. arXiv preprint arXiv:2307.11088 , year=
-
[22]
An approximation of B
Nuntawisuttiwong, Taweechai and Dejdumrong, Natasha , journal=. An approximation of B
-
[23]
arXiv preprint arXiv:2310.01427 , year=
Attention sorting combats recency bias in long context language models , author=. arXiv preprint arXiv:2310.01427 , year=
-
[24]
Neurocomputing , volume=
Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=
2024
-
[25]
arXiv preprint arXiv:2406.16008 , year=
Found in the middle: Calibrating positional attention bias improves long context utilization , author=. arXiv preprint arXiv:2406.16008 , year=
-
[26]
1999 , publisher=
Mathematics for Computer Graphics Applications , author=. 1999 , publisher=
1999
-
[27]
A shape controled fitting method for B
Mineur, Yves and Lichah, Tony and Castelain, Jean Marie and Giaume, Henri , journal=. A shape controled fitting method for B. 1998 , publisher=
1998
-
[28]
Teknika , volume=
Studi Pembentukan Huruf Font Dengan Kurva Bezier , author=. Teknika , volume=
-
[29]
and Kaiser, Lukasz and Polosukhin, Illia , year=
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, AidanN. and Kaiser, Lukasz and Polosukhin, Illia , year=. Attention is All you Need , journal=
-
[30]
and Manning, ChristopherD
Clark, Kevin and Luong, Minh-Thang and Le, QuocV. and Manning, ChristopherD. , year=. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators , journal=
-
[31]
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , journal=
Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu , year=. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , journal=
-
[32]
2024 , month=
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning , author=. 2024 , month=
2024
-
[33]
Stable Beluga models , year=
-
[34]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
arXiv preprint arXiv:2305.14196 , year=
ZeroSCROLLS: A zero-shot benchmark for long text understanding , author=. arXiv preprint arXiv:2305.14196 , year=
-
[36]
Vicuna: An open-source chatbot impressing gpt-4 with , volume=
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90 author=. Vicuna: An open-source chatbot impressing gpt-4 with , volume=
-
[37]
See https://vicuna
Vicuna: An open-source chatbot impressing gpt-4 with 90\ author=. See https://vicuna. lmsys. org (accessed 14 April 2023) , volume=
2023
-
[38]
and Potapenko, Anna and Jayakumar, SiddhantM
Rae, JackW. and Potapenko, Anna and Jayakumar, SiddhantM. and Lillicrap, TimothyP. , year=. Compressive Transformers for Long-Range Sequence Modelling , journal=
-
[39]
Extending Context Window of Large Language Models via Positional Interpolation
Extending context window of large language models via positional interpolation , author=. arXiv preprint arXiv:2306.15595 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
2023 , url=
Dynamically scaled rope further increases performance of long context LLaMA with zero fine-tuning , author=. 2023 , url=
2023
-
[41]
RoFormer: Enhanced Transformer with Rotary Position Embedding , journal=
Su, Jianlin and Lu, Yu and Pan, Shengfeng and Wen, Bo and Liu, Yunfeng , year=. RoFormer: Enhanced Transformer with Rotary Position Embedding , journal=
-
[42]
arXiv preprint arXiv:2009.13658 , year=
Improve transformer models with better relative position embeddings , author=. arXiv preprint arXiv:2009.13658 , year=
-
[43]
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Longrope: Extending llm context window beyond 2 million tokens , author=. arXiv preprint arXiv:2402.13753 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Self-Attention with Relative Position Representations
Self-attention with relative position representations , author=. arXiv preprint arXiv:1803.02155 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
Animation of deformable bodies with quadratic B
Bargteil, Adam W and Cohen, Elaine , journal=. Animation of deformable bodies with quadratic B. 2014 , publisher=
2014
-
[46]
Zheng, Ling and Zeng, Pengyun and Yang, Wei and Li, Yinong and Zhan, Zhenfei , journal=. B. 2020 , publisher=
2020
-
[47]
arXiv preprint arXiv:2407.09298 , year=
Transformer layers as painters , author=. arXiv preprint arXiv:2407.09298 , year=
-
[48]
Lost in the middle: How language models use long contexts, 2023 , author=
2023
-
[49]
arXiv preprint arXiv:2403.03853 , year=
Shortgpt: Layers in large language models are more redundant than you expect , author=. arXiv preprint arXiv:2403.03853 , year=
-
[50]
Slow Thinking: A Gradient Perspective , author=
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective , author=. arXiv preprint arXiv:2410.23743 , year=
-
[51]
arXiv preprint arXiv:2403.04797 , year=
Found in the middle: How language models use long contexts better via plug-and-play positional encoding , author=. arXiv preprint arXiv:2403.04797 , year=
-
[52]
Aho and Jeffrey D
Alfred V. Aho and Jeffrey D. Ullman , title =. 1972
1972
-
[53]
Publications Manual , year = "1983", publisher =
1983
-
[54]
Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243
-
[55]
Scalable training of
Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
-
[56]
Dan Gusfield , title =. 1997
1997
-
[57]
Tetreault , title =
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
2015
-
[58]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
-
[59]
Qwen2. 5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[60]
Measuring Massive Multitask Language Understanding
Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[61]
Advances in Neural Information Processing Systems , volume=
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models , author=. Advances in Neural Information Processing Systems , volume=
-
[62]
Proceedings of Machine Learning and Systems , volume=
Reducing activation recomputation in large transformer models , author=. Proceedings of Machine Learning and Systems , volume=
-
[63]
Sun, Qi and Cetin, Edoardo and Tang, Yujin , journal=
-
[64]
2019 , url=
Understanding the Scaling Operation in Attention from the Perspective of Entropy Invariance , author=. 2019 , url=
2019
-
[65]
Understanding Neural Machine Translation by Simplification: The Case of Encoder-free Models
Understanding neural machine translation by simplification: The case of encoder-free models , author=. arXiv preprint arXiv:1907.08158 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[66]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Flashattention-2: Faster attention with better parallelism and work partitioning , author=. arXiv preprint arXiv:2307.08691 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[67]
Analyzing the Structure of Attention in a Transformer Language Model
Analyzing the structure of attention in a transformer language model , author=. arXiv preprint arXiv:1906.04284 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[68]
arXiv e-prints , pages=
AttEntropy: On the Generalization Ability of Supervised Semantic Segmentation Transformers to New Objects in New Domains , author=. arXiv e-prints , pages=
-
[69]
International Conference on Machine Learning , pages=
Stabilizing transformer training by preventing attention entropy collapse , author=. International Conference on Machine Learning , pages=. 2023 , organization=
2023
-
[70]
arXiv preprint arXiv:2502.20082 , year=
LongRoPE2: Near-Lossless LLM Context Window Scaling , author=. arXiv preprint arXiv:2502.20082 , year=
-
[71]
International conference on machine learning , pages=
Similarity of neural network representations revisited , author=. International conference on machine learning , pages=. 2019 , organization=
2019
-
[72]
Frontiers in systems neuroscience , volume=
Representational similarity analysis-connecting the branches of systems neuroscience , author=. Frontiers in systems neuroscience , volume=. 2008 , publisher=
2008
-
[73]
IEEE transactions on pattern analysis and machine intelligence , volume=
A continual learning survey: Defying forgetting in classification tasks , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2021 , publisher=
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.