Position Bias Correction is Insufficient for One-Pass Attention Sorting

Qiong Tang; Xiangkun Hu; Xiangyang Liu; Yiran Chen; Yunfan Shao

arxiv: 2606.27793 · v1 · pith:UBLI4KTNnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 04:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords position bias · attention sorting · long-context models · one-pass sorting · debiasing · document reordering · containment accuracy · iterative refinement

The pith

Position bias correction fails to let one-pass attention sorting match iterative reordering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether estimating a position-bias curve from low-attention documents and subtracting or dividing it from raw attention scores can turn single-pass sorting into a cheap substitute for repeated sort-and-generate cycles. Experiments on LLaMA-2-7B-32K-Instruct and YaRN-Llama-2-7b-64k show the correction either leaves accuracy unchanged or raises it by only 8.67 points, still leaving a 14.84-point gap to full iterative sorting. This matters for deployment because multiple passes add latency and cost; if bias were the sole obstacle, one corrected pass would suffice, but the results indicate repeated reordering captures something extra.

Core claim

The authors hypothesize that position bias is the main reason single-pass attention sorting underperforms iterative versions. They implement Debiased One-Pass Attention Sorting by fitting a per-prompt bias curve exclusively on the low-attention majority of documents and applying subtraction or division corrections to the attention scores of all documents. On LLaMA-2-7B-32K-Instruct the corrected single pass yields exactly the same 94.83% containment accuracy as the uncorrected version. On YaRN-Llama-2-7b-64k the correction improves accuracy by 8.67 points yet remains 14.84 points behind iterative sorting, closing only 37% of the gap. The authors therefore conclude that position-bias correction is insufficient to match iterative sorting.

What carries the argument

Debiased One-Pass Attention Sorting, which derives a position-bias curve from the low-attention majority of a prompt and applies it to correct raw attention scores before a single sort.
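
The mechanism can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the curve's functional form, the low-attention cutoff, or the sort direction, so the quadratic least-squares fit, the 70% cutoff (borrowed from the example in the simulated rebuttal), the ascending sort that places the most-attended document last, and the names `fit_bias_curve` and `debiased_order` are all assumptions.

```python
def _solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_bias_curve(scores, low_frac=0.7, degree=2):
    """Least-squares polynomial fit on the lowest-attention documents of one prompt.

    scores[i] is the raw attention on the document at position i.
    low_frac and degree are illustrative assumptions, not values from the paper.
    Needs at least degree + 1 documents at distinct positions.
    """
    n = len(scores)
    xs = [i / max(n - 1, 1) for i in range(n)]  # positions scaled to [0, 1]
    n_low = min(n, max(degree + 1, round(low_frac * n)))
    low = sorted(range(n), key=lambda i: scores[i])[:n_low]
    # Normal equations of the polynomial least-squares problem.
    a = [[sum(xs[i] ** (r + c) for i in low) for c in range(degree + 1)]
         for r in range(degree + 1)]
    b = [sum(scores[i] * xs[i] ** r for i in low) for r in range(degree + 1)]
    coeffs = _solve(a, b)
    return [sum(k * x ** p for p, k in enumerate(coeffs)) for x in xs]


def debiased_order(scores, mode="subtract", eps=1e-8, low_frac=0.7, degree=2):
    """Return positions sorted by corrected attention, most attended last (assumed convention)."""
    bias = fit_bias_curve(scores, low_frac, degree)
    if mode == "subtract":
        corrected = [s - b for s, b in zip(scores, bias)]
    elif mode == "divide":
        corrected = [s / max(b, eps) for s, b in zip(scores, bias)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(range(len(scores)), key=lambda i: corrected[i])


# Synthetic example: a U-shaped bias with one relevant document mid-context.
n = 10
scores = [0.05 + 0.2 * (i / (n - 1) - 0.5) ** 2 for i in range(n)]
scores[5] += 0.04
print(debiased_order(scores)[-1])  # 5: the mid-context document ranks highest after correction
```

In this synthetic case the raw scores rank an edge document highest, while either correction mode promotes position 5. The paper's finding is that real attention patterns do not behave this cleanly.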

If this is right

On LLaMA-2-7B-32K-Instruct, debiasing produces identical containment accuracy to uncalibrated single-pass sorting.
On YaRN-Llama-2-7b-64k, debiasing raises accuracy by 8.67 percentage points but leaves a 14.84-point shortfall relative to iterative sorting.
Position-bias correction accounts for only 37% of the performance difference between one-pass and iterative Attention Sorting.
Repeated reordering must be capturing ordering information that survives after bias is removed.
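
The 37% figure can be checked from the two reported numbers, assuming the total gap between uncorrected one-pass and iterative sorting is the recovered 8.67 points plus the remaining 14.84 points (the summary does not state the baseline accuracies on YaRN-Llama-2-7b-64k directly). The helper name `gap_closed` is illustrative.

```python
def gap_closed(gain_pp: float, remaining_pp: float) -> float:
    """Fraction of the one-pass-to-iterative gap recovered by the correction."""
    total_gap_pp = gain_pp + remaining_pp  # assumed: gap = recovered + still missing
    return gain_pp / total_gap_pp


print(f"{gap_closed(8.67, 14.84):.1%}")  # prints 36.9%
```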

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other mechanisms such as attention redistribution across passes or cumulative context updating may be at work and could be isolated in follow-up ablations.
If the extra benefit of iteration is real, hybrid methods that run a few cheap passes only on the top-k candidates might recover most of the gain at lower cost than full iteration.
The finding suggests that long-context retrieval systems may need to retain some form of multi-pass refinement rather than relying on static per-prompt bias tables.

Load-bearing premise

The position-bias curve measured on low-attention documents accurately describes the bias that affects the high-attention documents whose order determines the final result.

What would settle it

A direct measurement showing that the corrected attention ranking on the top-attended documents differs from the ranking produced by the second or third iteration of full Attention Sorting on the same prompt.
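
One way such a comparison could be scored is sketched below. The choice of metrics (top-k set overlap and Kendall's tau over the shared top-k documents), the function names, and the convention that an order lists document ids from least to most attended are assumptions, not something the paper specifies.

```python
from itertools import combinations


def top_k(order, k):
    """order lists document ids from least to most attended; return the k most attended."""
    return list(reversed(order[-k:]))


def top_k_overlap(order_a, order_b, k):
    """Fraction of the top-k documents that the two rankings share."""
    return len(set(top_k(order_a, k)) & set(top_k(order_b, k))) / k


def kendall_tau(order_a, order_b, items):
    """Kendall tau-a over `items`, which must appear in both orders (no ties)."""
    rank_a = {doc: r for r, doc in enumerate(order_a)}
    rank_b = {doc: r for r, doc in enumerate(order_b)}
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two shared items")
    concordant = sum(
        (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0 for x, y in pairs
    )
    return (2 * concordant - len(pairs)) / len(pairs)


# Hypothetical rankings of six documents on the same prompt.
debiased = [3, 0, 4, 1, 2, 5]    # corrected single pass
iterative = [0, 3, 4, 2, 1, 5]   # after a later iteration of full sorting
shared = sorted(set(top_k(debiased, 3)) & set(top_k(iterative, 3)))
print(top_k_overlap(debiased, iterative, 3))
print(kendall_tau(debiased, iterative, shared))
```

High overlap with low tau would indicate that both methods find the same candidate documents but disagree on their order, which is the case the proposed measurement is meant to expose.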

Figures

Figures reproduced from arXiv: 2606.27793 by Qiong Tang, Xiangkun Hu, Xiangyang Liu, Yiran Chen, Yunfan Shao.

**Figure 2.** Accuracy and mean gold document position across sorting iterations.

Original abstract

Long-context language models suffer from position bias, where information in middle positions is underutilized. Attention Sorting addresses this by iteratively reordering documents based on attention patterns, but its multiple sort-and-generate cycles increase deployment cost. We hypothesize that position bias is the primary bottleneck and propose Debiased One-Pass Attention Sorting, which estimates a per-prompt position-bias curve from the low-attention majority of documents and uses it to correct raw attention scores (via subtraction or division) to enable single-pass sorting. Our experiments on two models refute this hypothesis in the tested setting: on LLaMA-2-7B-32K-Instruct, debiasing produces identical results to uncalibrated single-pass sorting (94.83% containment accuracy), while on YaRN-Llama-2-7b-64k, debiasing improves accuracy by 8.67 percentage points but remains 14.84pp behind iterative sorting, closing only 37% of the gap. These results suggest that position-bias correction is insufficient to match iterative sorting, and that repeated reordering provides additional benefits beyond bias correction.
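
The iterative baseline described in the abstract can be sketched as a loop over sort passes. This is a hedged illustration: `attention_fn` stands in for a model call that returns one attention score per document in the current order, the most-attended documents are assumed to be moved closest to the end of the context, and the toy scorer (relevance multiplied by a recency-weighted position factor) exists only to exercise the loop. It says nothing about real model behaviour.

```python
from typing import Callable, Dict, List, Sequence

AttentionFn = Callable[[Sequence[str]], List[float]]


def attention_sort(docs: Sequence[str], attention_fn: AttentionFn,
                   iterations: int = 1) -> List[str]:
    """Re-score and re-sort the documents `iterations` times, most attended last."""
    order = list(docs)
    for _ in range(iterations):
        scores = attention_fn(order)  # one score per document, in current order
        ranked = sorted(range(len(order)), key=lambda i: scores[i])  # stable sort
        order = [order[i] for i in ranked]
    return order


def make_toy_scorer(relevance: Dict[str, float]) -> AttentionFn:
    """Hypothetical scorer: relevance scaled by a factor that grows with position."""
    def scorer(order: Sequence[str]) -> List[float]:
        return [relevance[doc] * (pos + 1) for pos, doc in enumerate(order)]
    return scorer


relevance = {"gold": 2.6, "a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
docs = ["gold", "a", "b", "c", "d"]
scorer = make_toy_scorer(relevance)
print(attention_sort(docs, scorer, iterations=1))  # ['a', 'gold', 'b', 'c', 'd']
print(attention_sort(docs, scorer, iterations=2))  # ['a', 'b', 'c', 'd', 'gold']
```

In the toy run the gold document moves only one slot in the first pass and reaches the end in the second, which is the kind of per-iteration movement that each extra sort-and-generate cycle pays for.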

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Debiasing attention scores from low-attention docs does not close the gap to iterative sorting, but the paper gives almost no detail on how the correction was done.

The main point is that a per-prompt bias correction estimated only from low-attention documents fails to match the performance of repeated reordering on the two models tested. On LLaMA-2-7B-32K-Instruct the fix changes nothing (still 94.83% containment accuracy), and on YaRN-Llama-2-7b-64k it recovers 8.67 points but leaves a 14.84-point shortfall, closing just 37% of the gap. That is the concrete negative result.

The paper does one thing cleanly: it directly tests the hypothesis that position bias is the dominant bottleneck and can be removed with a cheap single-pass adjustment. Reporting the exact accuracy numbers on two different models makes the claim falsifiable and easy to check.

The soft spots are in the missing mechanics. The abstract says nothing about how the bias curve is fitted, what threshold defines the low-attention majority, or whether the same prompts were held constant across conditions. The stress-test concern lands: if high-attention documents experience different positional modulation than the low-attention majority, subtracting the fitted curve will leave residual bias, so the experiment does not isolate whether repeated passes add value beyond bias removal.

This is for people working on long-context retrieval efficiency who need to know whether one-pass approximations are viable. A reader already following attention-sorting work will find the negative result useful, but only after the fitting procedure and the representativeness assumption are spelled out.

It should go to peer review once those details are added; the core empirical question is worth referee time.

Referee Report

2 major / 1 minor

Summary. The paper claims that position bias is not the sole bottleneck for attention-based document sorting in long-context LLMs. It introduces Debiased One-Pass Attention Sorting, which fits a per-prompt position-bias curve exclusively from the low-attention majority of documents and applies subtraction or division corrections to raw attention scores to enable single-pass sorting. On LLaMA-2-7B-32K-Instruct this yields identical 94.83% containment accuracy to uncalibrated single-pass sorting; on YaRN-Llama-2-7b-64k it improves by 8.67pp but remains 14.84pp behind iterative sorting (closing only 37% of the gap). The authors conclude that repeated reordering supplies benefits beyond bias correction.

Significance. If the empirical results hold, the work indicates that iterative attention sorting captures advantages irreducible to position-bias correction, informing the design of efficient long-context methods. The concrete accuracy numbers reported on two models constitute a reproducible empirical contribution.

major comments (2)

[Experiments] The abstract reports concrete accuracy numbers (94.83%, 8.67pp, 14.84pp gap) on two models, yet the manuscript provides no details on how the bias curve is fitted, what data exclusion rules were used for the low-attention majority, or whether identical prompts were used across conditions. This is load-bearing for the central claim that the correction was properly isolated.
[Debiased One-Pass Attention Sorting] §3 (Debiased One-Pass method): the bias curve is estimated solely from the low-attention majority and assumed to represent the positional distortion on high-attention documents whose ordering determines final accuracy. If high-attention documents experience content-dependent or non-additive positional modulation, the correction leaves residual bias, so the experiment does not establish that position bias is the sole bottleneck.

minor comments (1)

[Method] The notation for the subtraction versus division correction variants would benefit from an explicit equation in the method description.
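
For illustration, one form such equations could take is given below. The notation is an assumption, not the paper's: $a_i$ is the raw attention on the document at position $i$, $\mathcal{L}$ is the low-attention majority, $\mathcal{F}$ is the chosen family of curves, $\hat{b}$ is the fitted bias curve, and $\epsilon$ is a small constant guarding the division. The least-squares fitting criterion is also assumed.

```latex
\hat{b} = \arg\min_{b \in \mathcal{F}} \sum_{i \in \mathcal{L}} \bigl(a_i - b(i)\bigr)^2,
\qquad
\tilde{a}_i^{\mathrm{sub}} = a_i - \hat{b}(i),
\qquad
\tilde{a}_i^{\mathrm{div}} = \frac{a_i}{\max\bigl(\hat{b}(i),\, \epsilon\bigr)}
```

Documents would then be sorted once by $\tilde{a}_i^{\mathrm{sub}}$ or $\tilde{a}_i^{\mathrm{div}}$.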

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where the manuscript will be revised for clarity and completeness.

Referee: [Experiments] The abstract reports concrete accuracy numbers (94.83%, 8.67pp, 14.84pp gap) on two models, yet the manuscript provides no details on how the bias curve is fitted, what data exclusion rules were used for the low-attention majority, or whether identical prompts were used across conditions. This is load-bearing for the central claim that the correction was properly isolated.

Authors: We agree that the current manuscript lacks sufficient implementation details for full reproducibility. In the revised version we will add an expanded experimental subsection that specifies: (1) the exact fitting procedure for the per-prompt position-bias curve (including functional form and optimization), (2) the precise exclusion rule used to define the low-attention majority (e.g., bottom 70% by mean attention score), and (3) confirmation that every compared condition was run on the identical prompt/document set. These additions will make the isolation of the debiasing effect transparent.

revision: yes

Referee: [Debiased One-Pass Attention Sorting] §3 (Debiased One-Pass method): the bias curve is estimated solely from the low-attention majority and assumed to represent the positional distortion on high-attention documents whose ordering determines final accuracy. If high-attention documents experience content-dependent or non-additive positional modulation, the correction leaves residual bias, so the experiment does not establish that position bias is the sole bottleneck.

Authors: We acknowledge that the assumption is an approximation and that content-dependent or non-additive positional effects on high-attention documents could leave residual bias after correction. This is a genuine limitation of the one-pass proxy. Nevertheless, the reported results show that even the best correction obtainable under the one-pass constraint closes only 37% of the gap to iterative sorting. We will add an explicit discussion paragraph noting this assumption and its possible incompleteness while emphasizing that the empirical shortfall still indicates benefits of repeated reordering beyond simple position-bias removal.

revision: partial

Circularity Check

0 steps flagged

No circularity in empirical evaluation of debiasing method

The paper is an empirical study that estimates a per-prompt bias curve from low-attention documents, applies correction (subtraction or division) to attention scores, and measures resulting containment accuracy against iterative sorting baselines on two models. No mathematical derivation, fitted parameter renamed as prediction, or self-citation chain is present; the central claim that correction is insufficient follows directly from the independent experimental measurements rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical hypothesis test; no free parameters, axioms, or invented entities are introduced beyond standard modeling assumptions already present in the cited attention-sorting baseline.
