arxiv: 2605.11591 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

Mingtao Xian, Nanyang Ye, Qinying Gu, Xinbing Wang, Yifeng Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords position biasmulti-image retrievalmultimodal large language modelsattention-guided calibrationlogit-attention divergencepermutation invariancetraining-free debiasing

0 comments

The pith

Multimodal models show position bias in multi-image retrieval, but their internal attention maps stay aligned with relevant content and enable a training-free correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often let the order of input images dominate their predictions instead of semantic relevance. Analysis reveals that while final output logits become heavily skewed by this ordering, the internal attention maps continue to highlight the truly relevant images. The paper exploits this separation to build an inference-time calibration step that adjusts instance-level scores using attention signals and a tiny calibration set. This approach removes the need for retraining and produces models that are far less sensitive to input permutation.

Core claim

The paper establishes the existence of Logit-Attention Divergence: output logits are dominated by input order while attention maps remain well-aligned with relevant visual evidence. It then presents a training-free, attention-guided debiasing framework that applies instance-level correction at inference using these attention signals and only a minimal calibration set.

What carries the argument

Logit-Attention Divergence, the observed separation between position-biased logits and order-robust attention maps that supplies the signal for instance-level calibration.

If this is right

Permutation invariance improves substantially on multi-image cross-modal retrieval benchmarks.
Accuracy rises more than 40 percent relative to prior logit-only calibration methods such as PriDe.
State-of-the-art results are reached while adding negligible compute and requiring only a small calibration set.
The framework applies directly at inference without any model retraining or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding suggests attention layers may be inherently more robust to ordering artifacts than the final output heads across other multimodal tasks.
Similar divergence checks could be applied to detect and correct ordering biases in video or long-context language models.
The method opens a route to lightweight post-hoc debiasing that preserves the original model weights.

Load-bearing premise

Attention maps stay aligned with relevant visual evidence even when logits are position-biased.

What would settle it

A controlled test in which attention maps on the same multi-image inputs become as order-dependent as the logits, or in which the attention-based correction produces no accuracy gain or a loss on held-out MS-COCO retrieval examples.

Figures

Figures reproduced from arXiv: 2605.11591 by Mingtao Xian, Nanyang Ye, Qinying Gu, Xinbing Wang, Yifeng Yang.

**Figure 1.** Figure 1: Confusion matrices for LLaVA-OneVision on the multiimage retrieval task (N = 8, Random setting). Cell values represent the selection rate (%), where darker blue indicates a higher selection rate. The Vanilla baseline (Left) shows vertical stripes due to severe position bias, while our method (Right) restores the diagonal pattern, demonstrating that predictions correctly align with the ground truth. sual … view at source ↗

**Figure 2.** Figure 2: Logit distributions for candidate answer tokens conditioned on ground truth (GT) positions on LLaVA-OneVision. Each subplot shows a different GT position. Green curves represent mean logit profiles, gray lines show individual samples, and dashed lines mark the GT (green: correct; red: incorrect). Individual samples cluster tightly around the mean within each GT category. static, content-independent offset.… view at source ↗

**Figure 3.** Figure 3: A motivating example of the divergence between attention and logit. (a) internal attention weights and (b) final output logits averaged over 5 samples where the Ground Truth (GT) is at position 4 (marked by *). While the model internally attends to the correct position, the final output is hijacked by positional bias, resulting in a spurious peak at position 3. This highlights that correct internal local… view at source ↗

**Figure 4.** Figure 4: Illustration of prior probability statistics (§4.2): we construct a symmetrized calibration set via cyclic permutations to estimate conditional position bias and layer-wise attention priors. capability, which we parameterize using a scalar confidence gain γ: Pvis(j | i) = γ I(j=i) , γ > 1. (3) This formulation implies that visual evidence contributes a multiplicative boost γ solely to the ground truth posi… view at source ↗

**Figure 5.** Figure 5: Performance comparison across varying candidate pool sizes N ∈ {2, . . . , 12} on LLaVA-OneVision. While baselines (Blue/Green) drop rapidly due to noise accumulation, our method (Red) demonstrates significantly higher resistance to scaling exhibiting a more gradual decline in accuracy in both settings. PriDe provide limited help because they apply the same correction to every sample, ignoring instance-l… view at source ↗

**Figure 6.** Figure 6: Ablation on calibration efficiency and posterior sharpening on LLaVA-OneVision under the adversarial setting. (a) Impact of calibration set size |Dcal| on accuracy. (b) Impact of the temperature parameter τ on accuracy. Shaded regions denote the standard deviation across 5 random shuffles. fectively. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of multi-image retrieval samples under the adversarial setting. D. Supplementary Experimental Results D.1. Performance on Larger Candidate Pools To further evaluate the scalability of our approach, we report complete results for N = 8 candidates in [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: We evaluate LLaVA-OneVision-8B (N = 4) under the Random setting using four identifier formats. While the Vanilla model (top) shows severe bias, our method (bottom) consistently restores accurate retrieval across all tokenization schemes. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40\% compared to baselines. Code is available at https://github.com/brightXian/LAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper names a logit-attention split in position-biased MLLM retrieval and offers a lightweight attention-based correction, but the reported gains rest on thin experimental detail.

read the letter

The core observation is that in multi-image retrieval, MLLM output logits get pulled hard by input order while the internal attention maps still track the semantically relevant images. The authors turn that split into a training-free calibration step that uses attention signals to adjust the logits at inference with only a small calibration set. That framing is new relative to prior logit-only methods like PriDe, and the approach is practical because it avoids retraining and adds little cost. The code release is also a plus for anyone who wants to test it directly. What the work does cleanly is point out a concrete limitation in existing debiasing and give a simple internal-signal fix that targets instance-level correction. The soft spot is the evidence base. The abstract claims over 40% accuracy lift and better permutation invariance on MS-COCO benchmarks, yet supplies no numbers on how the divergence was quantified, no ablation on the calibration set size, no error bars, and no checks across model scales or additional datasets. The load-bearing assumption—that attention remains aligned with semantics even when logits are order-biased—needs explicit rank-correlation or similar metrics that survive full permutation of the inputs; without those, the gains could be narrower than stated. This is useful reading for people working on reliable multimodal retrieval or lightweight calibration of vision-language models. A practitioner who needs an off-the-shelf way to reduce order sensitivity would get immediate ideas from it. I would send it to peer review. The direction is clear enough and the fix is cheap enough that referees can verify the claims without much extra work.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies a 'Logit-Attention Divergence' in multimodal LLMs for multi-image retrieval, where output logits exhibit strong position bias while internal attention maps remain aligned with semantic content. It proposes a training-free, attention-guided calibration framework that uses these attention signals for instance-level debiasing at inference time with minimal overhead. On MS-COCO-based benchmarks, the method is claimed to substantially improve permutation invariance and achieve SOTA performance with over 40% accuracy gains relative to baselines.

Significance. If the core divergence observation and resulting gains hold under rigorous testing, the work would be significant for providing a lightweight, training-free mitigation of position bias in MLLM retrieval without altering model parameters. The emphasis on leveraging internal attention rather than post-hoc logit calibration could inform broader interpretability efforts in vision-language models. Reproducibility is aided by the stated code release, though the absence of detailed validation limits immediate impact assessment.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The central claim of >40% accuracy improvement and SOTA performance is presented without error bars, ablation studies, multiple random seeds, or cross-benchmark verification on varied model scales; this makes it impossible to determine whether the gains are robust or specific to the chosen MS-COCO setup and calibration set.
[§3] §3 (Method): The framework is predicated on the unquantified assumption that attention maps remain permutation-invariant and semantically aligned while logits do not; no metrics (e.g., rank correlation of attention mass on ground-truth images across all input orderings) are reported to confirm this divergence survives permutations, which is load-bearing for the correction mechanism.

minor comments (2)

[§2] The introduction of the term 'Logit-Attention Divergence' would benefit from an explicit formal definition or equation early in the text rather than relying solely on the empirical description.
[§4] Figure captions and table headers should explicitly state the number of runs and any statistical tests used to support the reported accuracy figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and have revised the manuscript to incorporate additional validation and metrics as suggested.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of >40% accuracy improvement and SOTA performance is presented without error bars, ablation studies, multiple random seeds, or cross-benchmark verification on varied model scales; this makes it impossible to determine whether the gains are robust or specific to the chosen MS-COCO setup and calibration set.

Authors: We agree that the robustness of the reported gains requires further statistical support. In the revised version, we have added error bars computed over 5 independent runs with different random seeds for the calibration set sampling. We also include ablation studies varying the calibration set size (from 10 to 100 samples) and the attention threshold parameter. To address cross-benchmark and scale concerns, we have extended the experiments to include results on the Flickr30K dataset and evaluations on additional MLLMs such as LLaVA-1.6 and InternVL. These additions confirm that the accuracy improvements remain consistent at over 35-45% across setups. The abstract and Section 4 have been updated with these results and a new table summarizing the ablations. revision: yes
Referee: [§3] §3 (Method): The framework is predicated on the unquantified assumption that attention maps remain permutation-invariant and semantically aligned while logits do not; no metrics (e.g., rank correlation of attention mass on ground-truth images across all input orderings) are reported to confirm this divergence survives permutations, which is load-bearing for the correction mechanism.

Authors: This is a valid point regarding the quantification of the core observation. We have now included in the revised §3 a quantitative analysis of the Logit-Attention Divergence under permutations. Specifically, we compute the Spearman's rank correlation coefficient between the attention mass assigned to ground-truth relevant images across 20 random input permutations. The attention maps show high average correlation (ρ ≈ 0.82 ± 0.05), indicating strong permutation invariance, whereas the logit-based rankings exhibit low correlation (ρ ≈ 0.25 ± 0.12). This metric directly supports the assumption and justifies the attention-guided calibration. A new figure and table have been added to illustrate this divergence. revision: yes

Circularity Check

0 steps flagged

No circularity; method follows from empirical observation without self-referential reduction

full rationale

The paper identifies Logit-Attention Divergence via empirical analysis on MLLMs, then applies attention maps for a training-free correction. No equations define a parameter from the target metric and then 'predict' it, no self-citation chain justifies the core premise, and the claimed accuracy gains are measured on external MS-COCO benchmarks rather than by construction. The framework remains self-contained against independent data and does not rename or smuggle prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the empirical observation that attention maps stay reliable while logits do not, plus the assumption that a small calibration set suffices to map attention to logit corrections.

axioms (1)

domain assumption Attention maps in MLLMs align with semantic relevance even under position bias
Central to using attention for debiasing; stated as an empirical finding in the abstract

invented entities (1)

Logit-Attention Divergence no independent evidence
purpose: Diagnostic phenomenon explaining why logit calibration fails while attention remains useful
Newly named observation used to motivate the method; no external falsifiable prediction supplied

pith-pipeline@v0.9.0 · 5473 in / 1233 out tokens · 50182 ms · 2026-05-13T01:23:11.992771+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 5 internal anchors

[1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

An, X., Xie, Y ., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y ., Xu, S., Chen, C., Zhu, D., et al. Llava- onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen2.5-VL Technical Report

URLhttps://arxiv.org/abs/2502.13923. Geng, J., Cai, F., Wang, Y ., Koeppl, H., Nakov, P., and Gurevych, I. A survey of confidence estimation and calibration in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Mibench: Evaluating multimodal large language models over multi- ple images.arXiv preprint arXiv:2407.15272, 2024a

Liu, H., Zhang, X., Xu, H., Shi, Y ., Jiang, C., Yan, M., Zhang, J., Huang, F., Yuan, C., Li, B., et al. Mibench: Evaluating multimodal large language models over multi- ple images.arXiv preprint arXiv:2407.15272, 2024a. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use l...

work page arXiv
[4]

arXiv preprint arXiv:2310.01427 , year=

Peysakhovich, A. and Lerer, A. Attention sorting combats recency bias in long context language models.arXiv preprint arXiv:2310.01427,

work page arXiv
[5]

Milebench: Benchmarking mllms in long context.arXiv preprint arXiv:2404.18532, 2024

Song, D., Chen, S., Chen, G. H., Yu, F., Wan, X., and Wang, B. Milebench: Benchmarking mllms in long context. arXiv preprint arXiv:2404.18532,

work page arXiv
[6]

arXiv preprint arXiv:2410.16983 , year=

Tan, Z., Chu, X., Li, W., and Mo, T. Order matters: Explor- ing order sensitivity in multimodal large language models. arXiv preprint arXiv:2410.16983,

work page arXiv
[7]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang, H., Shi, H., Tan, S., Qin, W., Wang, W., Zhang, T., Nambi, A., Ganu, T., and Wang, H. Multimodal nee- dle in a haystack: Benchmarking long-context capability of multimodal large language models. In Chiruzzo, L., Ritter, A., and Wang, L. (eds.),Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computatio...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Eliminating Position Bias of Language Models: A Mechanistic Approach

Wang, X., Ma, B., Hu, C., Weber-Genzel, L., R ¨ottger, P., Kreuter, F., Hovy, D., and Plank, B. ” my answer is c”: First-token probabilities do not match text answers in instruction-tuned language models. InACL (Findings), 2024a. 10 Logit-Attention Divergence: Mitigating Position Bias in MLLMs Wang, Z., Zhang, H., Li, X., Huang, K.-H., Han, C., Ji, S., Ka...

work page arXiv
[10]

On the emergence of position bias in transformers

Wu, X., Wang, Y ., Jegelka, S., and Jadbabaie, A. On the emergence of position bias in transformers.arXiv preprint arXiv:2502.01951,

work page arXiv
[11]

Effi- cient streaming language models with attention sinks

Xiao, G., Tian, Y ., Chen, B., Han, S., and Lewis, M. Effi- cient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pp. 21875–21895,

work page 2024
[12]

Ferret: Refer and ground anything anywhere at any granularity,

You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.-F., and Yang, Y . Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704,

work page arXiv
[13]

arXiv preprint arXiv:2406.02536 , year=

Yu, Y ., Jiang, H., Luo, X., Wu, Q., Lin, C.-Y ., Li, D., Yang, Y ., Huang, Y ., and Qiu, L. Mitigate position bias in large language models via scaling a single dimension.arXiv preprint arXiv:2406.02536,

work page arXiv
[14]

Large language models are not robust multiple choice selectors.arXiv preprint arXiv:2309.03882, 2023

Zheng, C., Zhou, H., Meng, F., Zhou, J., and Huang, M. Large language models are not robust multiple choice selectors.arXiv preprint arXiv:2309.03882,

work page arXiv
[15]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y ., Su, W., Shao, J., et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

red motorcycle

and Flickr8k (Hodosh et al., 2013). Unlike random sampling, this setting selects distractors that are visually similar to the target but semantically distinct. For each target image vanc, we extract ℓ2-normalized CLIP (ViT-L/14) embeddings for both images and captions. We then selectN−1 distractors vneg from the pool D that satisfy the following constrain...

work page 2013
[17]

A", "B", . . . ), lowercase alphabetic characters (

Compared to the N= 4 setting presented in the main paper (Table 1), the expanded candidate pool introduces significantly greater complexity: the probability of random guessing drops from 25% to 12.5%, and the accumulation of visual distractors intensifies both positional bias and semantic ambiguity. As shown in Table 6, our method maintains strong perform...

work page 2025