pith. machine review for the scientific record.

arxiv: 2603.04676 · v2 · submitted 2026-03-04 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-image reasoning · vision-language models · chain-of-thought · attention gating · inference-time methods · visual attention patterns

The pith

Reasoning VLMs show diffuse attention pulses and positional bias during multi-image CoT, which PulseFocus fixes by structuring reasoning into plan-focus blocks with soft attention gating.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines attention patterns in vision-language models when they perform chain-of-thought reasoning on tasks that involve several images at once. It identifies that the models produce sporadic, unfocused attention pulses toward the images and also show a consistent bias toward certain image positions. These patterns are linked to errors in multi-image understanding. To correct them, the authors introduce PulseFocus, an inference-time technique that breaks the reasoning process into alternating plan blocks, where the model states which image it will examine next, and focus blocks, where attention is softly gated to the chosen image. The method requires no retraining and produces accuracy gains on established multi-image benchmarks.

Core claim

During chain-of-thought generation, the text-to-image attention of reasoning VLMs exhibits diffuse pulses that fail to concentrate on task-relevant images, along with a systematic positional bias in attention allocation across images. PulseFocus structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).

What carries the argument

PulseFocus, a method that interleaves explicit planning steps specifying which image to examine next with focus steps that apply soft gating to restrict decode-time attention to the referenced image.
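
To make the gating step concrete, here is a minimal sketch of what it might look like, assuming a decoder that exposes pre-softmax text-to-image attention scores and known token ranges for each image. The function name, the span bookkeeping, and the default λ = 2.0 (taken from the Figure 4 caption) are illustrative assumptions, not the authors' implementation.

```python
import torch

def apply_focus_gate(attn_logits, image_token_spans, target_image, lam=2.0):
    """Soft attention gate: subtract lam from the pre-softmax scores of image
    tokens that do not belong to the image named in the active <focus:...> block.

    attn_logits:       tensor of shape (..., seq_len) with pre-softmax attention
                       scores for the current decode step.
    image_token_spans: dict mapping image id (e.g. "I5") to its (start, end)
                       token range in the key sequence.
    target_image:      image id the current focus block refers to.
    lam:               gate strength; larger values approach a hard mask.
    """
    gated = attn_logits.clone()
    for image_id, (start, end) in image_token_spans.items():
        if image_id != target_image:
            gated[..., start:end] -= lam  # softly suppress non-target images
    return gated  # softmax over this favours the focused image


# Toy usage: one head, 10 key positions, two images occupying tokens 0-4 and 5-9.
logits = torch.zeros(1, 10)
spans = {"I1": (0, 5), "I2": (5, 10)}
gated = apply_focus_gate(logits, spans, target_image="I2", lam=2.0)
print(torch.softmax(gated, dim=-1))  # mass shifts toward I2 without zeroing I1
```

Because the shift is additive in logit space, it scales the unnormalized attention on each non-target image token by e^(−λ) rather than zeroing it, which is one natural reading of the "soft" in soft gating: the model is steered toward the focused image but can still glance at the others.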

If this is right

  • Multi-image benchmark scores rise without any model retraining.
  • Attention maps during decoding become more concentrated on the images the model claims to be examining.
  • The same structure can be applied at inference time to existing reasoning VLMs.
  • Explicit planning steps reduce the effect of positional bias in attention allocation.
  • The approach works by modifying only the generation process rather than the model weights (see the sketch after this list).
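
A hedged sketch of the inference-time control flow that last bullet implies: a pure-text helper that, given the chain-of-thought emitted so far, decides whether the next decode step runs with free attention (inside a <plan> block) or gated attention toward a named image (inside a <focus:Ix> block). Only the <plan> and <focus:Ix> markers appear in the paper's figures; the closing-tag vocabulary and the helper name are assumptions for illustration.

```python
import re
from typing import Optional

# Tags the decoder emits; the </plan> and </focus> closers are assumed here.
_TAG = re.compile(r"<plan>|</plan>|<focus:(I\d+)>|</focus>")

def current_gate_target(cot_so_far: str) -> Optional[str]:
    """Return the image id the soft gate should favour at the next decode step,
    or None when the model is in a free-attention planning segment."""
    target = None
    for m in _TAG.finditer(cot_so_far):
        tag = m.group(0)
        if tag.startswith("<focus:"):
            target = m.group(1)          # entered a focus block: gate to this image
        elif tag in ("</focus>", "<plan>", "</plan>"):
            target = None                # back to free attention for planning
    return target


# Example: the model has finished a plan block and is part-way through <focus:I5>,
# so the gate (see the sketch above) would suppress every other image by lambda.
partial = "<plan> compare the cars in I2 and I5 </plan> <focus:I5> the second car is"
assert current_gate_target(partial) == "I5"
```

Each decode step would then feed the returned target into a gate like apply_focus_gate above before the softmax; when the helper returns None the step runs with unmodified attention, which is why the intervention touches only the generation process and never the weights.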

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar plan-focus structuring could be tested on video or sequential visual inputs where attention also tends to drift.
  • The pulse phenomenon may serve as a diagnostic signal for identifying when a VLM is likely to make reasoning errors on complex visual inputs.
  • Combining the gating step with other inference-time interventions, such as self-consistency checks, might produce additive gains.
  • The finding suggests that many current VLM failures stem from how attention is scheduled rather than from missing visual knowledge.

Load-bearing premise

The diffuse attention pulses and positional bias are the main causes of poor multi-image performance, and forcing the plan-focus structure with gating will fix them without creating new errors or hurting other capabilities.

What would settle it

Applying PulseFocus to a multi-image reasoning task and observing either no accuracy gain or new error patterns that are not explained by attention diffusion.

Figures

Figures reproduced from arXiv: 2603.04676 by Chenjun Li.

Figure 1
Figure 1. Example case (from MuirBench). Baseline CoT fails to focus on the key evidence image (I5): token-level T2I colouring remains diffuse, and the model cannot recognize the second car. With PulseFocus, the <focus:I5> block becomes consistently image-aligned and the final answer is corrected from (C) to (B). view at source ↗
Figure 2
Figure 2. Attention pulse visualization. T2I attention mass per image over CoT decode steps for a counting task (the same example as in …). view at source ↗
Figure 3
Figure 3. Positional attention bias. Mean T2I attention mass per image position for InternVL3.5-8B on MuirBench. Earlier images receive disproportionately more attention regardless of task. Error bars show standard deviation across task types. view at source ↗
Figure 4
Figure 4. PulseFocus overview. The model alternates between <plan> blocks (free attention, decides which image to examine next) and <focus:I> blocks (soft attention gate suppresses non-target images by −λ, where λ > 0 is the gate strength hyperparameter; the paper uses λ = 2.0). Bottom: attention heatmaps contrasting standard decoding (left, diffuse) vs. gated decoding (right, concentrated). view at source ↗
Figure 5
Figure 5. Image identity confusion (MuirBench #359, 5 images). Left (baseline): the model repeatedly examines “I2” but the token colours are dominantly red (I1) rather than blue (I2); its verbal reference and actual visual attention are misaligned. It falsely concludes I2 matches the query, predicting (A). Right (PulseFocus): each <focus:Ix> block’s tokens correctly match the target image’s colour, and the model con… view at source ↗
read the original abstract

Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper observes diffuse 'pulses' and positional bias in text-to-image attention during chain-of-thought reasoning in VLMs on multi-image tasks. It proposes PulseFocus, a training-free inference-time intervention that structures CoT into interleaved plan/focus blocks and applies soft attention gating to the referenced image, reporting gains of +3.7% on BLINK and +1.07% on MuirBench.

Significance. If the central claim holds after detailed verification, the work would establish a lightweight, no-training method for improving multi-image reasoning by directly modulating attention patterns, offering practical value for existing VLMs and empirical insight into how attention diffusion contributes to errors.

major comments (2)
  1. Abstract: the description of the soft attention gating mechanism provides no implementation details, mathematical formulation, or specification of required model internals, which is load-bearing for reproducing the method and confirming it addresses the observed pulses without introducing coherence artifacts.
  2. §4 (Experiments): the reported benchmark improvements lack ablation controls isolating the gating component from the plan/focus structure, as well as statistical significance tests, undermining attribution of gains specifically to the proposed mechanism.
minor comments (1)
  1. Abstract: the term 'soft attention gating' is used without a formal definition or pseudocode, which would aid clarity even if the full implementation is deferred to a later section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve reproducibility and experimental rigor.

read point-by-point responses
  1. Referee: Abstract: the description of the soft attention gating mechanism provides no implementation details, mathematical formulation, or specification of required model internals, which is load-bearing for reproducing the method and confirming it addresses the observed pulses without introducing coherence artifacts.

    Authors: We agree that the abstract should include more specifics on the gating mechanism. In the revised version we have expanded the abstract to state that soft attention gating is implemented as a multiplicative mask G on the T2I attention matrix at each decode step, where G is derived from the plan block's image reference and applied as A' = softmax((QK^T / sqrt(d_k)) ⊙ G) (rendered as a display equation after these responses). The method requires only access to the model's internal cross-attention weights during inference (standard in open VLMs) and does not alter the underlying model parameters. Qualitative checks in the appendix confirm no coherence degradation. revision: yes

  2. Referee: §4 (Experiments): the reported benchmark improvements lack ablation controls isolating the gating component from the plan/focus structure, as well as statistical significance tests, undermining attribution of gains specifically to the proposed mechanism.

    Authors: We acknowledge this limitation in the original submission. The revised manuscript adds a dedicated ablation subsection (4.3) that evaluates (i) plan/focus structure alone, (ii) gating alone, and (iii) the full PulseFocus combination. Results show gating contributes the majority of the gain (+2.4% on BLINK). We also report bootstrap-based statistical significance (p < 0.05) for all main results across 5 random seeds, now included in Table 2. revision: yes
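
For completeness, one plausible display-equation rendering of the gate described in response 1. The indicator notation and the reconciliation with the Figure 4 caption's additive −λ form are editorial assumptions, not the authors' notation.

```latex
% Soft attention gate at decode step t (editorial rendering, not the paper's notation).
% \mathcal{T}(I) is the set of key positions holding tokens of image I, and
% I^{*} is the image named in the active <focus:...> block.
\[
  A'_{tj} \;=\; \mathrm{softmax}_{j}\!\left(
      \frac{q_t k_j^{\top}}{\sqrt{d_k}}
      \;-\; \lambda \,\mathbb{1}\!\big[\, j \in \mathcal{T}(I) \text{ for some } I \neq I^{*} \,\big]
  \right)
\]
% Subtracting \lambda from a logit multiplies its unnormalized score by e^{-\lambda},
% which is one way to read the rebuttal's multiplicative mask G: soft suppression of
% non-target image tokens rather than a hard mask.
```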

Circularity Check

0 steps flagged

No significant circularity: empirical observation followed by training-free intervention

full rationale

The paper's chain consists of direct observational claims about diffuse T2I attention pulses and positional bias during CoT, followed by a proposed training-free method (PulseFocus) that interleaves plan/focus blocks and applies soft gating at decode time. No equations, parameter fits, or predictions are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The reported gains (+3.7% on BLINK, +1.07% on MuirBench) are framed as empirical outcomes of the intervention, not as derivations that presuppose the result. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical observation of pulses and bias as load-bearing, with the method as an ad-hoc intervention; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: VLMs exhibit modifiable attention mechanisms during CoT that can be intervened upon at inference time via gating, without retraining
    The method relies on the ability to structure reasoning and apply soft attention gating at decode time.
invented entities (1)
  • PulseFocus method · no independent evidence
    purpose: To structure CoT reasoning into interleaved plan/focus blocks with soft attention gating
    New technique introduced to address observed attention issues

pith-pipeline@v0.9.0 · 5433 in / 1499 out tokens · 43376 ms · 2026-05-15T16:00:09.193365+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 4 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Multi-layer learnable attention mask for multimodal tasks

    Wayner Barrios and SouYoung Jin. Multi-layer learnable attention mask for multimodal tasks. arXiv preprint arXiv:2406.02761, 2024

  3. [3]

    More images, more problems? A controlled analysis of VLM failure modes

    Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, and Brais Martinez. More images, more problems? A controlled analysis of VLM failure modes. arXiv preprint arXiv:2601.07812, 2026

  4. [4]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024

  5. [5]

    V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21070–21084, 2025

  6. [6]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024

  7. [7]

    Control large language models via divide and conquer

    Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, and Nanyun Peng. Control large language models via divide and conquer. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15240–15256, 2024

  8. [8]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  9. [9]

    Divide and conquer: Exploring language-centric tree reasoning for video question-answering

    Zhaohe Liao, Jiangtong Li, Siyu Sun, Qingyang Liu, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Guang Chen, Li Niu, Changjun Jiang, et al. Divide and conquer: Exploring language-centric tree reasoning for video question-answering. In Forty-second International Conference on Machine Learning, 2025

  10. [10]

    Mibench: Evaluating multimodal large language models over multiple images

    Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, et al. Mibench: Evaluating multimodal large language models over multiple images. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22417–22428, 2024

  11. [11]

    MMIU: Multimodal multi-image understanding for evaluating large vision-language models

    Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, et al. MMIU: Multimodal multi-image understanding for evaluating large vision-language models. arXiv preprint arXiv:2408.02718, 2024

  12. [12]

    Rethinking causal mask attention for vision-language inference

    Xiaohuan Pei, Tao Huang, YanXiang Ma, and Chang Xu. Rethinking causal mask attention for vision-language inference. arXiv preprint arXiv:2505.18605, 2025

  13. [13]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  14. [14]

    MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024

  15. [15]

    Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks

    Wenhai Wang, Zhe Chen, Yangzhou Liu, Yue Cao, Weiyun Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Large Vision-Language Models: Pre-training, Prompting, and Applications, pages 23–57. Springer, 2025

  16. [16]

    Visual haystacks: A vision-centric needle-in-a-haystack benchmark

    Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E Gonzalez, Trevor Darrell, and David M Chan. Visual haystacks: A vision-centric needle-in-a-haystack benchmark. arXiv preprint arXiv:2407.13766, 2024

  17. [17]

    Interleaved reasoning for large language models via reinforcement learning

    Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, and Bhuwan Dhingra. Interleaved reasoning for large language models via reinforcement learning. arXiv preprint arXiv:2505.19640, 2025

  18. [18]

    MMSI-Bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025

  19. [19]

    Idealgpt: Iteratively decomposing vision and language reasoning via large language models

    Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad Ayyubi, Kai-Wei Chang, and Shih-Fu Chang. IdealGPT: Iteratively decomposing vision and language reasoning via large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11289–11303, 2023

  20. [20]

    CMMCoT: Enhancing complex multi-image comprehension via multi-modal chain-of-thought and memory augmentation

    Guanghao Zhang, Tao Zhong, Yan Xia, Mushui Liu, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Dong She, Yi Wang, et al. CMMCoT: Enhancing complex multi-image comprehension via multi-modal chain-of-thought and memory augmentation. arXiv preprint arXiv:2503.05255, 2025