pith. machine review for the scientific record.

arxiv: 2603.04676 · v2 · submitted 2026-03-04 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-image reasoning · vision-language models · chain-of-thought · attention gating · inference-time methods · visual attention patterns

The pith

Reasoning VLMs show diffuse attention pulses and positional bias during multi-image CoT, which PulseFocus fixes by structuring reasoning into plan-focus blocks with soft attention gating.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines attention patterns in vision-language models when they perform chain-of-thought reasoning on tasks that involve several images at once. It identifies that the models produce sporadic, unfocused attention pulses toward the images and also show a consistent bias toward certain image positions. These patterns are linked to errors in multi-image understanding. To correct them, the authors introduce PulseFocus, an inference-time technique that breaks the reasoning process into alternating plan blocks, where the model states which image it will examine next, and focus blocks, where attention is softly gated to the chosen image. The method requires no retraining and produces accuracy gains on established multi-image benchmarks.

Core claim

During chain-of-thought generation, the text-to-image attention of reasoning VLMs exhibits diffuse pulses that fail to concentrate on task-relevant images, along with a systematic positional bias in attention allocation across images. PulseFocus structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).

What carries the argument

PulseFocus, a method that interleaves explicit planning steps specifying which image to examine next with focus steps that apply soft gating to restrict decode-time attention to the referenced image.
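
To make the gating step concrete, here is a minimal sketch of what it might look like, assuming a decoder that exposes pre-softmax text-to-image attention scores and known token ranges for each image. The function name, the span bookkeeping, and the default λ = 2.0 (taken from the Figure 4 caption) are illustrative assumptions, not the authors' implementation.

```python
import torch

def apply_focus_gate(attn_logits, image_token_spans, target_image, lam=2.0):
    """Soft attention gate: subtract lam from the pre-softmax scores of image
    tokens that do not belong to the image named in the active <focus:...> block.

    attn_logits:       tensor of shape (..., seq_len) with pre-softmax attention
                       scores for the current decode step.
    image_token_spans: dict mapping image id (e.g. "I5") to its (start, end)
                       token range in the key sequence.
    target_image:      image id the current focus block refers to.
    lam:               gate strength; larger values approach a hard mask.
    """
    gated = attn_logits.clone()
    for image_id, (start, end) in image_token_spans.items():
        if image_id != target_image:
            gated[..., start:end] -= lam  # softly suppress non-target images
    return gated  # softmax over this favours the focused image


# Toy usage: one head, 10 key positions, two images occupying tokens 0-4 and 5-9.
logits = torch.zeros(1, 10)
spans = {"I1": (0, 5), "I2": (5, 10)}
gated = apply_focus_gate(logits, spans, target_image="I2", lam=2.0)
print(torch.softmax(gated, dim=-1))  # mass shifts toward I2 without zeroing I1
```

Because the shift is additive in logit space, it scales the unnormalized attention on each non-target image token by e^(−λ) rather than zeroing it, which is one natural reading of the "soft" in soft gating: the model is steered toward the focused image but can still glance at the others.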

If this is right

  • Multi-image benchmark scores rise without any model retraining.
  • Attention maps during decoding become more concentrated on the images the model claims to be examining.
  • The same structure can be applied at inference time to existing reasoning VLMs.
  • Explicit planning steps reduce the effect of positional bias in attention allocation.
  • The approach works by modifying only the generation process rather than the model weights (see the sketch after this list).
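
A hedged sketch of the inference-time control flow that last bullet implies: a pure-text helper that, given the chain-of-thought emitted so far, decides whether the next decode step runs with free attention (inside a <plan> block) or gated attention toward a named image (inside a <focus:Ix> block). Only the <plan> and <focus:Ix> markers appear in the paper's figures; the closing-tag vocabulary and the helper name are assumptions for illustration.

```python
import re
from typing import Optional

# Tags the decoder emits; the </plan> and </focus> closers are assumed here.
_TAG = re.compile(r"<plan>|</plan>|<focus:(I\d+)>|</focus>")

def current_gate_target(cot_so_far: str) -> Optional[str]:
    """Return the image id the soft gate should favour at the next decode step,
    or None when the model is in a free-attention planning segment."""
    target = None
    for m in _TAG.finditer(cot_so_far):
        tag = m.group(0)
        if tag.startswith("<focus:"):
            target = m.group(1)          # entered a focus block: gate to this image
        elif tag in ("</focus>", "<plan>", "</plan>"):
            target = None                # back to free attention for planning
    return target


# Example: the model has finished a plan block and is part-way through <focus:I5>,
# so the gate (see the sketch above) would suppress every other image by lambda.
partial = "<plan> compare the cars in I2 and I5 </plan> <focus:I5> the second car is"
assert current_gate_target(partial) == "I5"
```

Each decode step would then feed the returned target into a gate like apply_focus_gate above before the softmax; when the helper returns None the step runs with unmodified attention, which is why the intervention touches only the generation process and never the weights.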

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar plan-focus structuring could be tested on video or sequential visual inputs where attention also tends to drift.
  • The pulse phenomenon may serve as a diagnostic signal for identifying when a VLM is likely to make reasoning errors on complex visual inputs.
  • Combining the gating step with other inference-time interventions, such as self-consistency checks, might produce additive gains.
  • The finding suggests that many current VLM failures stem from how attention is scheduled rather than from missing visual knowledge.

Load-bearing premise

The diffuse attention pulses and positional bias are the main causes of poor multi-image performance, and forcing the plan-focus structure with gating will fix them without creating new errors or hurting other capabilities.

What would settle it

Applying PulseFocus to a multi-image reasoning task and observing either no accuracy gain or new error patterns that are not explained by attention diffusion.

Figures

Figures reproduced from arXiv: 2603.04676 by Chenjun Li.

Figure 1
Figure 1. Example case (from MuirBench). Baseline CoT fails to focus on the key evidence image (I5): token-level T2I colouring remains diffuse, and the model cannot recognize the second car. With PulseFocus, the <focus:I5> block becomes consistently image-aligned and the final answer is corrected from (C) to (B). view at source ↗
Figure 2
Figure 2. Attention pulse visualization. T2I attention mass per image over CoT decode steps for a counting task (the same example as in …). view at source ↗
Figure 3
Figure 3. Positional attention bias. Mean T2I attention mass per image position for InternVL3.5-8B on MuirBench. Earlier images receive disproportionately more attention regardless of task. Error bars show standard deviation across task types. view at source ↗
Figure 4
Figure 4. PulseFocus overview. The model alternates between <plan> blocks (free attention, decides which image to examine next) and <focus:I> blocks (soft attention gate suppresses non-target images by −λ, where λ > 0 is the gate strength hyperparameter; the paper uses λ = 2.0). Bottom: attention heatmaps contrasting standard decoding (left, diffuse) vs. gated decoding (right, concentrated). view at source ↗
Figure 5
Figure 5. Image identity confusion (MuirBench #359, 5 images). Left (baseline): the model repeatedly examines “I2” but the token colours are dominantly red (I1) rather than blue (I2); its verbal reference and actual visual attention are misaligned. It falsely concludes I2 matches the query, predicting (A). Right (PulseFocus): each <focus:Ix> block’s tokens correctly match the target image’s colour, and the model con… view at source ↗
read the original abstract

Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper observes diffuse 'pulses' and positional bias in text-to-image attention during chain-of-thought reasoning in VLMs on multi-image tasks. It proposes PulseFocus, a training-free inference-time intervention that structures CoT into interleaved plan/focus blocks and applies soft attention gating to the referenced image, reporting gains of +3.7% on BLINK and +1.07% on MuirBench.

Significance. If the central claim holds after detailed verification, the work would establish a lightweight, no-training method for improving multi-image reasoning by directly modulating attention patterns, offering practical value for existing VLMs and empirical insight into how attention diffusion contributes to errors.

major comments (2)
  1. Abstract: the description of the soft attention gating mechanism provides no implementation details, mathematical formulation, or specification of required model internals, which is load-bearing for reproducing the method and confirming it addresses the observed pulses without introducing coherence artifacts.
  2. §4 (Experiments): the reported benchmark improvements lack ablation controls isolating the gating component from the plan/focus structure, as well as statistical significance tests, undermining attribution of gains specifically to the proposed mechanism.
minor comments (1)
  1. Abstract: the term 'soft attention gating' is used without a formal definition or pseudocode, which would aid clarity even if the full implementation is deferred to a later section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve reproducibility and experimental rigor.

read point-by-point responses
  1. Referee: Abstract: the description of the soft attention gating mechanism provides no implementation details, mathematical formulation, or specification of required model internals, which is load-bearing for reproducing the method and confirming it addresses the observed pulses without introducing coherence artifacts.

    Authors: We agree that the abstract should include more specifics on the gating mechanism. In the revised version we have expanded the abstract to state that soft attention gating is implemented as a multiplicative mask G on the T2I attention matrix at each decode step, where G is derived from the plan block's image reference and applied as A' = softmax((QK^T / sqrt(d_k)) ⊙ G) (rendered as a display equation after these responses). The method requires only access to the model's internal cross-attention weights during inference (standard in open VLMs) and does not alter the underlying model parameters. Qualitative checks in the appendix confirm no coherence degradation. revision: yes

  2. Referee: §4 (Experiments): the reported benchmark improvements lack ablation controls isolating the gating component from the plan/focus structure, as well as statistical significance tests, undermining attribution of gains specifically to the proposed mechanism.

    Authors: We acknowledge this limitation in the original submission. The revised manuscript adds a dedicated ablation subsection (4.3) that evaluates (i) plan/focus structure alone, (ii) gating alone, and (iii) the full PulseFocus combination. Results show gating contributes the majority of the gain (+2.4% on BLINK). We also report bootstrap-based statistical significance (p < 0.05) for all main results across 5 random seeds, now included in Table 2. revision: yes
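
For completeness, one plausible display-equation rendering of the gate described in response 1. The indicator notation and the reconciliation with the Figure 4 caption's additive −λ form are editorial assumptions, not the authors' notation.

```latex
% Soft attention gate at decode step t (editorial rendering, not the paper's notation).
% \mathcal{T}(I) is the set of key positions holding tokens of image I, and
% I^{*} is the image named in the active <focus:...> block.
\[
  A'_{tj} \;=\; \mathrm{softmax}_{j}\!\left(
      \frac{q_t k_j^{\top}}{\sqrt{d_k}}
      \;-\; \lambda \,\mathbb{1}\!\big[\, j \in \mathcal{T}(I) \text{ for some } I \neq I^{*} \,\big]
  \right)
\]
% Subtracting \lambda from a logit multiplies its unnormalized score by e^{-\lambda},
% which is one way to read the rebuttal's multiplicative mask G: soft suppression of
% non-target image tokens rather than a hard mask.
```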

Circularity Check

0 steps flagged

No significant circularity: empirical observation followed by training-free intervention

full rationale

The paper's chain consists of direct observational claims about diffuse T2I attention pulses and positional bias during CoT, followed by a proposed training-free method (PulseFocus) that interleaves plan/focus blocks and applies soft gating at decode time. No equations, parameter fits, or predictions are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The reported gains (+3.7% on BLINK, +1.07% on MuirBench) are framed as empirical outcomes of the intervention, not as derivations that presuppose the result. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical observation of pulses and bias as load-bearing, with the method as an ad-hoc intervention; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: VLMs exhibit modifiable attention mechanisms during CoT that can be intervened upon at inference time via gating, without retraining
    The method relies on the ability to structure reasoning and apply soft attention gating at decode time.
invented entities (1)
  • PulseFocus method · no independent evidence
    purpose: To structure CoT reasoning into interleaved plan/focus blocks with soft attention gating
    New technique introduced to address observed attention issues

pith-pipeline@v0.9.0 · 5433 in / 1499 out tokens · 43376 ms · 2026-05-15T16:00:09.193365+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 4 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Multi-layer learnable attention mask for multimodal tasks

    Wayner Barrios and SouYoung Jin. Multi-layer learnable attention mask for multimodal tasks. arXiv preprint arXiv:2406.02761, 2024

  3. [3]

    More images, more problems? A controlled analysis of VLM failure modes

    Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, and Brais Martinez. More images, more problems? A controlled analysis of VLM failure modes. arXiv preprint arXiv:2601.07812, 2026

  4. [4]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024

  5. [5]

    V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21070–21084, 2025

  6. [6]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024

  7. [7]

    Control large language models via divide and conquer

    Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, and Nanyun Peng. Control large language models via divide and conquer. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15240–15256, 2024

  8. [8]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  9. [9]

    Divide and conquer: Exploring language-centric tree reasoning for video question-answering

    Zhaohe Liao, Jiangtong Li, Siyu Sun, Qingyang Liu, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Guang Chen, Li Niu, Changjun Jiang, et al. Divide and conquer: Exploring language-centric tree reasoning for video question-answering. In Forty-second International Conference on Machine Learning, 2025

  10. [10]

    Mibench: Evaluating multimodal large language models over multiple images

    Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, et al. Mibench: Evaluating multimodal large language models over multiple images. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22417–22428, 2024

  11. [11]

    MMIU: Multimodal multi-image understanding for evaluating large vision-language models

    Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, et al. MMIU: Multimodal multi-image understanding for evaluating large vision-language models. arXiv preprint arXiv:2408.02718, 2024

  12. [12]

    Rethinking causal mask attention for vision-language inference

    Xiaohuan Pei, Tao Huang, YanXiang Ma, and Chang Xu. Rethinking causal mask attention for vision-language inference. arXiv preprint arXiv:2505.18605, 2025

  13. [13]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  14. [14]

    MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024

  15. [15]

    Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks

    Wenhai Wang, Zhe Chen, Yangzhou Liu, Yue Cao, Weiyun Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Large Vision-Language Models: Pre-training, Prompting, and Applications, pages 23–57. Springer, 2025

  16. [16]

    Visual haystacks: A vision-centric needle-in-a-haystack benchmark

    Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E Gonzalez, Trevor Darrell, and David M Chan. Visual haystacks: A vision-centric needle-in-a-haystack benchmark. arXiv preprint arXiv:2407.13766, 2024

  17. [17]

    Interleaved reasoning for large language models via reinforcement learning

    Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, and Bhuwan Dhingra. Interleaved reasoning for large language models via reinforcement learning. arXiv preprint arXiv:2505.19640, 2025

  18. [18]

    MMSI-Bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025

  19. [19]

    Idealgpt: Iteratively decomposing vision and language reasoning via large language models

    Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad Ayyubi, Kai-Wei Chang, and Shih-Fu Chang. IdealGPT: Iteratively decomposing vision and language reasoning via large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11289–11303, 2023

  20. [20]

    CMMCoT: Enhancing complex multi-image comprehension via multi-modal chain-of-thought and memory augmentation

    Guanghao Zhang, Tao Zhong, Yan Xia, Mushui Liu, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Dong She, Yi Wang, et al. CMMCoT: Enhancing complex multi-image comprehension via multi-modal chain-of-thought and memory augmentation. arXiv preprint arXiv:2503.05255, 2025