pith. machine review for the scientific record.

arxiv: 2604.22498 · v1 · submitted 2026-04-24 · 💻 cs.CV · cs.AI

Recognition: unknown

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-image understanding · multimodal large language models · contrastive learning · visual grounding · spatial reasoning · fine-grained visual understanding · GRPO · object constancy

The pith

Compositional Grounded Contrast improves fine-grained multi-image understanding in multimodal models by building contrastive examples from single-image annotations plus spatial rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often fail at tasks involving several images at once, mixing up positions, losing track of objects, or leaking attention across pictures. The paper shows how to fix this without collecting costly new multi-image labels or chain-of-thought data. It creates training examples by contrasting one image against another to teach discrimination and by contrasting views of the same scene to teach object constancy. A rule-based reward then guides the model to ground answers correctly in space during reinforcement learning. The resulting capability not only lifts scores on dedicated multi-image tests but also improves performance on broader reasoning benchmarks.
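
The paper's construction pipeline is not reproduced here, but the description above suggests a simple recipe. The sketch below is a minimal, hypothetical illustration of an inter-image contrast instance built from single-image grounding annotations; the record fields, the category-based decoupling heuristic, and the prompt template are assumptions for illustration, not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical single-image grounding record; field names are illustrative only.
@dataclass
class GroundingAnnotation:
    image_path: str
    phrase: str      # referring expression, e.g. "the red mug on the left"
    bbox: tuple      # (x1, y1, x2, y2) in pixel coordinates
    category: str    # coarse object category, used as a crude decoupling signal

def make_inter_image_instance(target: GroundingAnnotation,
                              pool: list,
                              num_distractors: int = 1) -> dict:
    """Compose one multi-image training instance: the target image plus
    distractor images whose categories differ from the target's, standing in
    for 'semantically decoupled' distractor contexts."""
    candidates = [a for a in pool if a.category != target.category]
    distractors = random.sample(candidates, k=num_distractors)

    images = [target.image_path] + [d.image_path for d in distractors]
    random.shuffle(images)
    source_index = images.index(target.image_path)  # which image holds the answer

    question = (f"Across these {len(images)} images, locate '{target.phrase}'. "
                "Answer with the source image index and a bounding box.")
    answer = {"image_index": source_index, "bbox": target.bbox}
    return {"images": images, "question": question, "answer": answer}
```

An intra-image contrast instance would follow the same pattern but pair correlated views of the same scene, so the model must re-identify the same object across views rather than discriminate between unrelated ones.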

Core claim

CGC constructs compositional multi-image training instances through Inter-Image Contrast, which introduces semantically decoupled distractor contexts for cross-image discrimination, and Intra-Image Contrast, which supplies correlated cross-view samples for object constancy. It further adds a Rule-Based Spatial Reward inside the GRPO framework to enforce source-image attribution, spatial alignment, and valid structured output under a Think-before-Grounding paradigm. This combination, built entirely on existing single-image grounding annotations, yields state-of-the-art results on fine-grained multi-image benchmarks including MIG-Bench and VLM2-Bench, and the learned capability transfers to broader multimodal understanding and reasoning tasks with consistent gains over the Qwen3-VL-8B base model.
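
The abstract does not state the reward in closed form. As a hedged formalization, consistent with the three criteria named above but with the weights and the IoU-based alignment term assumed rather than taken from the paper, the reward for a predicted source-image index and bounding box might look like:

```latex
% Hedged sketch of a rule-based spatial reward; the lambda weights and the
% IoU term are illustrative assumptions, not the paper's definition.
% \hat{k}, \hat{b}: predicted source-image index and box;
% k^{*}, b^{*}: ground truth inherited from the single-image annotation.
R(\hat{k}, \hat{b}) \;=\;
    \lambda_{\mathrm{fmt}} \,\mathbb{1}\!\left[\text{output follows the think-then-ground format}\right]
  + \lambda_{\mathrm{src}} \,\mathbb{1}\!\left[\hat{k} = k^{*}\right]
  + \lambda_{\mathrm{box}} \,\mathrm{IoU}\!\left(\hat{b},\, b^{*}\right)
```

Within GRPO, such a verifiable reward would be scored per sampled rollout and normalized against its group to form advantages, in the usual rule-based-reward fashion.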

What carries the argument

Compositional Grounded Contrast framework that generates multi-image training data from single-image grounding annotations via inter-image and intra-image contrasts, paired with a rule-based spatial reward inside GRPO.

If this is right

  • State-of-the-art results on MIG-Bench and VLM2-Bench for fine-grained multi-image understanding.
  • Transfer gains of 2.90 on MathVista, 2.88 on MuirBench, 1.93 on MMStar, 1.77 on MMMU, and 1.69 on BLINK over the Qwen3-VL-8B base model.
  • Reduced need for expensive human multi-image annotations or large-scale chain-of-thought data.
  • Improved source-image attribution, spatial alignment, and structured output validity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could substantially lower the cost of scaling multi-image capabilities by recycling existing single-image grounding datasets.
  • Similar contrast constructions might help with consistency problems in video or 3D scene understanding where object identity across frames matters.
  • The rule-based spatial reward could be ported to other reinforcement-learning setups that train models to cite evidence from visual inputs.
  • If the approach generalizes, it suggests that many grounding failures in multimodal models stem from missing contrastive signals rather than from insufficient model capacity.

Load-bearing premise

That contrastive data built from single-image annotations plus the spatial reward is sufficient to correct spatial hallucination, attention leakage, and object constancy failures without creating new biases or requiring extensive tuning.

What would settle it

An evaluation set of multi-image questions where distractors are deliberately chosen to violate the semantic decoupling or object-constancy assumptions used in training, then checking whether accuracy gains over the base model disappear.
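
One way to build that probe set, sketched below under the assumption that precomputed image embeddings (e.g., CLIP features) are available, is to invert the training-time selection rule and pick the most semantically similar image as the distractor, deliberately violating the decoupling assumption. The helper is illustrative, not from the paper.

```python
import numpy as np

def pick_violating_distractor(target_emb: np.ndarray,
                              pool_embs: np.ndarray) -> int:
    """Return the index of the pool image most similar to the target image,
    i.e. a distractor chosen to violate the semantic-decoupling assumption."""
    target = target_emb / np.linalg.norm(target_emb)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    similarities = pool @ target            # cosine similarity to each candidate
    return int(np.argmax(similarities))     # hardest, most entangled distractor
```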

Figures

Figures reproduced from arXiv: 2604.22498 by Hao Ma, Jiawei Chen, Lihao Zheng, Tao Wei, Xintian Shen, Yan Yang, Yu Zhou, Zhenwei Shao.

Figure 1: A qualitative and quantitative overview of the proposed CGC models.
Figure 2: Overview of the proposed CGC framework. Starting from single-image grounding annotations, CGC automatically con…
Figure 3: Data scaling law. Performance improves steadily…
Figure 4: Qualitative examples on MIG-Bench. Top: object constancy across an image sequence. CGC correctly tracks the…
Figure 5: Qualitative examples on VLM2-Bench. Top: fine-grained cross-image change attribution. CGC correctly determines…
Figure 6: A qualitative example on BLINK. The task requires selecting the point in the second image that best corresponds to a…
Figure 7: Qualitative examples on HallusionBench and MathVista. Top: in HallusionBench, CGC correctly judges that the two…
Figure 8: A qualitative example on MMMU (Part I). The task asks which candidate graph best matches the physical relationship…
Figure 9: A qualitative example on MMMU (Part II). Continuation of the MMMU case in Figure…
Figure 10: A qualitative example on MMStar. The task requires solving a visual analogy under 3D structural transformation.

Original abstract

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Compositional Grounded Contrast (CGC), a low-cost framework that builds compositional multi-image training instances from existing single-image grounding annotations via Inter-Image Contrast (semantically decoupled distractors) and Intra-Image Contrast (correlated cross-view samples), augmented by a Rule-Based Spatial Reward inside the GRPO optimization under a Think-before-Grounding paradigm. It reports state-of-the-art results on fine-grained multi-image benchmarks (MIG-Bench, VLM2-Bench) and consistent transfer gains over Qwen3-VL-8B on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).

Significance. If the reported gains prove robust under controlled evaluation, CGC would demonstrate an efficient way to leverage single-image annotations for multi-image spatial and compositional reasoning without expensive new data collection or CoT generation. The integration of rule-based rewards with GRPO is a concrete strength that could generalize to other grounding tasks.

major comments (2)
  1. [§3] §3 (Method): The central claim that Inter-Image Contrast and Intra-Image Contrast constructions, together with the Rule-Based Spatial Reward, jointly resolve spatial hallucination, attention leakage, and object constancy rests on the unverified assumption that simple composition from single-image annotations preserves fine-grained relations and avoids new leakage or bias. No explicit verification, semantic-decoupling metrics, or failure-mode-targeted ablations are described to confirm the generated instances actually test the targeted weaknesses rather than reinforce base-model errors.
  2. [§5] §5 (Experiments): The abstract states SOTA performance and specific transfer deltas, yet provides no details on experimental controls, baseline implementations, statistical significance testing, or post-hoc selection safeguards. This absence makes it impossible to assess whether the gains are attributable to the proposed components or to uncontrolled factors, directly undermining evaluation of the load-bearing performance claims.
minor comments (2)
  1. [§3.3] Notation for the GRPO reward components and the Think-before-Grounding paradigm could be formalized with explicit equations to improve reproducibility.
  2. [Figure 2] Figure captions for the contrast construction diagrams should explicitly label the source annotations and distractor selection heuristics.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest responses and committing to revisions that strengthen the manuscript without overstating our current results.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that Inter-Image Contrast and Intra-Image Contrast constructions, together with the Rule-Based Spatial Reward, jointly resolve spatial hallucination, attention leakage, and object constancy rests on the unverified assumption that simple composition from single-image annotations preserves fine-grained relations and avoids new leakage or bias. No explicit verification, semantic-decoupling metrics, or failure-mode-targeted ablations are described to confirm the generated instances actually test the targeted weaknesses rather than reinforce base-model errors.

    Authors: We agree that the manuscript would benefit from explicit verification of the contrast constructions. While the Inter-Image Contrast and Intra-Image Contrast are designed to introduce semantically decoupled distractors and correlated cross-view samples respectively, the original submission did not include quantitative semantic-decoupling metrics or failure-mode ablations. In the revised version, we will add (1) semantic-decoupling metrics such as average CLIP embedding cosine similarity between source and distractor images in inter-image pairs, and (2) targeted ablations measuring reduction in spatial hallucination and attention leakage on held-out failure cases. These additions will directly test whether the generated instances address the intended weaknesses. revision: yes

  2. Referee: [§5] §5 (Experiments): The abstract states SOTA performance and specific transfer deltas, yet provides no details on experimental controls, baseline implementations, statistical significance testing, or post-hoc selection safeguards. This absence makes it impossible to assess whether the gains are attributable to the proposed components or to uncontrolled factors, directly undermining evaluation of the load-bearing performance claims.

    Authors: We acknowledge that the current experimental section lacks the requested details on controls and statistical rigor. In the revision, we will expand §5 and the appendix to include: full specifications of baseline reproduction (including exact prompts, decoding parameters, and checkpoint versions), results across multiple random seeds with standard deviations and significance tests (e.g., paired t-tests), and explicit discussion of safeguards against post-hoc selection. These changes will allow clearer attribution of gains to the CGC components. revision: yes
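
As a rough sketch of the decoupling metric promised in response 1 (not the authors' implementation), mean CLIP cosine similarity between source and distractor images could be computed with the Hugging Face CLIP interface; the checkpoint name is one common choice, not necessarily the one the authors would use.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def decoupling_score(source_paths, distractor_paths):
    """Mean cosine similarity between source and distractor image embeddings;
    lower values indicate stronger semantic decoupling of the contrast pairs."""
    def embed(paths):
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    src, dis = embed(source_paths), embed(distractor_paths)
    return (src @ dis.T).mean().item()
```

The seed-level significance check promised in response 2 could be as simple as a paired t-test over matched-seed benchmark scores; the helper below is a minimal sketch, with no claim about the authors' exact protocol.

```python
from scipy.stats import ttest_rel

def seed_level_significance(base_scores, cgc_scores):
    """Paired t-test over per-seed scores from the base model and the CGC model;
    both lists must be ordered by the same random seeds."""
    t_stat, p_value = ttest_rel(cgc_scores, base_scores)
    return t_stat, p_value
```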

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper describes CGC as a compositional construction of multi-image training instances (Inter-Image Contrast and Intra-Image Contrast) directly from existing single-image grounding annotations, combined with a Rule-Based Spatial Reward inside the GRPO framework under a Think-before-Grounding paradigm. All reported gains (SOTA on MIG-Bench/VLM2-Bench and transfer improvements on MathVista etc.) are presented as outcomes of empirical experiments rather than mathematical predictions or quantities defined by the method's own fitted parameters. No self-definitional equations, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the central claims to the inputs by construction appear in the abstract or method description. The approach builds on established contrastive and RL techniques without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from contrastive learning and reinforcement learning in vision-language models; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

axioms (1)
  • domain assumption: Existing single-image grounding annotations can be repurposed to construct effective compositional multi-image training instances that mitigate spatial hallucination and object constancy issues.
    This underpins the core data construction step without new human annotations.

pith-pipeline@v0.9.0 · 5571 in / 1467 out tokens · 44285 ms · 2026-05-08T12:22:34.274352+00:00 · methodology

