Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Fangcong Yin; Greg Durrett; Liyan Tang

arxiv: 2607.02490 · v1 · pith:MEYQMSYHnew · submitted 2026-07-02 · 💻 cs.CL · cs.CV

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Liyan Tang , Fangcong Yin , Greg Durrett This is my paper

Pith reviewed 2026-07-03 14:15 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords vision-language modelsself-reflectionreinforcement learningvisual groundingout-of-distributionchain of thoughtexperience replay

0 comments

The pith

Reinforcement learning with masked prefixes and replay buffers trains vision-language models to ground self-reflection in visual inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models generate chains of thought but often ignore images when correcting their own mistakes, especially on unfamiliar images. The paper introduces VRRL, a reinforcement learning method that randomly masks trajectory prefixes and draws from an experience replay buffer to force the model to recover from errors using visual evidence. This is tested on visual grounding for tables and charts plus spatial navigation tasks. The result is higher out-of-distribution accuracy than standard reinforcement learning or reflection fine-tuning. A reader would care if this turns self-correction from a text habit into a visually reliable process.

Core claim

The central claim is that a reinforcement learning training framework called VRRL, using randomly masked trajectory prefixes and buffered roll-ins from an experience replay buffer, elicits visually grounded self-reflection in vision-language models and thereby substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines on visual grounding tasks.

What carries the argument

VRRL, the reinforcement learning framework that randomly masks trajectory prefixes to emphasize recovery from incorrect predictions and samples from an experience replay buffer to expose the model to diverse failure states.

If this is right

Models will translate feedback into corrections grounded in the actual image content rather than text patterns.
Accuracy on table, chart, and spatial navigation tasks will remain higher when images come from a shifted distribution.
Off-the-shelf and conventionally fine-tuned models will continue to degrade under shift while VRRL-trained models improve.
Self-reflection will become a mechanism for grounded error recovery instead of ungrounded textual revision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same masking and replay approach could be tested on other multimodal reasoning settings such as diagram-based math or video question answering.
If the method works by increasing visual attention, it might combine with attention-regularization losses for further gains.
The technique could reduce the amount of in-distribution data needed to maintain performance on new image styles.
Failure cases where the model still ignores the image would point to limits in how much replay alone can enforce grounding.

Load-bearing premise

Randomly masking prefixes and sampling from a replay buffer will cause the model to attend to visual inputs during reflection rather than learning textual recovery patterns.

What would settle it

Measure visual attention during the reflection phase on out-of-distribution images; if VRRL models show no higher visual attention or no accuracy gain over baselines, the claim fails.

Figures

Figures reproduced from arXiv: 2607.02490 by Fangcong Yin, Greg Durrett, Liyan Tang.

**Figure 1.** Figure 1: A motivating example of multi-turn reflection for visual grounding. In this illustrative task, the goal is [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: (a) Random Turn Masking masks gradient updates on a random number of prefix steps to avoid training on potentially erroneous steps. (b) Buffered Roll-In begins roll-outs from a potentially erroneous step in the replay buffer, enabling correction of this mistake. Note that partial rewards are assigned for correct reflection steps. will be evaluated. In the visual grounding task, for example, an answer propo… view at source ↗

**Figure 3.** Figure 3: Evaluation tasks for visually grounded self-reflection. (1) Visual Grounding. models are trained to localize headers of synthetic small tables. OOD tasks include generalization to larger tables (Large Table) queries of inner cell values rather than headers (Cell Query), and queries about bar charts (b), and scatter plots (c). (2) Spatial Navigation. The model is given a maze-style map and needs to find the… view at source ↗

**Figure 4.** Figure 4: Performance for multi-turn reflection across in-distribution and OOD tasks for visual grounding tasks. The top row illustrates the progression of cumulative accuracy across turns. Single-SFT → GRPO is a single-turn baseline method where its per-turn accuracy remains the same. The bottom row shows the percentage of examples reaching turn X. VRRL generally uses reflection better, improving more over iteratio… view at source ↗

**Figure 5.** Figure 5: Example output for Large Table question. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Figure [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Example output for Bar Chart question. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Figure [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: Figure [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: Example output for FrozenLake question. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: Figure [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

read the original abstract

Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VRRL adds prefix masking and replay buffers to push VLMs toward visually grounded reflection, but the abstract supplies no evidence that the gains come from vision rather than text recovery.

read the letter

The paper's main contribution is a reinforcement learning procedure called VRRL that combines random prefix masking on reasoning trajectories with roll-ins from an experience replay buffer. The goal is to make vision-language models correct their own errors by actually looking at the image instead of just patching the text.

This targets a clear practical issue: current models often ignore visual input when they reflect, which hurts performance on out-of-distribution tables, charts, and navigation scenes. The two components are simple and directly motivated by that failure mode. If they work as intended, the approach could be a useful addition to the toolkit for multimodal RL fine-tuning.

The soft spot is the missing link between the training signals and the claimed mechanism. Prefix masking and replay exposure are compatible with learning purely textual correction heuristics. The abstract reports better out-of-distribution accuracy than standard RL and reflection baselines, yet it gives no numbers, no ablation that removes visual features during reflection, no attention maps, and no text-only control. Without those, the OOD gains could come from improved text recovery alone. The central claim therefore rests on an assumption that is not yet isolated.

The work is aimed at people already running RL on LVLMs who care about robustness under distribution shift. A reader in that group could pick up the training recipe and test it themselves. It is coherent on its own terms and engages honestly with the literature on reflection and grounding.

I would send it to peer review. The problem is real and the proposed fix is specific enough to evaluate, but the experiments will need to address the visual-versus-text isolation question directly before the results can be trusted.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes VRRL, a reinforcement learning framework for vision-language models to elicit visually grounded self-reflection. It introduces two components: random masking of trajectory prefixes to emphasize recovery from incorrect intermediate predictions, and buffered roll-ins from an experience replay buffer to expose the model to diverse failure states. The approach is evaluated on visual grounding tasks involving tables/charts and spatial navigation benchmarks, with the claim that it substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.

Significance. If the OOD gains are shown to result specifically from visually grounded reflection (rather than textual recovery heuristics), the work could help address robustness limitations in current VLMs under distribution shift for multimodal reasoning. The RL-based training procedure with explicit recovery mechanisms is a direct response to observed failures in CoT-style reflection.

major comments (2)

[Abstract] Abstract: The central claim of 'substantially improves average out-of-distribution accuracy ... by using self-reflection effectively' is presented without any quantitative metrics, dataset sizes, statistical significance tests, or ablation results. This makes it impossible to assess the magnitude or reliability of the reported gains.
[Method and Experimental Evaluation] Method (training procedure) and Experimental Evaluation: The prefix-masking and replay-buffer mechanisms are compatible with purely textual error-recovery learning. No ablation is described that removes or corrupts visual features specifically during the reflection phase, no attention-map analysis compares visual vs. textual focus during reflection, and no text-only trajectory control is reported. Without such isolation, the OOD gains cannot be attributed to visual grounding rather than textual heuristics.

minor comments (1)

[Abstract] Abstract: The statement that 'off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift' lacks specific baseline names or degradation magnitudes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of 'substantially improves average out-of-distribution accuracy ... by using self-reflection effectively' is presented without any quantitative metrics, dataset sizes, statistical significance tests, or ablation results. This makes it impossible to assess the magnitude or reliability of the reported gains.

Authors: We agree that the abstract should include quantitative details to support the central claim. In the revised version, we will update the abstract to report the specific average OOD accuracy gains (with percentages), evaluation dataset sizes, and references to statistical significance and ablations from the main text. revision: yes
Referee: [Method and Experimental Evaluation] Method (training procedure) and Experimental Evaluation: The prefix-masking and replay-buffer mechanisms are compatible with purely textual error-recovery learning. No ablation is described that removes or corrupts visual features specifically during the reflection phase, no attention-map analysis compares visual vs. textual focus during reflection, and no text-only trajectory control is reported. Without such isolation, the OOD gains cannot be attributed to visual grounding rather than textual heuristics.

Authors: We acknowledge this is a valid point and a limitation of the current experiments. The mechanisms could in principle apply to text-only settings, and we lack explicit isolations of the visual modality during reflection. We will add the requested elements in revision: an ablation corrupting visual features during reflection, attention-map comparisons of visual vs. textual focus, and a text-only control where feasible. These will be used to better attribute gains to visual grounding. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL procedure with no derivations or load-bearing self-citations

full rationale

The paper proposes VRRL, an RL training framework using random prefix masking and experience replay buffer roll-ins to elicit visually grounded self-reflection. No equations, mathematical derivations, or parameter-fitting steps are present. Central claims rest on reported OOD accuracy improvements versus baselines, framed as experimental outcomes rather than reductions to inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the method. The derivation chain is self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no explicit free parameters, axioms, or invented entities; the approach rests on standard assumptions of reinforcement learning for language models and the premise that the proposed interventions will produce visual attention during reflection.

pith-pipeline@v0.9.1-grok · 5715 in / 1073 out tokens · 24375 ms · 2026-07-03T14:15:01.244844+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 20 canonical work pages · 6 internal anchors

[1]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
[2]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

2025
[3]

2025 , eprint=

Towards Understanding Visual Grounding in Visual Language Models , author=. 2025 , eprint=

2025
[4]

ArXiv , year=

Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection , author=. ArXiv , year=
[5]

The Fourteenth International Conference on Learning Representations , year=

Grounding Computer Use Agents on Human Demonstrations , author=. The Fourteenth International Conference on Learning Representations , year=
[6]

Proceedings of the Asian Conference on Computer Vision , pages=

Vision language models are blind , author=. Proceedings of the Asian Conference on Computer Vision , pages=
[7]

2026 , eprint=

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning , author=. 2026 , eprint=

2026
[8]

VisualWebBench: How Far Have Multimodal

Junpeng Liu and Yifan Song and Bill Yuchen Lin and Wai Lam and Graham Neubig and Yuanzhi Li and Xiang Yue , booktitle=. VisualWebBench: How Far Have Multimodal. 2024 , url=

2024
[9]

Navigating the Digital World as Humans Do: Universal Visual Grounding for

Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su , booktitle=. Navigating the Digital World as Humans Do: Universal Visual Grounding for. 2025 , url=

2025
[10]

2025 , eprint=

GTA1: GUI Test-time Scaling Agent , author=. 2025 , eprint=

2025
[11]

S ee C lick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Cheng, Kanzhi and Sun, Qiushi and Chu, Yougang and Xu, Fangzhi and YanTao, Li and Zhang, Jianbing and Wu, Zhiyong. S ee C lick: Harnessing GUI Grounding for Advanced Visual GUI Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.505

work page doi:10.18653/v1/2024.acl-long.505 2024
[12]

Proceedings of the 33rd ACM International Conference on Multimedia , pages =

Li, Kaixin and Meng, Ziyang and Lin, Hongzhan and Luo, Ziyang and Tian, Yuchen and Ma, Jing and Huang, Zhiyong and Chua, Tat-Seng , title =. Proceedings of the 33rd ACM International Conference on Multimedia , pages =. 2025 , isbn =. doi:10.1145/3746027.3755688 , abstract =

work page doi:10.1145/3746027.3755688 2025
[13]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Zhang, Xingxuan and Li, Jiansheng and Chu, Wenjing and hai, junjia and Xu, Renzhe and Yang, Yuqing and Guan, Shikai and Xu, Jiazheng and Jing, Liping and Cui, Peng , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

2025
[14]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Self-Refine: Iterative Refinement with Self-Feedback , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[15]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Reflexion: language agents with verbal reinforcement learning , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[16]

Zhang, Haoyu and Wu, Yuwei and Li, Pengxiang and Zhang, Xintong and Gao, Zhi and Gao, Rui and Gao, Mingyang and Sun, Che and Jia, Yunde , journal=
[17]

2025 , eprint=

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning , author=. 2025 , eprint=

2025
[18]

The Twelfth International Conference on Learning Representations , year=

Large Language Models Cannot Self-Correct Reasoning Yet , author=. The Twelfth International Conference on Learning Representations , year=
[19]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers , author=. arXiv preprint arXiv:2506.23918 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

2026 , url=

Mingyuan Wu and Jingcheng Yang and Jize Jiang and Meitang Li and Kaizhuo Yan and Hanchao Yu and Minjia Zhang and ChengXiang Zhai and Klara Nahrstedt , booktitle=. 2026 , url=

2026
[21]

Yang, Shuo and Niu, Yuwei and Liu, Yuyang and Ye, Yang and Lin, Bin and Yuan, Li , journal=
[22]

2025 , eprint=

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning , author=. 2025 , eprint=

2025
[23]

2025 , url=

Zhongwei Wan and Zhihao Dou and Che Liu and Yu Zhang and Dongfei Cui and Qinjian Zhao and Hui Shen and Jing Xiong and Yi Xin and Yifan Jiang and Chaofan Tao and Yangfan He and Mi Zhang and Shen Yan , booktitle=. 2025 , url=

2025
[24]

2024 , url=

Zhibin Gou and Zhihong Shao and Yeyun Gong and yelong shen and Yujiu Yang and Nan Duan and Weizhu Chen , booktitle=. 2024 , url=

2024
[25]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[26]

Learning to Refine with Fine-Grained Natural Language Feedback

Wadhwa, Manya and Zhao, Xinyu and Li, Junyi Jessy and Durrett, Greg. Learning to Refine with Fine-Grained Natural Language Feedback. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.716

work page doi:10.18653/v1/2024.findings-emnlp.716 2024
[27]

2025 , eprint=

Kimi k1.5: Scaling Reinforcement Learning with LLMs , author=. 2025 , eprint=

2025
[28]

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal

Zirui Wang and Mengzhou Xia and Luxi He and Howard Chen and Yitao Liu and Richard Zhu and Kaiqu Liang and Xindi Wu and Haotian Liu and Sadhika Malladi and Alexis Chevalier and Sanjeev Arora and Danqi Chen , booktitle=. CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal. 2024 , url=

2024
[29]

2025 , eprint=

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models , author=. 2025 , eprint=

2025
[30]

Wu, Qiucheng and Zhao, Handong and Saxon, Michael and Bui, Trung and Wang, William Yang and Zhang, Yang and Chang, Shiyu , booktitle=
[31]

The Fourteenth International Conference on Learning Representations , year=

Visual Planning: Let's Think Only with Images , author=. The Fourteenth International Conference on Learning Representations , year=
[32]

2024 , url=

Jiayu Wang and Yifei Ming and Zhenmei Shi and Vibhav Vineet and Xin Wang and Yixuan Li and Neel Joshi , booktitle=. 2024 , url=

2024
[33]

T able D reamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

Zheng, Mingyu and Feng, Zhifan and Wang, Jia and Wang, Lanrui and Lin, Zheng and Yang, Hao and Wang, Weiping. T able D reamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.381

work page doi:10.18653/v1/2025.findings-acl.381 2025
[34]

Proceedings of the 2025 2nd International Conference on Virtual Reality, Image and Signal Processing , pages =

Long, Hui and Yu, Haoli and Wang, Jingqiao and Long, Jian and Fang, Changxin , title =. Proceedings of the 2025 2nd International Conference on Virtual Reality, Image and Signal Processing , pages =. 2025 , isbn =. doi:10.1145/3772128.3772169 , abstract =

work page doi:10.1145/3772128.3772169 2025
[35]

Vision-Language Models Can Self-Improve Reasoning via Reflection

Cheng, Kanzhi and YanTao, Li and Xu, Fangzhi and Zhang, Jianbing and Zhou, Hao and Liu, Yang. Vision-Language Models Can Self-Improve Reasoning via Reflection. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v...

work page doi:10.18653/v1/2025.naacl-long.447 2025
[36]

CVPR , year=

Generation and Comprehension of Unambiguous Object Descriptions , author=. CVPR , year=
[37]

R efer I t G ame: Referring to Objects in Photographs of Natural Scenes

Kazemzadeh, Sahar and Ordonez, Vicente and Matten, Mark and Berg, Tamara. R efer I t G ame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ). 2014. doi:10.3115/v1/D14-1086

work page doi:10.3115/v1/d14-1086 2014
[38]

IJCV , volume=

Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , author=. IJCV , volume=
[39]

, booktitle=

Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L. , booktitle=. MAttNet: Modular Attention Network for Referring Expression Comprehension , year=
[40]

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts , author=. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

2025
[41]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

work page doi:10.1038/s41586-025-09422-z
[42]

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Revisor: Beyond textual reflection, towards multimodal introspective reasoning in long-form video understanding , author=. arXiv preprint arXiv:2511.13026 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[44]

2024 , url=

Hao Shao and Shengju Qian and Han Xiao and Guanglu Song and Zhuofan Zong and Letian Wang and Yu Liu and Hongsheng Li , booktitle=. 2024 , url=

2024
[45]

Su, Zhaochen and Li, Linjie and Song, Mingyang and Hao, Yunzhuo and Yang, Zhengyuan and Zhang, Jun and Chen, Guanjie and Gu, Jiawei and Li, Juntao and Qu, Xiaoye and others , journal=
[46]

2026 , url=

Jiacong Wang and Zijian Kang and Haochen Wang and LiangXiao and Ya Wang and Jiawen Li and Bohong Wu and Ran Jiao and Haiyong Jiang and ChaoFeng and Jun Xiao , booktitle=. 2026 , url=

2026
[47]

arXiv preprint arXiv:2512.08511 , year=

Thinking with Images via Self-Calling Agent , author=. arXiv preprint arXiv:2512.08511 , year=

work page arXiv
[48]

2025 , url=

Penghao Wu and Shengnan Ma and Bo Wang and Jiaheng Yu and Lewei Lu and Ziwei Liu , booktitle=. 2025 , url=

2025
[49]

2026 , url=

Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and JUNZHE HUANG and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Caiming Xiong and Junnan Li , booktitle=. 2026 , url=

2026
[50]

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding , author=. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

2025
[51]

OpenAI Gym

Greg Brockman and Vicki Cheung and Ludwig Pettersson and Jonas Schneider and John Schulman and Jie Tang and Wojciech Zaremba , year=. 1606.01540 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

2026 , eprint=

OpenAI GPT-5 System Card , author=. 2026 , eprint=

2026
[53]

2026 , url=

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

2026
[54]

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Jian, Pu and Wu, Junhong and Sun, Wei and Wang, Chen and Ren, Shuo and Zhang, Jiajun. Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.470

work page doi:10.18653/v1/2025.emnlp-main.470 2025
[55]

2026 , url=

Haozhe Wang and Chao Qu and Zuming Huang and Wei Chu and Fangzhen Lin and Wenhu Chen , booktitle=. 2026 , url=

2026
[56]

2026 , url=

Penghao Wu and Shengnan Ma and Bo Wang and Jiaheng Yu and Lewei Lu and Ziwei Liu , booktitle=. 2026 , url=

2026
[57]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Latent Chain-of-Thought for Visual Reasoning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[58]

2026 , eprint=

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning , author=. 2026 , eprint=

2026
[59]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Grounded Reinforcement Learning for Visual Reasoning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[60]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[61]

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=

GUI-G ^2 : Gaussian Reward Modeling for GUI Grounding , author=. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=
[62]

arXiv preprint arXiv:2505.14231 , year=

UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning , author=. arXiv preprint arXiv:2505.14231 , year=

work page arXiv
[63]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others , journal=
[64]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[65]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

Visual Instruction Tuning , url =

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , booktitle =. Visual Instruction Tuning , url =
[67]

2025 , url=

Dongzhi Jiang and Renrui Zhang and Ziyu Guo and Yanwei Li and Yu Qi and Xinyan Chen and Liuhui Wang and Jianhan Jin and Claire Guo and Shen Yan and Bo Zhang and Chaoyou Fu and Peng Gao and Hongsheng Li , booktitle=. 2025 , url=

2025
[68]

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine , author=. arXiv preprint arXiv:2603.06665 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[69]

Wu, Xueqing and Ding, Yuheng and Li, Bingxuan and Lu, Pan and Yin, Da and Chang, Kai-Wei and Peng, Nanyun , booktitle=
[70]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[71]

Zayne Sprague and Jack Lu and Manya Wadhwa and Sedrick Keh and Mengye Ren and Greg Durrett , booktitle =
[72]

C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Question Answering

Masry, Ahmed and Islam, Mohammed Saidul and Ahmed, Mahir and Bajaj, Aayush and Kabir, Firoz and Kartha, Aaryaman and Laskar, Md Tahmid Rahman and Rahman, Mizanur and Rahman, Shadikur and Shahmohammadi, Mehrad and Thakkar, Megh and Parvez, Md Rizwan and Hoque, Enamul and Joty, Shafiq. C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Questi...

work page doi:10.18653/v1/2025.findings-acl.978 2025
[73]

2026 , eprint=

Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? , author=. 2026 , eprint=

2026
[74]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Sherlock: Self-Correcting Reasoning in Vision-Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[75]

2025 , eprint=

Limits and Gains of Test-Time Scaling in Vision-Language Reasoning , author=. 2025 , eprint=

2025
[76]

The Fourteenth International Conference on Learning Representations , year=

Vision Language Models are Biased , author=. The Fourteenth International Conference on Learning Representations , year=
[77]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Bridge the Modality and Capability Gaps in Vision-Language Model Selection , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[78]

Transactions on Machine Learning Research , issn=

Multimodal Chain-of-Thought Reasoning in Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024
[79]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Luo, Tiange and Logeswaran, Lajanugen and Johnson, Justin and Lee, Honglak , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

2025
[80]

arXiv preprint arXiv:2603.27201 , year=

Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models , author=. arXiv preprint arXiv:2603.27201 , year=

work page arXiv

Showing first 80 references.

[1] [1]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

[2] [2]

2025 , eprint=

Qwen3-VL Technical Report , author=. 2025 , eprint=

2025

[3] [3]

2025 , eprint=

Towards Understanding Visual Grounding in Visual Language Models , author=. 2025 , eprint=

2025

[4] [4]

ArXiv , year=

Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection , author=. ArXiv , year=

[5] [5]

The Fourteenth International Conference on Learning Representations , year=

Grounding Computer Use Agents on Human Demonstrations , author=. The Fourteenth International Conference on Learning Representations , year=

[6] [6]

Proceedings of the Asian Conference on Computer Vision , pages=

Vision language models are blind , author=. Proceedings of the Asian Conference on Computer Vision , pages=

[7] [7]

2026 , eprint=

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning , author=. 2026 , eprint=

2026

[8] [8]

VisualWebBench: How Far Have Multimodal

Junpeng Liu and Yifan Song and Bill Yuchen Lin and Wai Lam and Graham Neubig and Yuanzhi Li and Xiang Yue , booktitle=. VisualWebBench: How Far Have Multimodal. 2024 , url=

2024

[9] [9]

Navigating the Digital World as Humans Do: Universal Visual Grounding for

Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su , booktitle=. Navigating the Digital World as Humans Do: Universal Visual Grounding for. 2025 , url=

2025

[10] [10]

2025 , eprint=

GTA1: GUI Test-time Scaling Agent , author=. 2025 , eprint=

2025

[11] [11]

S ee C lick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Cheng, Kanzhi and Sun, Qiushi and Chu, Yougang and Xu, Fangzhi and YanTao, Li and Zhang, Jianbing and Wu, Zhiyong. S ee C lick: Harnessing GUI Grounding for Advanced Visual GUI Agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.505

work page doi:10.18653/v1/2024.acl-long.505 2024

[12] [12]

Proceedings of the 33rd ACM International Conference on Multimedia , pages =

Li, Kaixin and Meng, Ziyang and Lin, Hongzhan and Luo, Ziyang and Tian, Yuchen and Ma, Jing and Huang, Zhiyong and Chua, Tat-Seng , title =. Proceedings of the 33rd ACM International Conference on Multimedia , pages =. 2025 , isbn =. doi:10.1145/3746027.3755688 , abstract =

work page doi:10.1145/3746027.3755688 2025

[13] [13]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Zhang, Xingxuan and Li, Jiansheng and Chu, Wenjing and hai, junjia and Xu, Renzhe and Yang, Yuqing and Guan, Shikai and Xu, Jiazheng and Jing, Liping and Cui, Peng , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

2025

[14] [14]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Self-Refine: Iterative Refinement with Self-Feedback , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[15] [15]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Reflexion: language agents with verbal reinforcement learning , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[16] [16]

Zhang, Haoyu and Wu, Yuwei and Li, Pengxiang and Zhang, Xintong and Gao, Zhi and Gao, Rui and Gao, Mingyang and Sun, Che and Jia, Yunde , journal=

[17] [17]

2025 , eprint=

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning , author=. 2025 , eprint=

2025

[18] [18]

The Twelfth International Conference on Learning Representations , year=

Large Language Models Cannot Self-Correct Reasoning Yet , author=. The Twelfth International Conference on Learning Representations , year=

[19] [19]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers , author=. arXiv preprint arXiv:2506.23918 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

2026 , url=

Mingyuan Wu and Jingcheng Yang and Jize Jiang and Meitang Li and Kaizhuo Yan and Hanchao Yu and Minjia Zhang and ChengXiang Zhai and Klara Nahrstedt , booktitle=. 2026 , url=

2026

[21] [21]

Yang, Shuo and Niu, Yuwei and Liu, Yuyang and Ye, Yang and Lin, Bin and Yuan, Li , journal=

[22] [22]

2025 , eprint=

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning , author=. 2025 , eprint=

2025

[23] [23]

2025 , url=

Zhongwei Wan and Zhihao Dou and Che Liu and Yu Zhang and Dongfei Cui and Qinjian Zhao and Hui Shen and Jing Xiong and Yi Xin and Yifan Jiang and Chaofan Tao and Yangfan He and Mi Zhang and Shen Yan , booktitle=. 2025 , url=

2025

[24] [24]

2024 , url=

Zhibin Gou and Zhihong Shao and Yeyun Gong and yelong shen and Yujiu Yang and Nan Duan and Weizhu Chen , booktitle=. 2024 , url=

2024

[25] [25]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[26] [26]

Learning to Refine with Fine-Grained Natural Language Feedback

Wadhwa, Manya and Zhao, Xinyu and Li, Junyi Jessy and Durrett, Greg. Learning to Refine with Fine-Grained Natural Language Feedback. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.716

work page doi:10.18653/v1/2024.findings-emnlp.716 2024

[27] [27]

2025 , eprint=

Kimi k1.5: Scaling Reinforcement Learning with LLMs , author=. 2025 , eprint=

2025

[28] [28]

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal

Zirui Wang and Mengzhou Xia and Luxi He and Howard Chen and Yitao Liu and Richard Zhu and Kaiqu Liang and Xindi Wu and Haotian Liu and Sadhika Malladi and Alexis Chevalier and Sanjeev Arora and Danqi Chen , booktitle=. CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal. 2024 , url=

2024

[29] [29]

2025 , eprint=

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models , author=. 2025 , eprint=

2025

[30] [30]

Wu, Qiucheng and Zhao, Handong and Saxon, Michael and Bui, Trung and Wang, William Yang and Zhang, Yang and Chang, Shiyu , booktitle=

[31] [31]

The Fourteenth International Conference on Learning Representations , year=

Visual Planning: Let's Think Only with Images , author=. The Fourteenth International Conference on Learning Representations , year=

[32] [32]

2024 , url=

Jiayu Wang and Yifei Ming and Zhenmei Shi and Vibhav Vineet and Xin Wang and Yixuan Li and Neel Joshi , booktitle=. 2024 , url=

2024

[33] [33]

T able D reamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

Zheng, Mingyu and Feng, Zhifan and Wang, Jia and Wang, Lanrui and Lin, Zheng and Yang, Hao and Wang, Weiping. T able D reamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.381

work page doi:10.18653/v1/2025.findings-acl.381 2025

[34] [34]

Proceedings of the 2025 2nd International Conference on Virtual Reality, Image and Signal Processing , pages =

Long, Hui and Yu, Haoli and Wang, Jingqiao and Long, Jian and Fang, Changxin , title =. Proceedings of the 2025 2nd International Conference on Virtual Reality, Image and Signal Processing , pages =. 2025 , isbn =. doi:10.1145/3772128.3772169 , abstract =

work page doi:10.1145/3772128.3772169 2025

[35] [35]

Vision-Language Models Can Self-Improve Reasoning via Reflection

Cheng, Kanzhi and YanTao, Li and Xu, Fangzhi and Zhang, Jianbing and Zhou, Hao and Liu, Yang. Vision-Language Models Can Self-Improve Reasoning via Reflection. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v...

work page doi:10.18653/v1/2025.naacl-long.447 2025

[36] [36]

CVPR , year=

Generation and Comprehension of Unambiguous Object Descriptions , author=. CVPR , year=

[37] [37]

R efer I t G ame: Referring to Objects in Photographs of Natural Scenes

Kazemzadeh, Sahar and Ordonez, Vicente and Matten, Mark and Berg, Tamara. R efer I t G ame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ). 2014. doi:10.3115/v1/D14-1086

work page doi:10.3115/v1/d14-1086 2014

[38] [38]

IJCV , volume=

Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , author=. IJCV , volume=

[39] [39]

, booktitle=

Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L. , booktitle=. MAttNet: Modular Attention Network for Referring Expression Comprehension , year=

[40] [40]

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts , author=. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

2025

[41] [41]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

work page doi:10.1038/s41586-025-09422-z

[42] [42]

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Revisor: Beyond textual reflection, towards multimodal introspective reasoning in long-form video understanding , author=. arXiv preprint arXiv:2511.13026 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[44] [44]

2024 , url=

Hao Shao and Shengju Qian and Han Xiao and Guanglu Song and Zhuofan Zong and Letian Wang and Yu Liu and Hongsheng Li , booktitle=. 2024 , url=

2024

[45] [45]

Su, Zhaochen and Li, Linjie and Song, Mingyang and Hao, Yunzhuo and Yang, Zhengyuan and Zhang, Jun and Chen, Guanjie and Gu, Jiawei and Li, Juntao and Qu, Xiaoye and others , journal=

[46] [46]

2026 , url=

Jiacong Wang and Zijian Kang and Haochen Wang and LiangXiao and Ya Wang and Jiawen Li and Bohong Wu and Ran Jiao and Haiyong Jiang and ChaoFeng and Jun Xiao , booktitle=. 2026 , url=

2026

[47] [47]

arXiv preprint arXiv:2512.08511 , year=

Thinking with Images via Self-Calling Agent , author=. arXiv preprint arXiv:2512.08511 , year=

work page arXiv

[48] [48]

2025 , url=

Penghao Wu and Shengnan Ma and Bo Wang and Jiaheng Yu and Lewei Lu and Ziwei Liu , booktitle=. 2025 , url=

2025

[49] [49]

2026 , url=

Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and JUNZHE HUANG and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Caiming Xiong and Junnan Li , booktitle=. 2026 , url=

2026

[50] [50]

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding , author=. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

2025

[51] [51]

OpenAI Gym

Greg Brockman and Vicki Cheung and Ludwig Pettersson and Jonas Schneider and John Schulman and Jie Tang and Wojciech Zaremba , year=. 1606.01540 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

2026 , eprint=

OpenAI GPT-5 System Card , author=. 2026 , eprint=

2026

[53] [53]

2026 , url=

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

2026

[54] [54]

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Jian, Pu and Wu, Junhong and Sun, Wei and Wang, Chen and Ren, Shuo and Zhang, Jiajun. Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.470

work page doi:10.18653/v1/2025.emnlp-main.470 2025

[55] [55]

2026 , url=

Haozhe Wang and Chao Qu and Zuming Huang and Wei Chu and Fangzhen Lin and Wenhu Chen , booktitle=. 2026 , url=

2026

[56] [56]

2026 , url=

Penghao Wu and Shengnan Ma and Bo Wang and Jiaheng Yu and Lewei Lu and Ziwei Liu , booktitle=. 2026 , url=

2026

[57] [57]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Latent Chain-of-Thought for Visual Reasoning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[58] [58]

2026 , eprint=

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning , author=. 2026 , eprint=

2026

[59] [59]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Grounded Reinforcement Learning for Visual Reasoning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[60] [60]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[61] [61]

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=

GUI-G ^2 : Gaussian Reward Modeling for GUI Grounding , author=. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year=

[62] [62]

arXiv preprint arXiv:2505.14231 , year=

UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning , author=. arXiv preprint arXiv:2505.14231 , year=

work page arXiv

[63] [63]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others , journal=

[64] [64]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[65] [65]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[66] [66]

Visual Instruction Tuning , url =

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , booktitle =. Visual Instruction Tuning , url =

[67] [67]

2025 , url=

Dongzhi Jiang and Renrui Zhang and Ziyu Guo and Yanwei Li and Yu Qi and Xinyan Chen and Liuhui Wang and Jianhan Jin and Claire Guo and Shen Yan and Bo Zhang and Chaoyou Fu and Peng Gao and Hongsheng Li , booktitle=. 2025 , url=

2025

[68] [68]

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine , author=. arXiv preprint arXiv:2603.06665 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[69] [69]

Wu, Xueqing and Ding, Yuheng and Li, Bingxuan and Lu, Pan and Yin, Da and Chang, Kai-Wei and Peng, Nanyun , booktitle=

[70] [70]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[71] [71]

Zayne Sprague and Jack Lu and Manya Wadhwa and Sedrick Keh and Mengye Ren and Greg Durrett , booktitle =

[72] [72]

C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Question Answering

Masry, Ahmed and Islam, Mohammed Saidul and Ahmed, Mahir and Bajaj, Aayush and Kabir, Firoz and Kartha, Aaryaman and Laskar, Md Tahmid Rahman and Rahman, Mizanur and Rahman, Shadikur and Shahmohammadi, Mehrad and Thakkar, Megh and Parvez, Md Rizwan and Hoque, Enamul and Joty, Shafiq. C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Questi...

work page doi:10.18653/v1/2025.findings-acl.978 2025

[73] [73]

2026 , eprint=

Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? , author=. 2026 , eprint=

2026

[74] [74]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Sherlock: Self-Correcting Reasoning in Vision-Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[75] [75]

2025 , eprint=

Limits and Gains of Test-Time Scaling in Vision-Language Reasoning , author=. 2025 , eprint=

2025

[76] [76]

The Fourteenth International Conference on Learning Representations , year=

Vision Language Models are Biased , author=. The Fourteenth International Conference on Learning Representations , year=

[77] [77]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Bridge the Modality and Capability Gaps in Vision-Language Model Selection , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[78] [78]

Transactions on Machine Learning Research , issn=

Multimodal Chain-of-Thought Reasoning in Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

2024

[79] [79]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Luo, Tiange and Logeswaran, Lajanugen and Johnson, Justin and Lee, Honglak , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

2025

[80] [80]

arXiv preprint arXiv:2603.27201 , year=

Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models , author=. arXiv preprint arXiv:2603.27201 , year=

work page arXiv