Recognition: 3 theorem links · Lean Theorem
Perceptual Flow Network for Visually Grounded Reasoning
Pith reviewed 2026-05-08 18:37 UTC · model grok-4.3
The pith
Perceptual Flow Network improves visually grounded reasoning by decoupling perception from reasoning and shaping the perceptual process with variational reinforcement learning rather than with rigid expert priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PFlowNet decouples perception from reasoning to create a self-conditioned generation process, then integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning. This produces reasoning-oriented perceptual behaviors while preserving visual reliability and yields a provable performance guarantee along with new state-of-the-art scores on V* Bench and MME-RealWorld-lite.
What carries the argument
The self-conditioned generation process in PFlowNet, which decouples perception from reasoning and applies vicinal geometric shaping through variational reinforcement learning to avoid rigid alignment with expert priors.
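To make the shaping mechanism concrete, below is a minimal sketch of the vicinal shaping weight quoted in the theorem-link excerpts further down this page, ω_λ(z_{0:k}, E) := exp(−λ·1[d_IoU(r_{1:k}, E) > ε]) with λ = 4.5 and ε = 0.5, applied as a gate on a multi-dimensional reward. The function names, the choice of d_IoU as one minus the best box IoU, and the way the reward dimensions are aggregated are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def vicinal_weight(pred_boxes, expert_boxes, lam=4.5, eps=0.5):
    """exp(-lam * 1[d_IoU > eps]): penalize a perceptual trajectory only
    when it leaves the vicinal neighborhood of the expert regions."""
    # Assumption: d_IoU is taken as one minus the best IoU against any expert
    # box, averaged over predicted regions; the paper's exact distance is not
    # given in this excerpt.
    d_iou = np.mean([1.0 - max(iou(p, e) for e in expert_boxes)
                     for p in pred_boxes])
    return np.exp(-lam * float(d_iou > eps))

def shaped_reward(task_rewards, pred_boxes, expert_boxes):
    """One simple way to gate multi-dimensional task rewards by the vicinal
    weight; the paper's exact aggregation rule is not given in this excerpt."""
    return sum(task_rewards) * vicinal_weight(pred_boxes, expert_boxes)
```

Because the term inside the exponential is a hard indicator, any perceptual trajectory within the ε-vicinity of the expert regions receives the same weight, which matches the stated aim of rewarding reasoning utility without rigid alignment to the expert prior.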
If this is right
- The approach delivers a provable performance guarantee for the resulting model.
- It reaches new state-of-the-art performance of 90.6 percent on V* Bench.
- It reaches new state-of-the-art performance of 67.0 percent on MME-RealWorld-lite.
- It enables reasoning-oriented perceptual behaviors while keeping visual outputs reliable.
Where Pith is reading between the lines
- The decoupling strategy might allow separate tuning of perception modules in other multimodal systems without retraining the entire model.
- Variational reinforcement learning for shaping perceptual flows could extend to balancing competing objectives in non-visual language tasks.
- The method suggests a route to reduce hallucinations by prioritizing reasoning utility over strict geometric matching in additional visual benchmarks.
Load-bearing premise
That geometric priors from visual experts are suboptimal for reasoning utility and that vicinal geometric shaping via variational reinforcement learning will produce superior perceptual behaviors without reducing visual reliability.
What would settle it
An experiment in which models trained with rigid geometric priors from visual experts achieve higher accuracy than PFlowNet on the V* Bench or MME-RealWorld-lite benchmarks would undermine the central claim.
Original abstract
Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Perceptual Flow Network (PFlowNet) to address limitations in Large Vision-Language Models (LVLMs) where standard MLE optimization leads to language bias and hallucination. It observes that geometric priors from visual experts are suboptimal for reasoning utility due to bias toward geometric precision. PFlowNet decouples perception from reasoning via a self-conditioned generation process and integrates multi-dimensional rewards with vicinal geometric shaping using variational reinforcement learning. This is claimed to produce reasoning-oriented perceptual behaviors while preserving visual reliability, delivering a provable performance guarantee and new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Significance. If the claimed provable guarantee can be rigorously established under clearly stated assumptions and the SOTA empirical results are reproducible with proper controls for backbone choice and hyperparameter tuning, the work would offer a meaningful alternative to rigid expert-prior alignment in grounded reasoning tasks. The decoupling of perception and reasoning plus the variational RL formulation with vicinal shaping could influence methods for reducing hallucinations in LVLMs, provided the guarantee applies to downstream reasoning utility rather than only the surrogate objective.
major comments (2)
- [Abstract] The central claim that 'PFlowNet delivers a provable performance guarantee' is load-bearing for the paper's novelty, yet the abstract provides no statement of what is proven (e.g., convergence rate, bound on hallucination rate, or optimality of the decoupled flow), no assumptions (e.g., bounded reward variance, Lipschitz continuity of the shaping term, or properties of the self-conditioned distribution), and no proof sketch or derivation. This prevents evaluation of whether the guarantee supports attributing the reported SOTA numbers to the proposed mechanism.
- [Abstract] The superiority claim rests on eschewing rigid alignment with expert geometric priors in favor of variational RL with vicinal shaping and multi-dimensional rewards, but the abstract gives no indication of how the guarantee reduces to the fitted parameters or how the empirical results on V* Bench and MME-RealWorld-lite isolate the effect of the proposed shaping from backbone or tuning choices.
minor comments (1)
- [Abstract] The term 'vicinal geometric shaping' is introduced without a brief definition or a reference to its precise formulation, which may hinder immediate understanding of the method's novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our theoretical and empirical contributions.
Point-by-point responses
- Referee: [Abstract] The central claim that 'PFlowNet delivers a provable performance guarantee' is load-bearing for the paper's novelty, yet the abstract provides no statement of what is proven (e.g., convergence rate, bound on hallucination rate, or optimality of the decoupled flow), no assumptions (e.g., bounded reward variance, Lipschitz continuity of the shaping term, or properties of the self-conditioned distribution), and no proof sketch or derivation. This prevents evaluation of whether the guarantee supports attributing the reported SOTA numbers to the proposed mechanism.
Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the theoretical claim. The full manuscript contains the detailed analysis in Section 4, which establishes a bound on the expected reasoning utility of the self-conditioned perceptual flow. The proof relies on standard variational RL convergence arguments under the assumptions of bounded reward variance and Lipschitz continuity of the vicinal shaping term. We will revise the abstract to include a concise statement of the proven guarantee, the key assumptions, and a pointer to the proof section. revision: yes
- Referee: [Abstract] The superiority claim rests on eschewing rigid alignment with expert geometric priors in favor of variational RL with vicinal shaping and multi-dimensional rewards, but the abstract gives no indication of how the guarantee reduces to the fitted parameters or how the empirical results on V* Bench and MME-RealWorld-lite isolate the effect of the proposed shaping from backbone or tuning choices.
Authors: The abstract is a high-level summary; the reduction of the guarantee to the learned parameters is derived explicitly in the variational objective of Section 4. For the empirical results, Section 5.3 reports controlled ablations that isolate the vicinal shaping and multi-dimensional reward components while holding the backbone model and hyperparameter settings fixed. We will add a sentence to the abstract noting that the reported SOTA numbers are supported by these ablations and the theoretical analysis. revision: yes
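For readers trying to evaluate the guarantee being discussed, the only concrete statement of it on this page is the total-variation bound quoted in the theorem-link excerpts below. Restated in LaTeX for readability (the symbols q, s_V, and Z_λ are taken from the quoted text and are not defined in this excerpt):

```latex
D_{\mathrm{TV}}\big(p_{\theta^\star}(\cdot \mid X),\, P_V(\cdot \mid X, Y)\big)
\;\le\; \frac{1}{2 Z_\lambda}\Big( q\,\lvert s_V - Z_\lambda\rvert
  + (1-q)\,\lvert e^{-\lambda} s_V - Z_\lambda\rvert
  + e^{-\lambda}\,(1 - s_V) \Big)
```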
Circularity Check
No circularity detected; the claims rest on an asserted guarantee without self-referential reduction.
Full rationale
The abstract asserts a 'provable performance guarantee' and SOTA results from decoupling perception, multi-dimensional rewards, and variational RL with vicinal shaping, but supplies no equations, derivations, or self-citations that reduce the guarantee or empirical claims to fitted inputs or prior author results by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method is described as building on (then modifying) external geometric priors, which is independent of the target claims. This is the common honest case of a self-contained high-level description.
Axiom & Free-Parameter Ledger
invented entities (1)
- Perceptual Flow Network (PFlowNet): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (J-cost uniqueness) · theorem washburn_uniqueness_aczel · match: unclear · paper excerpt: "PFlowNet ... integrates a multi-dimensional reward function with vicinal geometric shaping via variational reinforcement learning ... Sub-Trajectory Balance (SubTB)"
- Foundation/AlphaCoordinateFixation.lean (parameter-free α=1 pin) · theorem alpha_pin_under_high_calibration · match: unclear · paper excerpt: "ω_λ(z_{0:k}, E) := exp(−λ·1[d_IoU(r_{1:k}, E) > ε]) ... we set λ=4.5, ε=0.5 in this work."
- Foundation/AbsoluteFloorClosure.lean · theorem absolute_floor_iff_bare_distinguishability · match: unclear · paper excerpt: "D_TV(p_{θ⋆}(·|X), P_V(·|X,Y)) ≤ (1/2Z_λ)·(q|s_V−Z_λ| + (1−q)|e^{−λ}s_V−Z_λ| + e^{−λ}(1−s_V))"
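The first excerpt above names Sub-Trajectory Balance (SubTB) as the objective carried by the variational reinforcement learning step. For orientation, here is a minimal sketch of the generic SubTB(λ) loss from the GFlowNet literature, with the standard flow parameterization and geometric sub-trajectory weighting; the paper's exact variant is not specified in this excerpt.

```python
import torch

def subtb_loss(log_F, log_PF, log_PB, lam=0.9):
    """Sub-Trajectory Balance loss, SubTB(lambda), for one trajectory.

    log_F  : (n+1,) log state-flow estimates for s_0 .. s_n
             (by convention log_F[n] equals the log reward at the terminal state).
    log_PF : (n,)   log forward-policy probabilities  P_F(s_{t+1} | s_t)
    log_PB : (n,)   log backward-policy probabilities P_B(s_t | s_{t+1})
    lam    : geometric weight on sub-trajectory length.
    """
    n = log_PF.shape[0]
    # Prefix sums so each sub-trajectory residual is formed in O(1).
    cum_PF = torch.cat([torch.zeros(1), torch.cumsum(log_PF, dim=0)])
    cum_PB = torch.cat([torch.zeros(1), torch.cumsum(log_PB, dim=0)])
    loss, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = lam ** (j - i)
            # Balance residual for the sub-trajectory s_i -> ... -> s_j:
            # log F(s_i) + sum log P_F  should equal  log F(s_j) + sum log P_B.
            resid = (log_F[i] + (cum_PF[j] - cum_PF[i])
                     - log_F[j] - (cum_PB[j] - cum_PB[i]))
            loss = loss + w * resid ** 2
            weight_sum += w
    return loss / weight_sum
```

In the GFlowNet convention the terminal log-flow log_F[n] is set to the log reward, so driving every sub-trajectory residual to zero propagates the (shaped) reward back through the perceptual trajectory.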