Recognition: 2 Lean theorem links
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
Pith reviewed 2026-05-13 20:44 UTC · model grok-4.3
The pith
Reinforcement learning post-training boosts multimodal reasoning even when models must rely on hallucination due to corrupted visual inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. This is shown through experiments on multiple multimodal reasoning benchmarks using modality-specific corruptions that remove essential visual information, thereby forcing hallucination-based reasoning. The findings indicate that hallucination plays a more significant role in RL training dynamics than previously recognized.
What carries the argument
The Hallucination-as-Cue Framework, which uses hallucination-inductive, modality-specific corruptions to force and study hallucination-driven reasoning during RL post-training.
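The review summary does not pin down the corruption operators. As a rough, non-authoritative sketch of what "remove or replace essential information" could mean on the visual modality, consider the following; the blank-out and noise operators and the `corrupt` helper are illustrative guesses, not the paper's implementation:

```python
# Hypothetical sketch of hallucination-inductive, modality-specific corruptions.
# "Remove" destroys the visual evidence entirely; "replace" substitutes
# uninformative content of the same shape. Neither operator is taken from the
# paper; both are plausible instances of what the abstract describes.
import numpy as np
from PIL import Image

def remove_visual_info(img: Image.Image) -> Image.Image:
    """Remove: replace the image with a uniform mid-gray canvas."""
    return Image.new("RGB", img.size, (128, 128, 128))

def replace_visual_info(img: Image.Image) -> Image.Image:
    """Replace: substitute i.i.d. pixel noise with the same dimensions."""
    noise = np.random.randint(0, 256, (img.height, img.width, 3), dtype=np.uint8)
    return Image.fromarray(noise)

def corrupt(img: Image.Image, mode: str = "remove") -> Image.Image:
    """Apply one corruption before an RL rollout or an evaluation pass."""
    return remove_visual_info(img) if mode == "remove" else replace_visual_info(img)
```

Applied at both training and evaluation time, any accuracy the model retains cannot come from the image itself, which is what licenses reading residual gains as hallucination-driven.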
If this is right
- RL can improve reasoning performance without access to complete visual information.
- Hallucination contributes substantially to the effectiveness of RL post-training in MLLMs.
- Existing multimodal reasoning datasets may contain properties that favor hallucination over grounded reasoning.
- Training designs should consider modality-aware approaches to better leverage or mitigate hallucination effects.
Where Pith is reading between the lines
- Similar corruption techniques could be used to evaluate other training methods like supervised fine-tuning for their reliance on hallucination.
- This implies that scaling RL post-training might amplify hallucination benefits, potentially requiring new safeguards.
- Applications in real-world vision-language tasks may need to balance performance gains from hallucination with accuracy requirements.
Load-bearing premise
The modality-specific corruptions isolate hallucination-based reasoning without introducing unrelated artifacts that could drive the performance gains independently.
What would settle it
Observing no performance improvement or even degradation when RL is applied under these hallucination-inductive corruptions on a new benchmark would falsify the claim.
Original abstract
The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Hallucination-as-Cue Framework to investigate RL post-training of multimodal large language models (MLLMs). It applies modality-specific corruptions that remove or replace essential visual information during both training and evaluation, forcing reliance on hallucination. Experiments across multiple benchmarks claim that RL under these purely hallucination-inductive settings still yields significant reasoning improvements, sometimes outperforming standard training, implying hallucination plays a larger role in multimodal RL than previously recognized and motivating modality-aware training designs.
Significance. If the corruptions are shown to isolate hallucination without confounding artifacts and the performance gains are statistically robust, the work would be significant for challenging assumptions about visual grounding in MLLM RL post-training. The broad experimental scope across benchmarks provides a useful diagnostic lens for training dynamics and dataset properties. Credit is due for the reproducible experimental setup implied by the multi-benchmark evaluation.
Major comments (3)
- [Abstract] The central claim that RL post-training under purely hallucination-inductive settings can outperform standard training lacks any mention of statistical controls, baseline comparisons, or quantitative hallucination-rate measurements, which are required to attribute gains to the intended mechanism rather than reward-landscape changes.
- [§3, Hallucination-as-Cue Framework] The framework defines modality-specific corruptions to force hallucination-based reasoning, but provides no ablation or analysis demonstrating that these operations do not independently flatten the reward landscape or introduce spurious patterns that RL could exploit without true hallucination.
- [§4, Experiments] Performance tables or figures reporting improvements under corrupted settings contain no controls (e.g., hallucination-rate verification or corruption-type ablations) to rule out non-hallucination artifacts driving the observed gains, which is load-bearing for the claim that hallucination is the operative factor.
Minor comments (2)
- [Abstract] Specify the exact number and names of the multimodal reasoning benchmarks used so readers can assess coverage.
- Ensure all figures clearly label the corruption types (remove vs. replace) and include error bars or significance markers for the reported improvements.
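On the error-bar request, one standard, paper-agnostic choice is a nonparametric bootstrap over per-example correctness scores; the sketch below is illustrative and not taken from the manuscript:

```python
# Bootstrap confidence interval over per-example correctness (0/1 scores).
# Standard nonparametric bootstrap; an illustration, not the paper's procedure.
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Return (mean accuracy, (lower, upper)) at confidence level 1 - alpha."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```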
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments. We address each major comment below, providing clarifications and indicating revisions where necessary to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The central claim that RL post-training under purely hallucination-inductive settings can outperform standard training lacks any mention of statistical controls, baseline comparisons, or quantitative hallucination-rate measurements, which are required to attribute gains to the intended mechanism rather than reward-landscape changes.
Authors: We agree that the abstract should better highlight the controls and measurements supporting our claims. The manuscript includes baseline comparisons to standard RL training and reports performance improvements across multiple benchmarks. We also measure hallucination rates by comparing responses on corrupted vs. original inputs. In the revised version, we will update the abstract to explicitly mention these statistical controls, baseline comparisons, and hallucination-rate verifications to better attribute the gains to the hallucination mechanism. Revision: yes.
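The rebuttal leaves the hallucination-rate measurement abstract. One plausible operationalization of "comparing responses on corrupted vs. original inputs," with `model.generate` and the dataset interface as hypothetical placeholders rather than the paper's API, is:

```python
# One plausible operationalization of the comparison the authors describe:
# count a response as hallucination-driven when the model commits to the same
# content-specific answer with and without the visual evidence. `model` and
# `dataset` are hypothetical placeholders, not the paper's actual interface.
def hallucination_rate(model, dataset, corrupt_fn) -> float:
    hallucinated = 0
    for ex in dataset:
        clean = model.generate(ex.image, ex.question).strip()
        corrupted = model.generate(corrupt_fn(ex.image), ex.question).strip()
        if corrupted == clean:  # same committed answer without the evidence
            hallucinated += 1
    return hallucinated / len(dataset)
```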
Referee: [§3, Hallucination-as-Cue Framework] The framework defines modality-specific corruptions to force hallucination-based reasoning, but provides no ablation or analysis demonstrating that these operations do not independently flatten the reward landscape or introduce spurious patterns that RL could exploit without true hallucination.
Authors: The framework employs modality-specific corruptions such as object removal and attribute replacement to eliminate essential visual cues. To address potential confounding factors, we performed ablations across different corruption strategies and observed consistent performance trends, which suggests the improvements are not solely due to reward-landscape flattening or spurious patterns. We will add a dedicated subsection in §3 discussing these ablations and their implications for ruling out non-hallucination artifacts. Revision: partial.
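As a sketch of the corruption-type ablation the authors describe, with `train_rl` and `evaluate` passed in as hypothetical stand-ins for the RL pipeline and benchmark harness:

```python
# Hypothetical ablation loop over corruption strategies: if accuracy gains
# persist across operators with different artifact profiles, reward-landscape
# quirks specific to one operator are less likely to explain the improvement.
from functools import partial

def corruption_ablation(base_model, train_set, benchmark, train_rl, evaluate):
    """Run the RL pipeline once per corruption operator and compare accuracy."""
    results = {}
    for mode in ("remove", "replace"):
        corrupt_fn = partial(corrupt, mode=mode)  # from the earlier sketch
        policy = train_rl(base_model, train_set, corrupt_fn)
        results[mode] = evaluate(policy, benchmark, corrupt_fn)
    return results
```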
Referee: [§4, Experiments] Performance tables or figures reporting improvements under corrupted settings contain no controls (e.g., hallucination-rate verification or corruption-type ablations) to rule out non-hallucination artifacts driving the observed gains, which is load-bearing for the claim that hallucination is the operative factor.
Authors: In §4 and the appendix, we provide corruption-type ablations and verify increased hallucination rates through qualitative and quantitative analysis of model outputs on corrupted data. These controls demonstrate that the gains persist across corruption types. We will revise the main text of §4 to feature these controls more prominently and add statistical significance tests to the tables. Revision: yes.
Circularity Check
No significant circularity; empirical framework is self-contained
Full rationale
The paper defines an empirical Hallucination-as-Cue Framework that applies modality-specific corruptions during RL post-training and evaluates resulting performance on external multimodal benchmarks. No equations, fitted parameters, or derivations are presented that reduce the central claims (e.g., RL gains under hallucination-inductive settings) to inputs by construction. No self-citations serve as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no known results are merely renamed. The derivation chain rests on independent experimental interventions and benchmark measurements rather than self-referential definitions or statistical forcing, satisfying the criteria for a non-circular analysis.
Axiom & Free-Parameter Ledger
Axioms (1)
- [domain assumption] The hallucination-inductive corruptions remove or replace essential visual information required for correct answers.
Invented entities (1)
- Hallucination-as-Cue Framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information... forcing the model to reason by hallucination"
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: the illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. SFT or RL? An early investigation into training R1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025.
- [4] Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025.
- [5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [6] Bowen Dong, Minheng Ni, Zitong Huang, Guanglei Yang, Wangmeng Zuo, and Lei Zhang. Mirage: Assessing hallucination in multimodal reasoning chains of MLLM. arXiv preprint arXiv:2505.24238, 2025.
- [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [8] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
- [9] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [10] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- [11] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
- [12] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
- [13] Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang. Med-R1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939, 2025.
- [14] Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Deli Zhao, et al. MMR1: Advancing the frontiers of multimodal reasoning, 2025.
- [15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [17] Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, and Sheng Liu. More thinking, less seeing? Assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523, 2025.
- [18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [19] Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. NoisyRollout: Reinforcing visual reasoning with data augmentation. arXiv preprint arXiv:2504.13055, 2025.
- [20] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
- [21] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021.
- [22] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- [23] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages …, 2025.
- [24] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017.
- [25] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.
- [26] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025.
- [27] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- [28] Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315, 2025.
- [29] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025.
- [30] Jiaqi Wang, Kevin Qinghong Lin, James Cheng, and Mike Zheng Shou. Think or not? Selective reasoning via reinforcement learning for vision-language models. arXiv preprint arXiv:2505.16854, 2025.
- [31] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-Vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024.
- [32] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [33] Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. SoTA with less: MCTS-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025.
- [34] Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, and Lichao Sun. First SFT, second RL, third UPT: Continual improving multi-modal LLM reasoning via unsupervised post-training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [35] Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, and Min Zhang. Mitigating hallucinations in large vision-language models via entity-centric multimodal preference optimization. arXiv preprint arXiv:2506.04039, 2025.
- [36] Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, and Jiaya Jia. VisionThink: Smart and efficient vision language model via reinforcement learning. arXiv preprint arXiv:2507.13348, 2025.
- [37] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-OneVision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025.
- [38] Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218, 2025.
- [39] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319, 2024.
- [40] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. SLCA: Slow learner with classifier alignment for continual learning on a pre-trained model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19148–19158, 2023.
- [41] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024.
- [42] Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1662, 2025.
- [43] Yaowei Zheng et al. EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025.
- [44] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- [45] Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, et al. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. arXiv preprint arXiv:2410.03577, 2024.
Discussion (0)