Improving Vision-Language Models with Perception-Centric Process Reward Models
Pith reviewed 2026-05-08 04:29 UTC · model grok-4.3
The pith
Perceval provides token-level perceptual error detection to refine vision-language model reasoning chains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perceval is trained on perception-intensive data to extract image-related claims from a response and compare them one by one with the visual evidence, returning the claims that contain perceptual errors. When plugged into RL training, it replaces sequence-level advantages with token-level penalties applied only to the hallucinated spans it identifies. The same model also enables test-time scaling: erroneous portions of an output are truncated and the model either regenerates directly or is induced to reflect, and the process can be repeated for further gains.
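A minimal sketch of the test-time refinement loop described above, under assumed interfaces: `policy.generate` and `perceval.find_first_error` are hypothetical stand-ins for the paper's VLM and PRM, and the reflection prompt wording is invented for illustration.

```python
# Hypothetical sketch of Perceval-guided test-time scaling: truncate at the
# first flagged claim, optionally induce reflection, then regenerate.
# `policy` and `perceval` interfaces are assumptions, not the authors' API.

REFLECT_HINT = "\nWait, the previous statement may misread the image. Let me look again.\n"

def refine_response(policy, perceval, image, question, max_rounds=3, reflect=True):
    response = policy.generate(image, question)
    for _ in range(max_rounds):
        span = perceval.find_first_error(image, response)  # (start, end) offsets or None
        if span is None:
            return response                       # no perceptual error detected
        prefix = response[: span[0]]              # drop the erroneous claim and everything after it
        if reflect:
            prefix += REFLECT_HINT                # nudge the model to re-examine the image
        # `generate` is assumed to return only the continuation of `prefix`.
        response = prefix + policy.generate(image, question, prefix=prefix)
    return response
```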
What carries the argument
Perceval, a process reward model that performs claim extraction followed by visual verification to ground perceptual errors at the token level.
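A compact sketch of the two-stage check this refers to: claim extraction followed by per-claim verification against the image, returning character spans that can later be mapped to token positions. All names and interfaces are assumptions for illustration.

```python
# Sketch of claim extraction + visual verification, under assumed interfaces.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    span: tuple          # (start, end) character offsets in the response

def perceptual_error_spans(prm, image, response):
    claims = prm.extract_claims(response)            # image-related claims with offsets
    bad_spans = []
    for claim in claims:
        if prm.verify(image, claim.text) == "contradicted":
            bad_spans.append(claim.span)             # claim conflicts with the image
    return bad_spans                                 # spans later mapped to token-level penalties
```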
Load-bearing premise
Supervised training on perception-intensive data produces a reward model that can reliably spot perceptual errors inside long, multi-step VLM reasoning without introducing new biases or harming non-visual reasoning steps.
What would settle it
A controlled RL run on a benchmark with independently human-annotated perceptual-error locations in which adding Perceval's token-level penalties yields no accuracy gain, or scores below standard sequence-level GRPO.
Original abstract
Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Perceval, a process reward model (PRM) for vision-language models that extracts image-related claims from VLM responses and verifies them one by one against visual evidence to identify perceptual errors. Perceval is trained on perception-intensive supervised data and integrated into RL training (replacing GRPO's sequence-level advantages with token-level penalties on hallucinated spans). It is also applied at test time for iterative error correction via truncation, regeneration, or reflection. The authors claim significant benchmark improvements across multiple reasoning VLMs and superior test-time scaling versus majority voting, with code and data to be released.
Significance. If the central mechanism proves reliable, the work could advance fine-grained, perception-centric supervision in multimodal RL, addressing the coarseness of outcome-level rewards for hallucination correction in VLMs. Public release of code and data would support reproducibility and follow-up research in the RLVR and VLM communities.
major comments (3)
- [§3] §3 (Perceval architecture): The claim-extraction and one-by-one visual-verification procedure is described only at a high level; no architecture details, verification implementation (e.g., additional VLM call, grounding model, or similarity metric), or error analysis on multi-step chains are supplied. This is load-bearing because the resulting token-level advantage signal directly replaces GRPO's sequence-level signal and any false positives/negatives will propagate into the policy update.
- [Experiments] Experiments section: No quantitative evaluation of Perceval itself (precision/recall on perceptual errors, robustness on out-of-distribution reasoning traces) or ablations on the token-level advantage scaling factor appear. Without these, the reported benchmark gains cannot be attributed to the perception-centric mechanism rather than other training choices.
- [§4.2] §4.2 (RL integration): The method applies token-level penalties only to spans labeled erroneous by Perceval, yet no analysis is given of how this affects non-perceptual reasoning steps or whether it introduces new biases; this directly tests the weakest assumption in the argument.
minor comments (2)
- [Abstract] The abstract uses 'major voting' where 'majority voting' is the conventional term; a minor terminology clarification would improve readability.
- [§4] Notation for the token-level advantage computation is introduced informally; an explicit equation contrasting it with GRPO would aid clarity (one plausible formalization is sketched below).
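One plausible formalization of the contrast the comment asks for, offered only as a reading aid; the symbols, in particular the penalty scale λ and the flagged-token set ℰᵢ, are not the paper's notation.

```latex
% GRPO assigns each rollout o_i one group-normalized advantage, shared by all tokens:
%   \hat{A}_{i,t} = \hat{A}_i = (r_i - \mathrm{mean}_j r_j) / \mathrm{std}_j r_j.
% A token-level variant subtracts a penalty on tokens inside spans flagged by the PRM:
\[
  \hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}_j r_j}{\operatorname{std}_j r_j}
  \;-\; \lambda \,\mathbb{1}\!\left[t \in \mathcal{E}_i\right],
  \qquad \lambda > 0,
\]
% where \mathcal{E}_i is the set of token positions inside hallucinated spans
% identified by the PRM and \lambda is the token-level advantage scaling factor.
```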
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our paper. We value the feedback on the need for greater transparency in Perceval's design and evaluation. In our revision, we will provide expanded details on the architecture, add quantitative assessments of Perceval, and include analyses of the RL integration effects. These changes will directly address the concerns and improve the manuscript's clarity and rigor.
Point-by-point responses
-
Referee: [§3] §3 (Perceval architecture): The claim-extraction and one-by-one visual-verification procedure is described only at a high level; no architecture details, verification implementation (e.g., additional VLM call, grounding model, or similarity metric), or error analysis on multi-step chains are supplied. This is load-bearing because the resulting token-level advantage signal directly replaces GRPO's sequence-level signal and any false positives/negatives will propagate into the policy update.
Authors: We acknowledge that Section 3 provides a high-level overview of the claim-extraction and verification process. To address this, we will revise the manuscript to include a detailed description of the architecture, including the specific implementation of verification (which involves an additional VLM call for one-by-one claim checking against the image using direct prompting for evidence comparison, without a separate grounding model). We will also add an error analysis for multi-step chains, with examples illustrating how false positives and negatives are managed in generating the token-level advantages. This will clarify the reliability of the supervision signal. revision: yes
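A minimal sketch of the verification call as described in this response (one additional VLM call per claim with direct prompting); the prompt wording and answer parsing are assumptions, not the authors' exact setup.

```python
# Per-claim verification via a single extra VLM call, as described above.
# Prompt text and parsing are illustrative assumptions.

VERIFY_PROMPT = (
    "Look carefully at the image and judge the statement below.\n"
    "Statement: {claim}\n"
    "Answer exactly one word: 'supported' or 'contradicted'."
)

def verify_claims(verifier, image, claims):
    flagged = []
    for claim in claims:
        answer = verifier.generate(image, VERIFY_PROMPT.format(claim=claim.text))
        if "contradicted" in answer.lower():
            flagged.append(claim)   # perceptual error: claim conflicts with visual evidence
    return flagged
```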
-
Referee: [Experiments] Experiments section: No quantitative evaluation of Perceval itself (precision/recall on perceptual errors, robustness on out-of-distribution reasoning traces) or ablations on the token-level advantage scaling factor appear. Without these, the reported benchmark gains cannot be attributed to the perception-centric mechanism rather than other training choices.
Authors: We agree that evaluating Perceval independently is essential to attribute the benchmark improvements. In the revised version, we will augment the Experiments section with quantitative results for Perceval, such as precision and recall metrics evaluated on perceptual error detection, and robustness tests using out-of-distribution reasoning traces. We will also present ablations on the token-level advantage scaling factor to isolate its effect. These additions will strengthen the causal link between the perception-centric mechanism and the observed gains. revision: yes
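A sketch of the kind of span-level precision/recall evaluation promised here, assuming human-annotated gold error spans and treating any character overlap as a match (the matching criterion is an assumption).

```python
# Span-level precision/recall for flagged perceptual errors against gold annotations.
# Spans are (start, end) character offsets; any overlap counts as a match (an assumption).

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def span_precision_recall(predicted, gold):
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return precision, recall
```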
-
Referee: [§4.2] §4.2 (RL integration): The method applies token-level penalties only to spans labeled erroneous by Perceval, yet no analysis is given of how this affects non-perceptual reasoning steps or whether it introduces new biases; this directly tests the weakest assumption in the argument.
Authors: We recognize the importance of analyzing the selective application of penalties. Perceval targets only perceptual errors, so non-perceptual reasoning steps are not penalized. In the revision, we will expand §4.2 with an analysis of its effects, including breakdowns of how penalties influence different reasoning components and an assessment of potential biases (e.g., via performance comparisons on perceptual-heavy vs. logic-heavy tasks). We will discuss any introduced biases and our mitigation approaches, such as careful threshold selection for error detection. revision: yes
Circularity Check
No circularity; method relies on external supervised data and standard RL
full rationale
The paper proposes Perceval as a PRM trained on perception-intensive supervised data, then integrated into RL training to apply token-level penalties on hallucinated spans identified by the model. No equations, derivations, or first-principles claims are presented that reduce to fitted parameters, self-definitions, or self-citations by construction. The central mechanism (claim extraction and visual comparison) is implemented via supervised training and standard RL modifications to GRPO, with claimed gains supported by experiments rather than tautological reduction to inputs. This is a standard empirical method paper without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- token-level advantage scaling factor
axioms (1)
- domain assumption: process-level perceptual supervision yields better policy improvement than outcome-level rewards for VLMs
invented entities (1)
- Perceval (no independent evidence)