Improving Vision-Language Models with Perception-Centric Process Reward Models
Pith reviewed 2026-05-08 04:29 UTC · model grok-4.3
The pith
Perceval provides token-level perceptual error detection to refine vision-language model reasoning chains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perceval is trained on perception-intensive data to extract image-related claims from a response and compare them one by one with the visual evidence, returning the claims that contain perceptual errors. When plugged into RL training, it replaces sequence-level advantages with token-level penalties applied only to the hallucinated spans it identifies. The same model also enables test-time scaling: erroneous portions of an output are truncated and the model either regenerates directly or is induced to reflect, and the process can be repeated for further gains.
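A minimal sketch of the test-time refinement loop described above, under assumed interfaces: `policy.generate` and `perceval.find_first_error` are hypothetical stand-ins for the paper's VLM and PRM, and the reflection prompt wording is invented for illustration.

```python
# Hypothetical sketch of Perceval-guided test-time scaling: truncate at the
# first flagged claim, optionally induce reflection, then regenerate.
# `policy` and `perceval` interfaces are assumptions, not the authors' API.

REFLECT_HINT = "\nWait, the previous statement may misread the image. Let me look again.\n"

def refine_response(policy, perceval, image, question, max_rounds=3, reflect=True):
    response = policy.generate(image, question)
    for _ in range(max_rounds):
        span = perceval.find_first_error(image, response)  # (start, end) offsets or None
        if span is None:
            return response                       # no perceptual error detected
        prefix = response[: span[0]]              # drop the erroneous claim and everything after it
        if reflect:
            prefix += REFLECT_HINT                # nudge the model to re-examine the image
        # `generate` is assumed to return only the continuation of `prefix`.
        response = prefix + policy.generate(image, question, prefix=prefix)
    return response
```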
What carries the argument
Perceval, a process reward model that performs claim extraction followed by visual verification to ground perceptual errors at the token level.
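A compact sketch of the two-stage check this refers to: claim extraction followed by per-claim verification against the image, returning character spans that can later be mapped to token positions. All names and interfaces are assumptions for illustration.

```python
# Sketch of claim extraction + visual verification, under assumed interfaces.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    span: tuple          # (start, end) character offsets in the response

def perceptual_error_spans(prm, image, response):
    claims = prm.extract_claims(response)            # image-related claims with offsets
    bad_spans = []
    for claim in claims:
        if prm.verify(image, claim.text) == "contradicted":
            bad_spans.append(claim.span)             # claim conflicts with the image
    return bad_spans                                 # spans later mapped to token-level penalties
```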
Load-bearing premise
Supervised training on perception-intensive data produces a reward model that can reliably spot perceptual errors inside long, multi-step VLM reasoning without introducing new biases or harming non-visual reasoning steps.
What would settle it
A controlled RL run on a benchmark with independently human-annotated perceptual-error locations in which adding Perceval's token-level penalties yields no accuracy gain, or scores below standard sequence-level GRPO.
Original abstract
Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Perceval, a process reward model (PRM) for vision-language models that extracts image-related claims from VLM responses and verifies them one by one against visual evidence to identify perceptual errors. Perceval is trained on perception-intensive supervised data and integrated into RL training (replacing GRPO's sequence-level advantages with token-level penalties on hallucinated spans). It is also applied at test time for iterative error correction via truncation, regeneration, or reflection. The authors claim significant benchmark improvements across multiple reasoning VLMs and superior test-time scaling versus majority voting, with code and data to be released.
Significance. If the central mechanism proves reliable, the work could advance fine-grained, perception-centric supervision in multimodal RL, addressing the coarseness of outcome-level rewards for hallucination correction in VLMs. Public release of code and data would support reproducibility and follow-up research in the RLVR and VLM communities.
major comments (3)
- [§3] §3 (Perceval architecture): The claim-extraction and one-by-one visual-verification procedure is described only at a high level; no architecture details, verification implementation (e.g., additional VLM call, grounding model, or similarity metric), or error analysis on multi-step chains are supplied. This is load-bearing because the resulting token-level advantage signal directly replaces GRPO's sequence-level signal and any false positives/negatives will propagate into the policy update.
- [Experiments] Experiments section: No quantitative evaluation of Perceval itself (precision/recall on perceptual errors, robustness on out-of-distribution reasoning traces) or ablations on the token-level advantage scaling factor appear. Without these, the reported benchmark gains cannot be attributed to the perception-centric mechanism rather than other training choices.
- [§4.2] §4.2 (RL integration): The method applies token-level penalties only to spans labeled erroneous by Perceval, yet no analysis is given of how this affects non-perceptual reasoning steps or whether it introduces new biases; this directly tests the weakest assumption in the argument.
minor comments (2)
- [Abstract] The abstract uses 'major voting' where 'majority voting' is the conventional term; a minor terminology clarification would improve readability.
- [§4] Notation for the token-level advantage computation is introduced informally; an explicit equation contrasting it with GRPO would aid clarity (one plausible formalization is sketched below).
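One plausible formalization of the contrast the comment asks for, offered only as a reading aid; the symbols, in particular the penalty scale λ and the flagged-token set ℰᵢ, are not the paper's notation.

```latex
% GRPO assigns each rollout o_i one group-normalized advantage, shared by all tokens:
%   \hat{A}_{i,t} = \hat{A}_i = (r_i - \mathrm{mean}_j r_j) / \mathrm{std}_j r_j.
% A token-level variant subtracts a penalty on tokens inside spans flagged by the PRM:
\[
  \hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}_j r_j}{\operatorname{std}_j r_j}
  \;-\; \lambda \,\mathbb{1}\!\left[t \in \mathcal{E}_i\right],
  \qquad \lambda > 0,
\]
% where \mathcal{E}_i is the set of token positions inside hallucinated spans
% identified by the PRM and \lambda is the token-level advantage scaling factor.
```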
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our paper. We value the feedback on the need for greater transparency in Perceval's design and evaluation. In our revision, we will provide expanded details on the architecture, add quantitative assessments of Perceval, and include analyses of the RL integration effects. These changes will directly address the concerns and improve the manuscript's clarity and rigor.
Point-by-point responses
-
Referee: [§3] §3 (Perceval architecture): The claim-extraction and one-by-one visual-verification procedure is described only at a high level; no architecture details, verification implementation (e.g., additional VLM call, grounding model, or similarity metric), or error analysis on multi-step chains are supplied. This is load-bearing because the resulting token-level advantage signal directly replaces GRPO's sequence-level signal and any false positives/negatives will propagate into the policy update.
Authors: We acknowledge that Section 3 provides a high-level overview of the claim-extraction and verification process. To address this, we will revise the manuscript to include a detailed description of the architecture, including the specific implementation of verification (which involves an additional VLM call for one-by-one claim checking against the image using direct prompting for evidence comparison, without a separate grounding model). We will also add an error analysis for multi-step chains, with examples illustrating how false positives and negatives are managed in generating the token-level advantages. This will clarify the reliability of the supervision signal. revision: yes
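A minimal sketch of the verification call as described in this response (one additional VLM call per claim with direct prompting); the prompt wording and answer parsing are assumptions, not the authors' exact setup.

```python
# Per-claim verification via a single extra VLM call, as described above.
# Prompt text and parsing are illustrative assumptions.

VERIFY_PROMPT = (
    "Look carefully at the image and judge the statement below.\n"
    "Statement: {claim}\n"
    "Answer exactly one word: 'supported' or 'contradicted'."
)

def verify_claims(verifier, image, claims):
    flagged = []
    for claim in claims:
        answer = verifier.generate(image, VERIFY_PROMPT.format(claim=claim.text))
        if "contradicted" in answer.lower():
            flagged.append(claim)   # perceptual error: claim conflicts with visual evidence
    return flagged
```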
-
Referee: [Experiments] Experiments section: No quantitative evaluation of Perceval itself (precision/recall on perceptual errors, robustness on out-of-distribution reasoning traces) or ablations on the token-level advantage scaling factor appear. Without these, the reported benchmark gains cannot be attributed to the perception-centric mechanism rather than other training choices.
Authors: We agree that evaluating Perceval independently is essential to attribute the benchmark improvements. In the revised version, we will augment the Experiments section with quantitative results for Perceval, such as precision and recall metrics evaluated on perceptual error detection, and robustness tests using out-of-distribution reasoning traces. We will also present ablations on the token-level advantage scaling factor to isolate its effect. These additions will strengthen the causal link between the perception-centric mechanism and the observed gains. revision: yes
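A sketch of the kind of span-level precision/recall evaluation promised here, assuming human-annotated gold error spans and treating any character overlap as a match (the matching criterion is an assumption).

```python
# Span-level precision/recall for flagged perceptual errors against gold annotations.
# Spans are (start, end) character offsets; any overlap counts as a match (an assumption).

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def span_precision_recall(predicted, gold):
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return precision, recall
```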
-
Referee: [§4.2] §4.2 (RL integration): The method applies token-level penalties only to spans labeled erroneous by Perceval, yet no analysis is given of how this affects non-perceptual reasoning steps or whether it introduces new biases; this directly tests the weakest assumption in the argument.
Authors: We recognize the importance of analyzing the selective application of penalties. Perceval targets only perceptual errors, so non-perceptual reasoning steps are not penalized. In the revision, we will expand §4.2 with an analysis of its effects, including breakdowns of how penalties influence different reasoning components and an assessment of potential biases (e.g., via performance comparisons on perceptual-heavy vs. logic-heavy tasks). We will discuss any introduced biases and our mitigation approaches, such as careful threshold selection for error detection. revision: yes
Circularity Check
No circularity; method relies on external supervised data and standard RL
full rationale
The paper proposes Perceval as a PRM trained on perception-intensive supervised data, then integrated into RL training to apply token-level penalties on hallucinated spans identified by the model. No equations, derivations, or first-principles claims are presented that reduce to fitted parameters, self-definitions, or self-citations by construction. The central mechanism (claim extraction and visual comparison) is implemented via supervised training and standard RL modifications to GRPO, with claimed gains supported by experiments rather than tautological reduction to inputs. This is a standard empirical method paper without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- token-level advantage scaling factor
axioms (1)
- domain assumption: process-level perceptual supervision yields better policy improvement than outcome-level rewards for VLMs
invented entities (1)
- Perceval (no independent evidence)