pith. sign in

arxiv: 2606.23543 · v1 · pith:EQUFQFKWnew · submitted 2026-06-22 · 💻 cs.AI · cs.CL· cs.CV· cs.LG

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Pith reviewed 2026-06-26 08:32 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.LG
keywords multimodal mathematical reasoningverifiable data scalingevolution-instructvisual reasoningreinforcement learningdata verificationGRPO
0
0 comments X

The pith

Decoupling prompt evolution from answer verification enables reliable scaling of training data for visual mathematical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scaling reinforcement learning for visual math requires keeping reward labels reliable even as data volume grows, rather than assuming the labeller is trustworthy. It separates two scaling axes before any policy training: prompt difficulty, which is increased through route-specific evolution operators applied to image-question seeds, and answer reliability, which is enforced by an offline verifier that only accepts an answer once multiple sources of counter-evidence fail to refute it. The resulting verified data can be used directly in existing RL recipes and extends by adding new evolution routes or verifier channels. Experiments show that growing the evolved SFT data from 10K to 250K samples lifts mean accuracy from 35.42 to 54.73 on a five-benchmark suite, and the full VeriEvol pipeline adds +3.88 over an un-evolved RL baseline when backbone and recipe are held fixed.

Core claim

VeriEvol is an iterative framework that first applies a type-aware evolution module to rewrite low-difficulty image-question seeds into harder, image-grounded prompts and then passes candidate answers through an HTV-Agent verifier that accepts them only after multi-source counter-evidence has failed to refute them. Scaling the verified evolved SFT data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73; with backbone, SFT initialization, and GRPO recipe fixed, the pipeline contributes a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 is attributable to the evolved prompts and +2.06 to the verifier.

What carries the argument

The HTV-Agent verifier that accepts an answer only after multi-source counter-evidence has failed to refute it, together with the type-aware evolution module that rewrites low-difficulty seeds into harder image-grounded prompts.

If this is right

  • Scaling evolved SFT data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73 on the five-benchmark visual-math suite.
  • With backbone and GRPO recipe fixed, VeriEvol adds +3.88 over an un-evolved RL baseline.
  • Of the +3.88 gain, +1.82 is attributable to the evolved prompts and +2.06 to the HTV-Agent verifier.
  • The verified data extends by adding new evolution routes or additional verifier channels.
  • The full verifier trace released for every sample allows downstream auditing and further scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of evolution and verification could be applied to scale reliable data in other multimodal reasoning domains such as science or coding.
  • Releasing complete verifier traces may enable independent development of stronger or cheaper verifiers by the community.
  • If the verifier remains reliable at even larger scales, the approach could support training runs with millions of verified visual-math examples.

Load-bearing premise

The HTV-Agent verifier can keep answer labels reliable at large scale without introducing systematic false accepts or false rejects that would corrupt the training signal.

What would settle it

A measurement showing that the verifier's false-accept or false-reject rate rises sharply once the dataset exceeds 100K samples, or an ablation in which replacing the verifier with a weaker labeler eliminates the reported +2.06 gain.

read the original abstract

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VeriEvol, an iterative framework that decouples prompt difficulty scaling via type-aware evolution operators from answer reliability via the HTV-Agent verifier (offline multi-source hypothesis-test falsification). It reports that scaling verified SFT data from 10K to 250K samples lifts mean accuracy on a five-benchmark visual-math suite from 35.42 to 54.73; with backbone, SFT init, and GRPO recipe fixed, VeriEvol yields a cumulative +3.88 over an un-evolved RL baseline, decomposed as +1.82 from evolved prompts and +2.06 from the verifier. The work releases prompts, data, models, code, and full verifier traces.

Significance. If the verifier's error rate remains controlled at 250K scale, the decoupling of evolution routes from verifiable labeling supplies a practical route to higher-quality RL data for multimodal math without assuming trusted labellers. The explicit release of verifier traces for every sample is a concrete strength that enables downstream auditing and extension.

major comments (2)
  1. [abstract and §3.2 (HTV-Agent description)] The attribution of +2.06 to HTV-Agent (abstract) is load-bearing for the central claim yet rests on the untested assumption that offline hypothesis-test falsification maintains stable precision/recall as prompt diversity grows; no held-out quantitative evaluation of false-accept or false-reject rates, nor any scaling analysis of verifier error with data volume, is supplied.
  2. [abstract and experimental results section] The reported decomposition (+1.82 prompts, +2.06 verifier) requires an ablation that isolates each component while holding the other fixed; the abstract states the numbers but supplies neither the corresponding table rows nor statistical significance tests for the deltas.
minor comments (2)
  1. [abstract] The five-benchmark suite and exact metric (mean accuracy) should be named explicitly in the abstract rather than referenced only as 'five-benchmark visual-math suite'.
  2. [§3] Notation for evolution routes and verifier channels is introduced without a compact summary table; a single table listing each route, its operator, and the verifier channels would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on VeriEvol. The two major comments highlight important aspects of evidence strength for the verifier contribution and the reported decomposition. We address each point below and commit to revisions that strengthen the manuscript without altering the core claims.

read point-by-point responses
  1. Referee: [abstract and §3.2 (HTV-Agent description)] The attribution of +2.06 to HTV-Agent (abstract) is load-bearing for the central claim yet rests on the untested assumption that offline hypothesis-test falsification maintains stable precision/recall as prompt diversity grows; no held-out quantitative evaluation of false-accept or false-reject rates, nor any scaling analysis of verifier error with data volume, is supplied.

    Authors: We agree that direct held-out metrics on verifier error rates would provide stronger grounding for attributing the +2.06 gain specifically to HTV-Agent rather than to downstream effects. The reported gain is measured via the controlled RL performance delta (evolved+verified vs. evolved+unverified data) with all other factors fixed, and the release of full verifier traces enables external auditing. However, we acknowledge the absence of explicit precision/recall scaling curves. In revision we will add a held-out evaluation set, report false-accept and false-reject rates, and include a scaling plot of verifier error versus data volume. revision: yes

  2. Referee: [abstract and experimental results section] The reported decomposition (+1.82 prompts, +2.06 verifier) requires an ablation that isolates each component while holding the other fixed; the abstract states the numbers but supplies neither the corresponding table rows nor statistical significance tests for the deltas.

    Authors: The decomposition is obtained from two controlled ablations described in the experimental results section: one holding the verifier fixed while varying prompt evolution, and one holding evolved prompts fixed while varying verification. We agree that presenting these as explicit table rows with significance tests would improve clarity and allow readers to assess the deltas directly. In the revision we will add a dedicated ablation table containing the isolated contributions together with bootstrap confidence intervals or paired significance tests for each reported delta. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical scaling results are externally benchmarked

full rationale

The paper reports measured accuracy lifts on five external visual-math benchmarks when scaling evolved SFT data from 10K to 250K and when adding the HTV-Agent verifier under fixed GRPO. No equations, uniqueness theorems, or self-citations are invoked to derive the gains; the +1.82 / +2.06 decomposition is obtained by ablation with held-fixed backbone and recipe. The verifier reliability is an empirical assumption whose validity is left to the released traces rather than enforced by definition. This is a standard empirical pipeline paper with no load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified or required for the empirical claims.

pith-pipeline@v0.9.1-grok · 5840 in / 1167 out tokens · 29585 ms · 2026-06-26T08:32:33.051233+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 19 linked inside Pith

  1. [1]

    WizardLM: Empowering large language models to follow complex instructions

    CanXu,QingfengSun,KaiZheng,XiuboGeng,PuZhao,JiazhanFeng,ChongyangTao,andDaxinJiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

  2. [2]

    DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    Kimi k1.5: Scaling reinforcement learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, and others. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  4. [4]

    MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyang Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proceedings of ICLR, 2024

  5. [5]

    MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In Proceedings of ECCV, 2024

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In Proceedings of ECCV, 2024

  6. [6]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, and others. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of CVPR, 2024

  7. [7]

    Measuring multimodal mathematical reasoning with MATH-Vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-Vision dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  8. [8]

    OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of ACL, 2024. 14

  9. [9]

    DynaMath: A dynamic visual benchmarkforevaluatingmathematicalreasoningrobustnessofvisionlanguagemodels

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmarkforevaluatingmathematicalreasoningrobustnessofvisionlanguagemodels. InInternationalConference on Learning Representations (ICLR), 2025

  10. [10]

    We-Math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

  11. [11]

    MV-MATH: Evaluating multimodal math reasoning in multi-visual contexts

    Peijie Wang, Zhong-Zhi Li, Fei Yin, Xin Yang, Dekang Ran, and Cheng-Lin Liu. MV-MATH: Evaluating multimodal math reasoning in multi-visual contexts. In Proceedings of CVPR, 2025

  12. [12]

    M3Kang: Evaluating multilingual multimodal mathematical reasoning in vision-language models

    Aleix Torres-Camps, Nathaniel Mitrani Hadida, Víctor Conchello Vendrell, Àlex Batlle Casellas, Arnau Padrés Masdemont, and Jordi Ros-Giralt. M3Kang: Evaluating multilingual multimodal mathematical reasoning in vision-language models. arXiv preprint arXiv:2601.16218, 2026

  13. [13]

    Qwen2.5-VL technical report

    Qwen Team. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  14. [14]

    InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  15. [15]

    Honey-Data-15M: A large-scale open multimodal instruction-tuning dataset

    Open-Bee Team. Honey-Data-15M: A large-scale open multimodal instruction-tuning dataset. https:// huggingface.co/datasets/Open-Bee/Honey-Data-15M, 2025

  16. [16]

    MMFineReason: Closing the multimodal reasoning gap via open data-centric methods

    Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, and Lijun Wu. MMFineReason: Closing the multimodal reasoning gap via open data-centric methods. arXiv preprint arXiv:2601.21821, 2026

  17. [17]

    MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. MAmmoTH-VL: Eliciting multimodal reasoning with instruction tuning at scale. In Proceedings of ACL, 2025

  18. [18]

    VisualWebInstruct: Scaling up multimodal instruction data through web search

    Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, and Wenhu Chen. VisualWebInstruct: Scaling up multimodal instruction data through web search. arXiv preprint arXiv:2503.10582, 2025

  19. [19]

    MathCoder-VL: Bridging vision and code for enhanced multimodal mathematical reasoning

    Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, and Hongsheng Li. MathCoder-VL: Bridging vision and code for enhanced multimodal mathematical reasoning. In Findings of ACL, 2025

  20. [20]

    MMEvol: Empowering multimodal large language models with Evol-Instruct

    RunLuo,HaonanZhang,LongzeChen,Ting-EnLin,XiongLiu,YuchuanWu,MinYang,MinzhengWang,Pengpeng Zeng, Lianli Gao, and others. MMEvol: Empowering multimodal large language models with Evol-Instruct. arXiv preprint arXiv:2409.05840, 2024

  21. [21]

    Renjie Pi, Felix Bai, Qibin Chen, Simon Wang, Jiulong Shan, Kieran Liu, and Meng Cao. MR. Judge: Multimodal reasoner as a judge. arXiv preprint arXiv:2505.13403, 2025

  22. [22]

    Judge Anything: MLLM as a judge across any modality

    Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, and others. Judge Anything: MLLM as a judge across any modality. arXiv preprint arXiv:2503.17489, 2025

  23. [23]

    Visual-RFT: Visual reinforcement fine-tuning

    ZiyuLiu,ZeyiSun,YuhangZang,XiaoyiDong,YuhangCao,HaodongDuan,DahuaLin,andJiaqiWang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

  24. [24]

    Vision-R1: Incentivizing reasoning capability in multimodal large language models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

  25. [25]

    R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  26. [26]

    Infi-MMR: Curriculum-based unlocking multimodal reasoning via phased reinforcement learning in multimodal small language models

    ZeyuLiu,YuhangLiu,GuanghaoZhu,CongkaiXie,ZhenLi,JianboYuan,XinyaoWang,QingLi,Shing-ChiCheung, Shengyu Zhang, Fei Wu, and Hongxia Yang. Infi-MMR: Curriculum-based unlocking multimodal reasoning via phased reinforcement learning in multimodal small language models. arXiv preprint arXiv:2505.23091, 2025. 15

  27. [27]

    MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, and others. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

  28. [28]

    VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  29. [29]

    Skywork R1V2: Multimodal hybrid reinforcement learning for reasoning

    Peiyu Wang, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, and Yahui Zhou. Skywork R1V2: Multimodal hybrid reinforcement learning for reasoning. arXiv preprint arXiv:2504.16656, 2025

  30. [30]

    Open Vision Reasoner: Transferring linguistic cognitive behavior for visual reasoning

    Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, and others. Open Vision Reasoner: Transferring linguistic cognitive behavior for visual reasoning. arXiv preprint arXiv:2507.05255, 2025

  31. [31]

    Dual-uncertainty guided policy learning for multimodal reasoning

    Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, and Dong Yu. Dual-uncertainty guided policy learning for multimodal reasoning. arXiv preprint arXiv:2510.01444, 2025

  32. [32]

    More than the final answer: Improving visual extraction and logical consistency in vision-language models

    Hoang Anh Just, Yifei Fan, Handong Zhao, Jiuxiang Gu, Ruiyi Zhang, Simon Jenni, Kushal Kafle, Ruoxi Jia, and Jing Shi. More than the final answer: Improving visual extraction and logical consistency in vision-language models. arXiv preprint arXiv:2512.12487, 2025

  33. [33]

    V-Zero: Self-improving multimodal reasoning with zero annotation

    Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, and Wei Chen. V-Zero: Self-improving multimodal reasoning with zero annotation. arXiv preprint arXiv:2601.10094, 2026

  34. [34]

    iReasoner: Trajectory-aware intrinsic reasoning supervision for self-evolving large multimodal models

    Meghana Sunil, Manikandarajan Venmathimaran, and Muthu Subash Kavitha. iReasoner: Trajectory-aware intrinsic reasoning supervision for self-evolving large multimodal models. In Findings of the Association for Computational Linguistics (ACL), 2026. arXiv:2601.05877

  35. [35]

    Fromnarrowtopanoramicvision: Attention-guided cold-start reshapes multimodal reasoning

    RuilinLuo,ChufanShi,YizhenZhang,ChengYang,SongtaoJiang,TongkunGuan,RuizheChen,RuihangChu,Peng Wang,MingkunYang,YujiuYang,JunyangLin,andZhiboYang. Fromnarrowtopanoramicvision: Attention-guided cold-start reshapes multimodal reasoning. In International Conference on Learning Representations (ICLR), 2026. arXiv:2603.03825

  36. [36]

    PaLMR: Towards faithful visual reasoning via multimodal process alignment

    Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, and Shiguo Lian. PaLMR: Towards faithful visual reasoning via multimodal process alignment. In CVPR Findings, 2026. arXiv:2603.06652

  37. [37]

    Visually-guided policy optimization for multimodal reasoning

    Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, and Xiangxiang Chu. Visually-guided policy optimization for multimodal reasoning. arXiv preprint arXiv:2604.09349, 2026

  38. [38]

    Attend to evidence: Evidence-anchored spatial attention supervision for multimodal RLVR

    Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, and Yue Wang. Attend to evidence: Evidence-anchored spatial attention supervision for multimodal RLVR. arXiv preprint arXiv:2605.30912, 2026

  39. [39]

    TRON: Targeted rule-verifiable online environments for visual reasoning RL

    Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, and Jin Sun. TRON: Targeted rule-verifiable online environments for visual reasoning RL. arXiv preprint arXiv:2606.01599, 2026

  40. [40]

    See less, see right: Bi-directional perceptual shaping for multimodal reasoning

    Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, and Rui Wang. See less, see right: Bi-directional perceptual shaping for multimodal reasoning. arXiv preprint arXiv:2512.22120, 2026

  41. [41]

    R1-V: Reinforcing super generalization ability in vision-language models with less than three dollars

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, Vinci, and Zihao Yue. R1-V: Reinforcing super generalization ability in vision-language models with less than three dollars. Technical report, 2025.https://github.com/ StarsfieldAI/R1-V

  42. [42]

    OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles. arXiv preprint arXiv:2503.17352, 2025

  43. [43]

    ThinkLite-VL: Reasoning-enhanced vision-language models with sample-efficient reinforcement fine-tuning

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. ThinkLite-VL: Reasoning-enhanced vision-language models with sample-efficient reinforcement fine-tuning. arXiv preprint arXiv:2504.07934, 2025. 16

  44. [44]

    VLAA-Thinker: SFT or RL? An early investigation into training R1-like reasoning large vision-language models

    HardyChen, HaoqinTu, FaliWang, HuiLiu, XianfengTang, XinyaDu, YuyinZhou, andCihangXie. VLAA-Thinker: SFT or RL? An early investigation into training R1-like reasoning large vision-language models. Transactions on Machine Learning Research, 2025

  45. [45]

    WeThink: Toward general-purpose vision-language reasoning via reinforcement learning

    Jie Yang, Feipeng Ma, Zitian Wang, Dacheng Yin, Kang Rong, Fengyun Rao, and Ruimao Zhang. WeThink: Toward general-purpose vision-language reasoning via reinforcement learning. arXiv preprint arXiv:2506.07905, 2025

  46. [46]

    We-Math 2.0: A versatile MathBook system for incentivizing visual mathematical reasoning

    Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, and Honggang Zhang. We-Math 2.0: A versatile MathBook system for incentivizing visual mathematical reasoning. arXiv preprint arXiv:2508.10433, 2025

  47. [47]

    NoisyRollout: Reinforcing visual reasoning with data augmentation

    Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. NoisyRollout: Reinforcing visual reasoning with data augmentation. Advances in Neural Information Processing Systems, 2025. arXiv:2504.13055

  48. [48]

    thinking with images

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing “thinking with images” via reinforcement learning. In International Conference on Learning Representations (ICLR), 2026. arXiv:2505.14362

  49. [49]

    MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources

    Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, and Shijian Lu. MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268, 2025

  50. [50]

    ReVisual-R1: An open-source 7B multimodal large language model for deep reasoning

    Yuhao Chen, Shubin Huang, Hongyi Yu, Long Li, Zihan Wang, Xinyi Wang, Yuwei Yan, Lifan Yuan, Zhihao Bai, Mengmeng Liu, Jiongnan Liu, Mengjie Wang, Wei Tang, Liuxin Zhang, Junlong Wu, Mingsheng Long, Hao Zhao, Jianzhuang Liu, and Yiming Yang. ReVisual-R1: An open-source 7B multimodal large language model for deep reasoning. arXiv preprint arXiv:2506.04207, 2025

  51. [51]

    Perception-aware policy optimization for multimodal reasoning

    Zhenghai Wang, Wenxuan Zhang, Wenhao Yu, Tianhao Wu, Heng Ji, Hongming Zhang, Dong Yu, Manling Li, and Kaixin Ma. Perception-aware policy optimization for multimodal reasoning. In International Conference on Learning Representations (ICLR), 2026. arXiv:2507.06448

  52. [52]

    OpenMMReasoner: Pushing the frontiers of multimodal reasoning with an open and reproducible recipe

    Kaichen Lin, Bo Li, Yuanhan Zhang, Yifei Sun, Yixiu Liu, Pengyun Wang, Yuhao Dong, Wenjia Liu, Xinyu Wang, Zhiqi Bu, Ziwei Liu, and Chunyuan Li. OpenMMReasoner: Pushing the frontiers of multimodal reasoning with an open and reproducible recipe. arXiv preprint arXiv:2511.16334, 2025

  53. [53]

    Self-Refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...

  54. [54]

    Reflexion: Language agents with verbal reinforcement learning

    NoahShinn,FedericoCassano,EdwardBerman,AshwinGopinath,KarthikNarasimhan,andShunyuYao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023

  55. [55]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  56. [56]

    CRITIC: Large language models can self-correct with tool-interactive critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In International Conference on Learning Representations (ICLR), 2024

  57. [57]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations (ICLR), 2024

  58. [58]

    Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

  59. [59]

    Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper 17 Snoek, Jeffrey Pennington, J...

  60. [60]

    Briefly summarize the verification’s conclusion (1--2 sentences). 25

  61. [61]

    Assess the QUALITY of the verification: is it logically sound, or does it contain self- contradictions, arithmetic errors, or unsupported claims?

  62. [62]

    -- If verification is low-quality or self-contradictory, trust the initial answer

    Decide the final answer: -- If verification is high-quality AND explicitly rejects the initial answer, trust the verification. -- If verification is low-quality or self-contradictory, trust the initial answer. -- If they agree, keep the answer

  63. [63]

    If the Solver and Verifier disagree AND you are not confident in either, output <require_rethink> true</require_rethink>

  64. [64]

    Examples: <final_answer>A</final_answer> or <final_answer>42</final_answer>

    Output your exact final answer inside the tag <final_answer>X</final_answer>. Examples: <final_answer>A</final_answer> or <final_answer>42</final_answer>

  65. [65]

    User prompt template

    Output your confidence as <confidence>0--100</confidence>. User prompt template. Question: {question} Context: {context} (omitted if empty) Options: {choices} (omitted if not multiple choice) Solver’s answer: {hypothesis} Verification report: {verification_text} Post-processing.Thedecider’sresponseisparsedbyregularexpressionsfor <final_answer>, <confidenc...