arxiv: 2606.01599 · v1 · pith:KKZQZ4KMnew · submitted 2026-06-01 · 💻 cs.AI

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Pith reviewed 2026-06-28 14:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords visual reasoning · reinforcement learning · online environments · rule verification · multimodal benchmarks · RL post-training · vision-language models · generator-verifier

The pith

TRON generates fresh visual reasoning tasks on demand via rule-verifiable programs, enabling scalable RL post-training that improves results on ten external benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TRON as an online substrate of 520 environments that create training rollouts dynamically: each instance draws a latent visual state, renders an image, poses a question, and verifies the answer exactly through a generator-verifier program. This replaces fixed static datasets with an unbounded stream of controllable, difficulty-adjusted examples across five ability categories. The same environments support both a single model trained on all buckets and per-bucket specialist models. RL post-training with TRON raises performance on ten held-out multimodal reasoning benchmarks for Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

Core claim

TRON supplies a substrate of controllable generator-verifier programs across 520 environments in five buckets that produce an unbounded stream of fresh visual reasoning instances with exact verification. This allows RL post-training to draw curriculum-matched samples without data collection limits and yields consistent gains on ten external multimodal benchmarks for three vision-language models.

What carries the argument

The generator-verifier program that samples a latent visual state, renders an image, asks a question, and exactly verifies the answer.
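
To make the four steps concrete, here is a minimal sketch of one generator-verifier program in Python. It is not taken from the paper: the clock-angle task, the `MINUTE_STEP` difficulty table, the SVG rendering, and every function name are illustrative assumptions. It only shows the shape of the loop, which is to sample a latent state, render it, pose a question, and check the answer exactly.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical difficulty knob: easier levels snap the minute hand to coarser steps.
# All steps are even, so the hour-hand angle stays a whole number of degrees.
MINUTE_STEP = {1: 30, 2: 10, 3: 2}


@dataclass(frozen=True)
class Instance:
    image_svg: str  # rendered image (an SVG string in this sketch)
    question: str
    answer: int     # ground truth computed from the latent state


def clock_angle(hour: int, minute: int) -> int:
    """Smaller angle between the hands in degrees; minute must be even."""
    hour_deg = 30 * (hour % 12) + minute // 2  # hour hand moves 0.5 deg per minute
    diff = abs(hour_deg - 6 * minute)          # minute hand moves 6 deg per minute
    return min(diff, 360 - diff)


def render(hour: int, minute: int) -> str:
    """Draw a bare clock face with two hands as an SVG string."""
    def hand(angle_deg: float, length: float) -> str:
        rad = math.radians(angle_deg)
        x = 50 + length * math.sin(rad)
        y = 50 - length * math.cos(rad)
        return f'<line x1="50" y1="50" x2="{x:.1f}" y2="{y:.1f}" stroke="black"/>'

    hour_deg = 30 * (hour % 12) + minute / 2
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
        '<circle cx="50" cy="50" r="45" fill="none" stroke="black"/>'
        + hand(hour_deg, 25) + hand(6 * minute, 40) + "</svg>"
    )


def generate(seed: int, level: int) -> Instance:
    """Sample a latent state (hour, minute), render it, and attach the answer."""
    rng = random.Random(seed)
    hour = rng.randrange(1, 13)
    minute = rng.randrange(0, 60, MINUTE_STEP[level])
    question = "What is the smaller angle, in degrees, between the two clock hands?"
    return Instance(render(hour, minute), question, clock_angle(hour, minute))


def verify(instance: Instance, response: str) -> bool:
    """Exact verification: the response must parse to the ground-truth integer."""
    try:
        return int(response.strip()) == instance.answer
    except ValueError:
        return False
```

Because the answer is computed from the latent state rather than read off the image, the reward needs no learned judge, and a fresh seed yields a fresh instance at the requested level.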

If this is right

  • A single model can train across all five ability buckets using the same substrate.
  • Per-bucket specialist models can be trained without collecting new data.
  • Performance gains appear consistently across three different vision-language models on ten held-out benchmarks.
  • The substrate permits analysis of generation reliability, instance diversity, and base-model pass rates by difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The online generation approach could extend to other RL domains that need verifiable signals, such as textual or code-based reasoning.
  • Curriculum control over difficulty levels might allow training runs to adapt dynamically to a model's current weaknesses.
  • Checks for near-duplicates across environments indicate the substrate could scale to larger numbers of environments while preserving diversity.
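
As a rough illustration of the curriculum point above, the sketch below adjusts the requested difficulty level from a rolling pass rate. It is an assumption about how such a controller might look, not the paper's scheduler; the class name, window size, and thresholds are all hypothetical.

```python
from collections import deque


class DifficultyController:
    """Hypothetical controller: promote on a high rolling pass rate, demote on a low one."""

    def __init__(self, max_level: int = 5, window: int = 20,
                 promote_at: float = 0.8, demote_at: float = 0.3) -> None:
        self.level = 1
        self.max_level = max_level
        self.promote_at = promote_at
        self.demote_at = demote_at
        self.results = deque(maxlen=window)  # most recent verifier outcomes

    def record(self, passed: bool) -> int:
        """Store one verifier outcome and return the level for the next rollout."""
        self.results.append(passed)
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if rate >= self.promote_at and self.level < self.max_level:
                self.level += 1
                self.results.clear()  # start a fresh window at the new level
            elif rate <= self.demote_at and self.level > 1:
                self.level -= 1
                self.results.clear()
        return self.level
```

Clearing the window after each level change keeps outcomes from the old level out of the next decision. One controller per environment or per ability bucket would let a run concentrate on whatever the model currently fails.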

Load-bearing premise

The generated instances supply training signals whose distribution and verification rules do not introduce systematic biases that block generalization to external benchmarks.

What would settle it

If RL post-training with TRON produces no improvement or a drop in scores on the ten external multimodal benchmarks relative to static-dataset baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.01599 by Jingyuan Huang, Jin Sun, Ninghao Liu, Ruitong Sun, Tianze Yang, Yucheng Shi.

Figure 1: TRON: diverse, ability-targeted, auditable environments for visual reasoning RL. TRON organizes 520 rule-verifiable generators into ability buckets covering spatial, mathematical, diagram, pattern, and counting skills. Each environment produces fresh difficulty-controlled image–question rollouts with a deterministic verifier; a substrate analysis (Section 5.1) checks generation quality, instance and level …
Figure 2: Model-free audit of the 520 training envi…
Figure 3: Spatial Reasoning examples. Rows show maze navigation and cube-net opposite-face reasoning; …
Figure 4: Mathematical Reasoning examples. Rows show exterior-angle geometry and probability …
Figure 5: Visual Diagram Understanding examples. Rows show scientific graph interpretation and …
Figure 6: Visual Pattern & Logical Reasoning examples. Rows show matrix pattern completion and color …
Figure 7: Counting & Quantitative Estimation examples. Rows show occluded-object counting and …
Figure 8: Training dynamics for ability specialists. The left panel shows validation-accuracy gain over …

Original abstract

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with METHOD consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TRON, a suite of 520 controllable generator-verifier programs organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, counting) that produce on-demand visual reasoning instances for RL post-training. It claims that RL using these environments yields consistent performance gains on ten external multimodal reasoning benchmarks across three VL models (Qwen3-VL-4B, Qwen2.5-VL-7B, MiMo-VL-7B-SFT). The manuscript also presents a substrate analysis of generation reliability, instance diversity, near-duplicates, and base-model pass rates by difficulty.

Significance. If the reported benchmark gains prove robust and transferable, TRON would provide a scalable alternative to static curated datasets by enabling unbounded, difficulty-controllable, rule-verifiable training signals. The substrate analysis of internal properties (reliability, diversity, pass rates) is a positive step toward reproducibility and curriculum design.

major comments (2)
  1. [Abstract] Abstract and experimental results: the headline claim of consistent improvements on ten external benchmarks is presented without any reported baselines, ablation studies, statistical tests, variance across runs, or effect sizes, preventing assessment of whether the gains are attributable to TRON or to other training choices.
  2. [Substrate analysis] Substrate analysis section: while internal properties (generation reliability, instance diversity, near-duplicates, base-model pass rates) are examined, no direct distributional comparison (feature-space distances, reasoning-type histograms, image statistics, or question-phrasing overlap) is reported between TRON rollouts and the ten held-out benchmark distributions; this leaves the generalization assumption untested and load-bearing for the transfer claim.
minor comments (1)
  1. [Abstract] Abstract contains the placeholder 'METHOD' in place of TRON when describing the RL post-training procedure.
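
One inexpensive starting point for the distributional comparison requested in major comment 2 is a total variation distance between reasoning-type histograms. The sketch below assumes each instance already carries a categorical reasoning-type label. The labels and function names are hypothetical, and the paper does not report this metric.

```python
from collections import Counter
from typing import Dict, Iterable


def normalized_histogram(labels: Iterable[str]) -> Dict[str, float]:
    """Turn a sequence of category labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return {label: n / total for label, n in counts.items()}


def total_variation(labels_a: Iterable[str], labels_b: Iterable[str]) -> float:
    """Total variation distance between two label distributions, in [0, 1]."""
    p = normalized_histogram(labels_a)
    q = normalized_histogram(labels_b)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A value of 0 means the two label mixes match and 1 means they share no category. This captures only the mix of reasoning types, so it complements rather than replaces the image-level and phrasing-level comparisons the referee lists.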

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying our experimental reporting and analysis approach while committing to revisions where appropriate.

  1. Referee: [Abstract] Abstract and experimental results: the headline claim of consistent improvements on ten external benchmarks is presented without any reported baselines, ablation studies, statistical tests, variance across runs, or effect sizes, preventing assessment of whether the gains are attributable to TRON or to other training choices.

    Authors: We agree that the abstract and experimental results would benefit from more explicit reporting to allow readers to assess the source of the gains. In the revised manuscript we will add (i) comparisons against standard baselines including SFT-only and alternative RL post-training methods, (ii) ablation studies that isolate the contribution of individual ability buckets, and (iii) statistical details consisting of means, standard deviations across three random seeds, and Cohen’s d effect sizes for the reported benchmark improvements. These additions will be placed in both the abstract and the main results section. revision: yes

  2. Referee: [Substrate analysis] Substrate analysis section: while internal properties (generation reliability, instance diversity, near-duplicates, base-model pass rates) are examined, no direct distributional comparison (feature-space distances, reasoning-type histograms, image statistics, or question-phrasing overlap) is reported between TRON rollouts and the ten held-out benchmark distributions; this leaves the generalization assumption untested and load-bearing for the transfer claim.

    Authors: The substrate analysis is deliberately scoped to internal properties that support reproducibility and curriculum construction. We acknowledge that explicit distributional comparisons would provide additional evidence for the transfer assumption. In revision we will add reasoning-type histograms and basic image statistics (resolution, color distribution, object density) comparing TRON rollouts to the ten external benchmarks. Full feature-space distances and question-phrasing overlap metrics are computationally intensive and will be included only if space permits; otherwise they will be noted as future work. The primary evidence for transfer remains the consistent gains on the held-out benchmarks themselves. revision: partial
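
For readers unfamiliar with the effect size promised in the first response, the sketch below computes Cohen's d with a pooled standard deviation for two small sets of per-seed scores. The function name is an assumption, and any scores passed to it are the caller's own; no numbers from the paper are used here.

```python
import math
import statistics
from typing import Sequence


def cohens_d(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    n_t, n_b = len(treated), len(baseline)
    if n_t < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    var_t = statistics.variance(treated)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(baseline)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_b - 1) * var_b) / (n_t + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.mean(treated) - statistics.mean(baseline)) / pooled_sd
```

With only three seeds per condition, both the standard deviations and d are noisy estimates, so a reported d is best read alongside the raw per-seed scores.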

Circularity Check

0 steps flagged

No significant circularity; empirical gains on external benchmarks are not reduced to training inputs by construction.

The paper's central claim is an empirical observation: RL post-training on the 520 TRON environments yields measured improvements on ten separately held-out external multimodal benchmarks. No equations, fitted parameters, or self-citations are invoked that would make the reported benchmark deltas equivalent to quantities defined inside the TRON generators or training loop. The substrate analysis addresses internal properties of the generated instances but does not redefine the external evaluation. This is a standard non-circular experimental setup; the generalization question is one of evidence strength rather than definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that rule-based generator-verifier programs can produce training signals that transfer to external benchmarks, plus the implicit assumption that RL post-training benefits from unlimited fresh verifiable instances.

axioms (1)
  • domain assumption: Reinforcement learning with verifiable rewards improves visual reasoning capabilities in multimodal models.
    The paper invokes this to explain why post-training on TRON yields benchmark gains.
invented entities (1)
  • TRON environments (no independent evidence)
    purpose: To generate on-demand visual reasoning instances with exact answer verification.
    New substrate introduced by the paper; no independent evidence outside the described system is provided.

pith-pipeline@v0.9.1-grok · 5773 in / 1331 out tokens · 31906 ms · 2026-06-28T14:53:11.790335+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

    cs.AI · 2026-06 · unverdicted · novelty 5.0

    VeriEvol decouples prompt difficulty evolution from answer reliability verification to scale verified data for visual math reasoning, lifting benchmark accuracy from 35.42 to 54.73 and adding +3.88 in GRPO RL.
