TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Jingyuan Huang; Jin Sun; Ninghao Liu; Ruitong Sun; Tianze Yang; Yucheng Shi

arxiv: 2606.01599 · v1 · pith:KKZQZ4KMnew · submitted 2026-06-01 · 💻 cs.AI

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Tianze Yang , Yucheng Shi , Ruitong Sun , Jingyuan Huang , Ninghao Liu , Jin Sun This is my paper

Pith reviewed 2026-06-28 14:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords visual reasoningreinforcement learningonline environmentsrule verificationmultimodal benchmarksRL post-trainingvision-language modelsgenerator-verifier

0 comments

The pith

TRON generates fresh visual reasoning tasks on demand via rule-verifiable programs, enabling scalable RL post-training that improves results on ten external benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TRON as an online substrate of 520 environments that create training rollouts dynamically: each instance draws a latent visual state, renders an image, poses a question, and verifies the answer exactly through a generator-verifier program. This replaces fixed static datasets with an unbounded stream of controllable, difficulty-adjusted examples across five ability categories. The same environments support both a single model trained on all buckets and per-bucket specialist models. RL post-training with TRON raises performance on ten held-out multimodal reasoning benchmarks for Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

Core claim

TRON supplies a substrate of controllable generator-verifier programs across 520 environments in five buckets that produce an unbounded stream of fresh visual reasoning instances with exact verification. This allows RL post-training to draw curriculum-matched samples without data collection limits and yields consistent gains on ten external multimodal benchmarks for three vision-language models.

What carries the argument

The generator-verifier program that samples a latent visual state, renders an image, asks a question, and exactly verifies the answer.

If this is right

A single model can train across all five ability buckets using the same substrate.
Per-bucket specialist models can be trained without collecting new data.
Performance gains appear consistently across three different vision-language models on ten held-out benchmarks.
The substrate permits analysis of generation reliability, instance diversity, and base-model pass rates by difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The online generation approach could extend to other RL domains that need verifiable signals, such as textual or code-based reasoning.
Curriculum control over difficulty levels might allow training runs to adapt dynamically to a model's current weaknesses.
Checks for near-duplicates across environments indicate the substrate could scale to larger numbers of environments while preserving diversity.

Load-bearing premise

The generated instances supply training signals whose distribution and verification rules do not introduce systematic biases that block generalization to external benchmarks.

What would settle it

If RL post-training with TRON produces no improvement or a drop in scores on the ten external multimodal benchmarks relative to static-dataset baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.01599 by Jingyuan Huang, Jin Sun, Ninghao Liu, Ruitong Sun, Tianze Yang, Yucheng Shi.

**Figure 1.** Figure 1: TRON: diverse, ability-targeted, auditable environments for visual reasoning RL. TRON organizes 520 rule-verifiable generators into ability buckets covering spatial, mathematical, diagram, pattern, and counting skills. Each environment produces fresh difficulty-controlled image–question rollouts with a deterministic verifier; a substrate analysis (Section 5.1) checks generation quality, instance and level … view at source ↗

**Figure 2.** Figure 2: Model-free audit of the 520 training envi [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Spatial Reasoning examples. Rows show maze navigation and cube-net opposite-face reasoning; [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Mathematical Reasoning examples. Rows show exterior-angle geometry and probability [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Visual Diagram Understanding examples. Rows show scientific graph interpretation and [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Visual Pattern & Logical Reasoning examples. Rows show matrix pattern completion and color [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Counting & Quantitative Estimation examples. Rows show occluded-object counting and [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Training dynamics for ability specialists. The left panel shows validation-accuracy gain over [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

read the original abstract

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with METHOD consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TRON supplies a concrete online generator-verifier substrate for scalable visual reasoning RL data, but the external benchmark gains rest on an untested assumption that the generated instances match the test distributions.

read the letter

The main point is that TRON replaces static datasets with on-demand generation of fresh, exactly verifiable visual reasoning instances. The setup uses controllable programs to sample latent states, render images, ask questions, and check answers, which lets training draw new examples at curriculum-controlled difficulty. They built 520 environments across five buckets (spatial, mathematical, diagram, pattern/logic, counting) and added a substrate analysis on reliability, diversity, near-duplicates, and base-model pass rates. That engineering is the part that could actually be reused.

The results section claims consistent gains after RL on three VL models across ten held-out multimodal benchmarks. If the full paper shows clean ablations and the generators are released, that would be worth looking at for anyone scaling RL post-training.

The soft spot is the transfer argument. Nothing in the description compares TRON rollouts to the benchmark distributions on image statistics, question phrasing, or reasoning types. Without that, the gains could come from the models picking up generator-specific patterns rather than general visual reasoning. The abstract also gives no experimental details on the RL procedure, baselines, or variance, so the evidence for the headline claim stays thin.

This is for groups already running RL on multimodal models who need more controllable data. It is worth sending to peer review because the substrate is specific enough to reproduce and test, even if the generalization evidence will need tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces TRON, a suite of 520 controllable generator-verifier programs organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, counting) that produce on-demand visual reasoning instances for RL post-training. It claims that RL using these environments yields consistent performance gains on ten external multimodal reasoning benchmarks across three VL models (Qwen3-VL-4B, Qwen2.5-VL-7B, MiMo-VL-7B-SFT). The manuscript also presents a substrate analysis of generation reliability, instance diversity, near-duplicates, and base-model pass rates by difficulty.

Significance. If the reported benchmark gains prove robust and transferable, TRON would provide a scalable alternative to static curated datasets by enabling unbounded, difficulty-controllable, rule-verifiable training signals. The substrate analysis of internal properties (reliability, diversity, pass rates) is a positive step toward reproducibility and curriculum design.

major comments (2)

[Abstract] Abstract and experimental results: the headline claim of consistent improvements on ten external benchmarks is presented without any reported baselines, ablation studies, statistical tests, variance across runs, or effect sizes, preventing assessment of whether the gains are attributable to TRON or to other training choices.
[Substrate analysis] Substrate analysis section: while internal properties (generation reliability, instance diversity, near-duplicates, base-model pass rates) are examined, no direct distributional comparison (feature-space distances, reasoning-type histograms, image statistics, or question-phrasing overlap) is reported between TRON rollouts and the ten held-out benchmark distributions; this leaves the generalization assumption untested and load-bearing for the transfer claim.

minor comments (1)

[Abstract] Abstract contains the placeholder 'METHOD' in place of TRON when describing the RL post-training procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying our experimental reporting and analysis approach while committing to revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract and experimental results: the headline claim of consistent improvements on ten external benchmarks is presented without any reported baselines, ablation studies, statistical tests, variance across runs, or effect sizes, preventing assessment of whether the gains are attributable to TRON or to other training choices.

Authors: We agree that the abstract and experimental results would benefit from more explicit reporting to allow readers to assess the source of the gains. In the revised manuscript we will add (i) comparisons against standard baselines including SFT-only and alternative RL post-training methods, (ii) ablation studies that isolate the contribution of individual ability buckets, and (iii) statistical details consisting of means, standard deviations across three random seeds, and Cohen’s d effect sizes for the reported benchmark improvements. These additions will be placed in both the abstract and the main results section. revision: yes
Referee: [Substrate analysis] Substrate analysis section: while internal properties (generation reliability, instance diversity, near-duplicates, base-model pass rates) are examined, no direct distributional comparison (feature-space distances, reasoning-type histograms, image statistics, or question-phrasing overlap) is reported between TRON rollouts and the ten held-out benchmark distributions; this leaves the generalization assumption untested and load-bearing for the transfer claim.

Authors: The substrate analysis is deliberately scoped to internal properties that support reproducibility and curriculum construction. We acknowledge that explicit distributional comparisons would provide additional evidence for the transfer assumption. In revision we will add reasoning-type histograms and basic image statistics (resolution, color distribution, object density) comparing TRON rollouts to the ten external benchmarks. Full feature-space distances and question-phrasing overlap metrics are computationally intensive and will be included only if space permits; otherwise they will be noted as future work. The primary evidence for transfer remains the consistent gains on the held-out benchmarks themselves. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical gains on external benchmarks are not reduced to training inputs by construction.

full rationale

The paper's central claim is an empirical observation: RL post-training on the 520 TRON environments yields measured improvements on ten separately held-out external multimodal benchmarks. No equations, fitted parameters, or self-citations are invoked that would make the reported benchmark deltas equivalent to quantities defined inside the TRON generators or training loop. The substrate analysis addresses internal properties of the generated instances but does not redefine the external evaluation. This is a standard non-circular experimental setup; the generalization question is one of evidence strength rather than definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that rule-based generator-verifier programs can produce training signals that transfer to external benchmarks, plus the implicit assumption that RL post-training benefits from unlimited fresh verifiable instances.

axioms (1)

domain assumption Reinforcement learning with verifiable rewards improves visual reasoning capabilities in multimodal models
The paper invokes this to explain why post-training on TRON yields benchmark gains.

invented entities (1)

TRON environments no independent evidence
purpose: To generate on-demand visual reasoning instances with exact answer verification
New substrate introduced by the paper; no independent evidence outside the described system is provided.

pith-pipeline@v0.9.1-grok · 5773 in / 1331 out tokens · 31906 ms · 2026-06-28T14:53:11.790335+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
cs.AI 2026-06 unverdicted novelty 5.0

VeriEvol decouples prompt difficulty evolution from answer reliability verification to scale verified data for visual math reasoning, lifting benchmark accuracy from 35.42 to 54.73 and adding +3.88 in GRPO RL.

Reference graph

Works this paper leans on

50 extracted references · 19 linked inside Pith · cited by 1 Pith paper

[1]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[2]

Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, et al. Qwen2.5-VL technical report...

Pith/arXiv arXiv 2025
[3]

David G. T. Barrett, Felix Hill, Adam Santoro, Ari S. Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. InProceedings of the 35th International Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 511–520, 2018

2018
[4]

Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles.Advances in Neural Information Processing Systems, 38:3613–3661, 2026

Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Jiaze Chen, Xuefeng Li, Qiying Yu, et al. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles.Advances in Neural Information Processing Systems, 38:3613–3661, 2026

2026
[5]

R1-V: Reinforcing super generalization ability in vision-language models with less than $3

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-V: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025

2025
[6]

PuzzleVQA: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns.arXiv preprint arXiv:2403.13315, 2024

Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. PuzzleVQA: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns.arXiv preprint arXiv:2403.13315, 2024

arXiv 2024
[7]

On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

Pith/arXiv arXiv 1911
[8]

Leveraging procedural generation to benchmark reinforcement learning

Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119, pages 2048–2056. PMLR, 2020

2048
[9]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024

2024
[10]

G-LLaV A: Solving geometric problem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-LLaV A: Solving geometric problem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023

arXiv 2023
[11]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[12]

ChartLlama: A multimodal LLM for chart understanding and generation.arXiv preprint arXiv:2311.16483, 2023

Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. ChartLlama: A multimodal LLM for chart understanding and generation.arXiv preprint arXiv:2311.16483, 2023

arXiv 2023
[13]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021. 11

2021
[14]

Vision-R1: Incentivizing reasoning capability in multimodal large language models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

Pith/arXiv arXiv 2025
[15]

Lawrence Zitnick, and Ross Girshick

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2901–2910, 2017

2017
[16]

Tülu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brah- man, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

Pith/arXiv arXiv 2024
[17]

Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

arXiv 2025
[18]

Jiang, Ziju Shen, et al

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Technical report, Hugging Face / Project Numina, 2024

2024
[19]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

2022
[20]

SynLogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond

Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. SynLogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. a...

arXiv 2025
[21]

Visual-RFT: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

Pith/arXiv arXiv 2025
[22]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations (ICLR), 2024

2024
[23]

ChartGemma: Visual instruction-tuning for chart reasoning in the wild.arXiv preprint arXiv:2407.04172, 2024

Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, and Shafiq Joty. ChartGemma: Visual instruction-tuning for chart reasoning in the wild.arXiv preprint arXiv:2407.04172, 2024

arXiv 2024
[24]

ChartQAPro: A more diverse and challenging benchmark for chart question answering

Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmoham- madi, et al. ChartQAPro: A more diverse and challenging benchmark for chart question answering. arXiv preprint arXiv:2504.05506, 2025

arXiv 2025
[25]

MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

Pith/arXiv arXiv 2025
[26]

Gym-v: A unified vision environment system for agentic vision research.arXiv preprint arXiv:2603.15432, 2026

Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Zichen Liu, et al. Gym-v: A unified vision environment system for agentic vision research.arXiv preprint arXiv:2603.15432, 2026. 12

Pith/arXiv arXiv 2026
[27]

Patel, Yuke Zhu, and Anima Anandkumar

Weili Nie, Zhiding Yu, Lei Mao, Ankit B. Patel, Yuke Zhu, and Anima Anandkumar. Bongard- LOGO: A new benchmark for human-level concept learning and reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

2020
[28]

LMM-R1: Empowering 3b LMMs with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3b LMMs with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

Pith/arXiv arXiv 2025
[29]

We-Math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-Math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

Pith/arXiv arXiv 2024
[30]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[31]

VLM-R1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. VLM-R1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

Pith/arXiv arXiv 2025
[32]

Math-LLaV A: Bootstrapping mathematical reasoning for multimodal large language models.arXiv preprint arXiv:2406.17294, 2024

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-LLaV A: Bootstrapping mathematical reasoning for multimodal large language models.arXiv preprint arXiv:2406.17294, 2024

arXiv 2024
[33]

Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kad- dour, and Andreas Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2505.24760

arXiv 2025
[34]

Reason-RFT: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

arXiv 2025
[35]

VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

Pith/arXiv arXiv 2025
[36]

Is a picture worth a thousand words? delving into spatial reasoning for vision language models

Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[37]

CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs.arXiv preprint arXiv:2406.18521, 2024

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs.arXiv preprint arXiv:2406.18521, 2024

arXiv 2024
[38]

LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

Pith/arXiv arXiv 2024
[39]

Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar

Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. 13

2023
[40]

R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025

Pith/arXiv arXiv 2025
[41]

DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025
[42]

MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs.arXiv preprint arXiv:2505.21327, 2025

Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs.arXiv preprint arXiv:2505.21327, 2025

arXiv 2025
[43]

Mimo-vl technical report.arXiv preprint arXiv:2506.03569, 2025

Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, et al. Mimo-vl technical report.arXiv preprint arXiv:2506.03569, 2025

arXiv 2025
[44]

Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXiv preprint arXiv:2511.07317, 2025

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, et al. Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXiv preprint arXiv:2511.07317, 2025

Pith/arXiv arXiv 2025
[45]

RA VEN: A dataset for relational and analogical visual rEasoNing

Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. RA VEN: A dataset for relational and analogical visual rEasoNing. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5317–5327, 2019

2019
[46]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

Pith/arXiv arXiv 2024
[47]

Mavis: Mathematical visual instruction tuning with an automatic data engine

Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Gao Peng, et al. Mavis: Mathematical visual instruction tuning with an automatic data engine. InInternational Conference on Learning Representations, volume 2025, pages 87955–87989, 2025

2025
[48]

MM- HELIX: Boosting multimodal long-chain reflective reasoning with holistic platform and adaptive hybrid policy optimization.arXiv preprint arXiv:2510.08540, 2025

Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, and Xue Yang. MM- HELIX: Boosting multimodal long-chain reflective reasoning with holistic platform and adaptive hybrid policy optimization.arXiv preprint arXiv:2510.08540, 2025

arXiv 2025
[49]

miniF2F: A cross-system benchmark for formal olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: A cross-system benchmark for formal olympiad-level mathematics. InInternational Conference on Learning Representations (ICLR), 2022

2022
[50]

"" 2Clock angle QA -- analog clock showing a time. 3Questions: angle between hands, time shown, angle after N minutes, overlap count. 4

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024. 14 A. Fine-Grained Environment Coverage Table 6 expands the high-level suite composition in Table 1. The entries are representative r...

arXiv 2024

[1] [1]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[2] [2]

Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, et al. Qwen2.5-VL technical report...

Pith/arXiv arXiv 2025

[3] [3]

David G. T. Barrett, Felix Hill, Adam Santoro, Ari S. Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. InProceedings of the 35th International Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 511–520, 2018

2018

[4] [4]

Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles.Advances in Neural Information Processing Systems, 38:3613–3661, 2026

Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Jiaze Chen, Xuefeng Li, Qiying Yu, et al. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles.Advances in Neural Information Processing Systems, 38:3613–3661, 2026

2026

[5] [5]

R1-V: Reinforcing super generalization ability in vision-language models with less than $3

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-V: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025

2025

[6] [6]

PuzzleVQA: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns.arXiv preprint arXiv:2403.13315, 2024

Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. PuzzleVQA: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns.arXiv preprint arXiv:2403.13315, 2024

arXiv 2024

[7] [7]

On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

Pith/arXiv arXiv 1911

[8] [8]

Leveraging procedural generation to benchmark reinforcement learning

Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InProceedings of the 37th International Conference on Machine Learning (ICML), volume 119, pages 2048–2056. PMLR, 2020

2048

[9] [9]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024

2024

[10] [10]

G-LLaV A: Solving geometric problem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-LLaV A: Solving geometric problem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023

arXiv 2023

[11] [11]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[12] [12]

ChartLlama: A multimodal LLM for chart understanding and generation.arXiv preprint arXiv:2311.16483, 2023

Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. ChartLlama: A multimodal LLM for chart understanding and generation.arXiv preprint arXiv:2311.16483, 2023

arXiv 2023

[13] [13]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021. 11

2021

[14] [14]

Vision-R1: Incentivizing reasoning capability in multimodal large language models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

Pith/arXiv arXiv 2025

[15] [15]

Lawrence Zitnick, and Ross Girshick

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2901–2910, 2017

2017

[16] [16]

Tülu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brah- man, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

Pith/arXiv arXiv 2024

[17] [17]

Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025

arXiv 2025

[18] [18]

Jiang, Ziju Shen, et al

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Technical report, Hugging Face / Project Numina, 2024

2024

[19] [19]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

2022

[20] [20]

SynLogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond

Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. SynLogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. a...

arXiv 2025

[21] [21]

Visual-RFT: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

Pith/arXiv arXiv 2025

[22] [22]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations (ICLR), 2024

2024

[23] [23]

ChartGemma: Visual instruction-tuning for chart reasoning in the wild.arXiv preprint arXiv:2407.04172, 2024

Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, and Shafiq Joty. ChartGemma: Visual instruction-tuning for chart reasoning in the wild.arXiv preprint arXiv:2407.04172, 2024

arXiv 2024

[24] [24]

ChartQAPro: A more diverse and challenging benchmark for chart question answering

Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmoham- madi, et al. ChartQAPro: A more diverse and challenging benchmark for chart question answering. arXiv preprint arXiv:2504.05506, 2025

arXiv 2025

[25] [25]

MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

Pith/arXiv arXiv 2025

[26] [26]

Gym-v: A unified vision environment system for agentic vision research.arXiv preprint arXiv:2603.15432, 2026

Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Zichen Liu, et al. Gym-v: A unified vision environment system for agentic vision research.arXiv preprint arXiv:2603.15432, 2026. 12

Pith/arXiv arXiv 2026

[27] [27]

Patel, Yuke Zhu, and Anima Anandkumar

Weili Nie, Zhiding Yu, Lei Mao, Ankit B. Patel, Yuke Zhu, and Anima Anandkumar. Bongard- LOGO: A new benchmark for human-level concept learning and reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

2020

[28] [28]

LMM-R1: Empowering 3b LMMs with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3b LMMs with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

Pith/arXiv arXiv 2025

[29] [29]

We-Math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-Math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

Pith/arXiv arXiv 2024

[30] [30]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[31] [31]

VLM-R1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. VLM-R1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

Pith/arXiv arXiv 2025

[32] [32]

Math-LLaV A: Bootstrapping mathematical reasoning for multimodal large language models.arXiv preprint arXiv:2406.17294, 2024

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-LLaV A: Bootstrapping mathematical reasoning for multimodal large language models.arXiv preprint arXiv:2406.17294, 2024

arXiv 2024

[33] [33]

Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kad- dour, and Andreas Köpf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2505.24760

arXiv 2025

[34] [34]

Reason-RFT: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

arXiv 2025

[35] [35]

VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

Pith/arXiv arXiv 2025

[36] [36]

Is a picture worth a thousand words? delving into spatial reasoning for vision language models

Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[37] [37]

CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs.arXiv preprint arXiv:2406.18521, 2024

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs.arXiv preprint arXiv:2406.18521, 2024

arXiv 2024

[38] [38]

LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

Pith/arXiv arXiv 2024

[39] [39]

Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar

Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. 13

2023

[40] [40]

R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025

Pith/arXiv arXiv 2025

[41] [41]

DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025

[42] [42]

MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs.arXiv preprint arXiv:2505.21327, 2025

Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. MME-Reasoning: A comprehensive benchmark for logical reasoning in MLLMs.arXiv preprint arXiv:2505.21327, 2025

arXiv 2025

[43] [43]

Mimo-vl technical report.arXiv preprint arXiv:2506.03569, 2025

Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, et al. Mimo-vl technical report.arXiv preprint arXiv:2506.03569, 2025

arXiv 2025

[44] [44]

Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXiv preprint arXiv:2511.07317, 2025

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, et al. Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments.arXiv preprint arXiv:2511.07317, 2025

Pith/arXiv arXiv 2025

[45] [45]

RA VEN: A dataset for relational and analogical visual rEasoNing

Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. RA VEN: A dataset for relational and analogical visual rEasoNing. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5317–5327, 2019

2019

[46] [46]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?arXiv preprint arXiv:2403.14624, 2024

Pith/arXiv arXiv 2024

[47] [47]

Mavis: Mathematical visual instruction tuning with an automatic data engine

Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Gao Peng, et al. Mavis: Mathematical visual instruction tuning with an automatic data engine. InInternational Conference on Learning Representations, volume 2025, pages 87955–87989, 2025

2025

[48] [48]

MM- HELIX: Boosting multimodal long-chain reflective reasoning with holistic platform and adaptive hybrid policy optimization.arXiv preprint arXiv:2510.08540, 2025

Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, and Xue Yang. MM- HELIX: Boosting multimodal long-chain reflective reasoning with holistic platform and adaptive hybrid policy optimization.arXiv preprint arXiv:2510.08540, 2025

arXiv 2025

[49] [49]

miniF2F: A cross-system benchmark for formal olympiad-level mathematics

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: A cross-system benchmark for formal olympiad-level mathematics. InInternational Conference on Learning Representations (ICLR), 2022

2022

[50] [50]

"" 2Clock angle QA -- analog clock showing a time. 3Questions: angle between hands, time shown, angle after N minutes, overlap count. 4

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024. 14 A. Fine-Grained Environment Coverage Table 6 expands the high-level suite composition in Table 1. The entries are representative r...

arXiv 2024