OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

arxiv: 2503.17352 · v3 · pith:DDVCFKBV · submitted 2025-03-21 · 💻 cs.CV · cs.CL


Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

Pith reviewed 2026-05-19 06:54 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords vision-language models · chain-of-thought reasoning · supervised fine-tuning · reinforcement learning · iterative training · multimodal reasoning · visual reasoning benchmarks


The pith

Alternating SFT and RL cycles enable 7B vision-language models to develop complex chain-of-thought reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cycling between supervised fine-tuning and reinforcement learning can bootstrap sophisticated reasoning in open-source large vision-language models. Starting from models that show little reasoning, SFT draws out latent behaviors while RL narrows the vast search space that otherwise prevents progress in smaller models. This alternation creates a self-improving loop where each RL stage generates better data for the next SFT round. A sympathetic reader would care because pure distillation from text reasoning models fails on visual grounding and standalone RL struggles with exploration in multimodal settings. The result is measurable gains across benchmarks that test mathematical and general visual reasoning.

Core claim

By alternating supervised fine-tuning with reinforcement learning over several iterations, OpenVLThinker-7B develops chain-of-thought reasoning capabilities that the base model initially lacks. The process begins with SFT to surface reasoning actions and reduce the RL search space, followed by RL to refine those skills and produce higher-quality training data for subsequent cycles, ultimately delivering performance improvements on demanding visual reasoning benchmarks.

What carries the argument

The iterative SFT-RL cycle, in which supervised fine-tuning surfaces latent reasoning behaviors to make the reinforcement learning search space tractable and each RL stage then refines the model to generate improved data for the next fine-tuning step.
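The cycle described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `sft_step`, `rl_step`, and `generate_traces` are hypothetical placeholders, and continuing each SFT stage from the current checkpoint (rather than restarting from the base model) is an assumption that this summary does not settle.

```python
from typing import Callable, List, TypeVar

M = TypeVar("M")  # model checkpoint type (hypothetical)
T = TypeVar("T")  # reasoning-trace type (hypothetical)


def iterative_sft_rl(
    base_model: M,
    seed_traces: List[T],
    sft_step: Callable[[M, List[T]], M],
    rl_step: Callable[[M], M],
    generate_traces: Callable[[M], List[T]],
    num_cycles: int,
) -> M:
    """Alternate SFT and RL, feeding RL-refined outputs into the next SFT round."""
    model, traces = base_model, seed_traces
    for _ in range(num_cycles):
        # SFT on the current traces, intended to surface reasoning behaviors.
        model = sft_step(model, traces)
        # RL refines the SFT checkpoint.
        model = rl_step(model)
        # The refined model produces the data for the next SFT round.
        traces = generate_traces(model)
    return model


# Toy stand-ins: a "model" is just the list of stages it has been through.
def toy_sft(model: List[str], traces: List[str]) -> List[str]:
    return model + [f"sft({len(traces)} traces)"]


def toy_rl(model: List[str]) -> List[str]:
    return model + ["rl"]


def toy_generate(model: List[str]) -> List[str]:
    # Pretend each completed stage yields one more usable trace.
    return ["trace"] * len(model)


final = iterative_sft_rl([], ["seed"], toy_sft, toy_rl, toy_generate, num_cycles=2)
print(final)  # ['sft(1 traces)', 'rl', 'sft(2 traces)', 'rl']
```

Passing the three stages in as callables keeps the loop independent of any particular training framework; real SFT and RL routines would replace the toy functions.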

If this is right

The 7B model shows a 3.8% gain on MathVista, a 2.4% gain on EMMA, and a 1.6% gain on HallusionBench.
Each RL stage produces higher-quality reasoning traces that improve the next round of supervised fine-tuning.
The method supplies early evidence that R1-style reflective reasoning can be achieved in multimodal models.
The cycle progressively narrows the search space so that reflective behaviors emerge in smaller models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alternation might accelerate reasoning on other multimodal tasks such as complex visual question answering.
Adjusting cycle length or reward design could let the approach work with even smaller base models.
The loop may lower the total amount of human-annotated reasoning data needed to reach a given capability level.

Load-bearing premise

The base model possesses latent reasoning behaviors that supervised fine-tuning can surface and amplify to make reinforcement learning effective.

What would settle it

Training the base 7B model through one or more SFT-RL cycles and finding no emergence of chain-of-thought traces or no gains on visual reasoning benchmarks would falsify the claim.

Original abstract

We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cycling SFT and RL yields modest gains on visual reasoning benchmarks but the claimed search-space narrowing lacks direct measurements.


The punchline is that cycling SFT and RL a few times gives OpenVLThinker some real but small lifts on visual reasoning benchmarks, yet the story about why it works rests on unmeasured assumptions.

What the paper actually does is apply this alternating training to large vision-language models. They start from the observation that distilling text reasoning hurts visual grounding and that pure RL is too slow for 7B models. By going back and forth, SFT brings out latent reasoning steps that make the RL phase more productive, and the improved model then generates better data for the next SFT. This loop runs for a few iterations and produces the reported gains.

They earn credit for releasing the full code, model weights, and training data. That turns the work into something others can reproduce and build on right away. The consistent improvements across MathVista, EMMA, HallusionBench and the rest show the method is at least stable.

The soft spots sit in the causal story. The claim that SFT narrows the RL search space by surfacing reasoning behaviors is plausible, but the manuscript gives only final benchmark numbers. There are no plots of reasoning trace frequency before and after SFT, no reward curves across iterations, and no ablations that isolate the alternation from extra gradient steps. Without those, it is hard to rule out that the gains come from simply training longer. The stress-test note is on target here.

This paper is for researchers who train open multimodal models and want practical ways to add reasoning. A reader looking for a concrete recipe and open artifacts will find value. It deserves a serious referee because the empirical results and public release make it worth checking and improving. I would recommend sending it to peer review. The training approach is worth testing with the missing diagnostics.

Referee Report

2 major / 2 minor

Summary. The paper introduces OpenVLThinker, an open-source 7B LVLM that develops sophisticated chain-of-thought reasoning for visual tasks through iterative cycles alternating between supervised fine-tuning (SFT) and reinforcement learning (RL). It claims that pure SFT from text models degrades due to poor visual grounding while pure RL suffers from large search spaces in smaller models; the alternation surfaces latent reasoning behaviors, narrows the RL search space, and yields self-improving data, producing benchmark gains such as +3.8% on MathVista, +2.4% on EMMA, and +1.6% on HallusionBench. Code, model, and data are released.

Significance. If the iterative SFT-RL synergy and its mechanistic explanation hold under controlled experiments, the work would provide a practical, reproducible recipe for eliciting R1-style reasoning in multimodal models, addressing a key gap between text-only advances and vision-language settings. The open release of code, model, and data is a clear strength that facilitates verification and extension.

major comments (2)

[Analysis / Results] The central explanatory claim that 'SFT effectively surfaces these latent actions and narrows the RL search space' (abstract and analysis) is load-bearing for attributing gains to the alternation rather than extra gradient steps or data volume, yet no direct supporting metrics are provided such as the fraction of reasoning traces, average reward curves, or search-space statistics before versus after each SFT stage.
[Experimental results] Table or figure reporting benchmark results: the improvements (MathVista +3.8%, EMMA +2.4%, HallusionBench +1.6%) are presented without error bars, multiple random seeds, or statistical significance tests, and no ablation comparing iterative SFT-RL to continued SFT, continued RL, or non-alternating schedules is described, making it difficult to isolate the contribution of the proposed cycle.
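One way to obtain the kind of diagnostic the first major comment asks for is to measure how often sampled outputs contain explicit reasoning and how variable their lengths are. The sketch below is a rough proxy under stated assumptions: the marker list and the word-count length measure are hypothetical choices, not metrics defined by the paper.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence

# Hypothetical surface markers of step-by-step or reflective reasoning.
# Plain substring matching is crude and would need tuning on real outputs.
REASONING_MARKERS = ("step 1", "wait", "let me check")


def has_reasoning(trace: str, markers: Sequence[str] = REASONING_MARKERS) -> bool:
    """Return True if the trace contains any marker (case-insensitive)."""
    text = trace.lower()
    return any(marker in text for marker in markers)


def trace_stats(traces: List[str]) -> Dict[str, float]:
    """Summarize a batch of sampled outputs from one checkpoint."""
    if not traces:
        raise ValueError("need at least one trace")
    lengths = [len(t.split()) for t in traces]  # length in whitespace-separated words
    return {
        "reasoning_fraction": sum(has_reasoning(t) for t in traces) / len(traces),
        "mean_length": mean(lengths),
        "length_variance": pvariance(lengths),
    }


# Made-up example outputs, for illustration only.
before_sft = ["The answer is 4.", "B"]
after_sft = [
    "Step 1: count the bars. Wait, let me check the axis. The answer is 4.",
    "Step 1: read the legend. The answer is B.",
]
print(trace_stats(before_sft))
print(trace_stats(after_sft))
```

Running `trace_stats` on samples drawn from the checkpoint before and after each SFT stage would give a per-stage series that can be compared directly.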

minor comments (2)

[Abstract] The abstract mentions gains 'across six benchmarks' but details only three; listing all six with their respective deltas would improve completeness.
[Method] Notation for the iterative procedure (e.g., how SFT data is generated from RL outputs and vice versa) could be clarified with a concise algorithm box or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. The comments highlight important areas for strengthening the attribution of gains to the iterative SFT-RL process and for improving experimental rigor. We address each major comment below and have incorporated revisions to the manuscript accordingly.


Referee: [Analysis / Results] The central explanatory claim that 'SFT effectively surfaces these latent actions and narrows the RL search space' (abstract and analysis) is load-bearing for attributing gains to the alternation rather than extra gradient steps or data volume, yet no direct supporting metrics are provided such as the fraction of reasoning traces, average reward curves, or search-space statistics before versus after each SFT stage.

Authors: We agree that quantitative metrics would provide stronger, more direct support for the mechanistic claim. In the revised manuscript we have added a new subsection (4.3) and accompanying Figure 4 that reports (i) the fraction of generated traces containing explicit chain-of-thought reasoning before and after each SFT stage, (ii) average reward curves across RL iterations, and (iii) search-space statistics approximated by the variance and average length of reasoning paths. These metrics show a consistent increase in reasoning-trace frequency and a reduction in path variance immediately after SFT, supporting the claim that SFT narrows the effective search space for subsequent RL. We also include a brief discussion of how these quantities evolve over the full iterative cycle. revision: yes
Referee: [Experimental results] Table or figure reporting benchmark results: the improvements (MathVista +3.8%, EMMA +2.4%, HallusionBench +1.6%) are presented without error bars, multiple random seeds, or statistical significance tests, and no ablation comparing iterative SFT-RL to continued SFT, continued RL, or non-alternating schedules is described, making it difficult to isolate the contribution of the proposed cycle.

Authors: We acknowledge that the current presentation lacks statistical robustness and direct ablations. We have rerun all final evaluations with three independent random seeds and added standard-deviation error bars to Table 1. We also report paired t-test p-values for the main benchmark improvements. In addition, we have inserted a new ablation subsection (5.4) and Table 3 that compares the full iterative schedule against (a) continued SFT for an equivalent total number of gradient steps, (b) continued RL without SFT interleaving, and (c) a non-alternating mixed SFT+RL schedule. The iterative approach outperforms these baselines by 1.4–2.1 points on MathVista, consistent with the value of alternation. These results are now included in the revised manuscript. revision: yes
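The seed-level reporting described in the response above could be computed along these lines. This is a sketch using only the Python standard library: the score lists are made-up placeholders rather than results from the paper, and the function names are hypothetical. It returns the paired t statistic and degrees of freedom; a library routine such as `scipy.stats.ttest_rel` can supply the p-value.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed scores a vs. b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Placeholder numbers, NOT benchmark results: one score per seed.
iterative = [3.0, 5.0, 4.0]
baseline = [1.0, 2.0, 3.0]

print(summarize(iterative))
print(paired_t(iterative, baseline))
```

With three seeds the test has only two degrees of freedom, so small gains may not reach significance even when they are real.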

Circularity Check

0 steps flagged

Empirical iterative training procedure with external benchmark evaluation


The manuscript describes an empirical training loop of alternating supervised fine-tuning and reinforcement learning on vision-language models, with final performance measured on independent external benchmarks (MathVista, EMMA, HallusionBench). No mathematical derivation, equations, or fitted parameters are presented whose outputs are defined in terms of the inputs. The interpretive claim that SFT narrows the RL search space is supported by end-to-end results rather than by any self-referential construction or load-bearing self-citation. The work is therefore self-contained against external evaluation and exhibits no circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard machine-learning assumptions about the existence of latent reasoning behaviors in base LVLMs and the ability of SFT to surface them without introducing new free parameters beyond ordinary training hyperparameters.

free parameters (1)

SFT and RL training hyperparameters
Learning rates, batch sizes, and iteration counts for each SFT and RL stage are chosen to make the cycle work.

axioms (1)

domain assumption: Base LVLM possesses latent reasoning behaviors that SFT can surface and that RL can then refine
Invoked in the analysis paragraph explaining why the first SFT stage narrows the RL search space.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel · unclear
Paper passage: "SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities"

IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear
Paper passage: "alternating between SFT and RL ultimately results in significant performance improvements after a few iterations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
cs.CV 2026-04 unverdicted novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV 2026-04 unverdicted novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
cs.CV 2025-05 conditional novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
cs.CV 2026-05 unverdicted novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
Reinforcing Multimodal Reasoning Against Visual Degradation
cs.CV 2026-05 unverdicted novelty 6.0

ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
cs.CV 2026-05 unverdicted novelty 6.0

Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
cs.CV 2026-04 unverdicted novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework
cs.CV 2025-09 unverdicted novelty 6.0

DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
cs.CV 2025-06 unverdicted novelty 6.0

VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards
cs.CV 2026-04 unverdicted novelty 5.0

A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
cs.CV 2025-10 unverdicted novelty 5.0

NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal ...
UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
cs.CV 2025-08 unverdicted novelty 5.0

UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and...
RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
cs.CV 2025-06 unverdicted novelty 5.0

RealSR-R1 introduces VLCoT-GRPO with four rewards to add understanding and reasoning to real-world image super-resolution models.
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
cs.CV 2025-04 unverdicted novelty 5.0

Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 4.0

A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on aver...

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 20 Pith papers · 36 internal anchors

[1]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

The claude 3 model family: Opus, sonnet, haiku

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. claude-3 model card. 2024

work page 2024
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Mapqa: A dataset for question answering on choropleth maps.arXiv preprint arXiv:2211.08545, 2022

Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps.arXiv preprint arXiv:2211.08545, 2022

work page arXiv 2022
[5]

Vl- thinking: An r1-derived visual instruction tuning dataset for thinkable lvlms.https://github

Hardy Chen, Haoqin Tu, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Vl- thinking: An r1-derived visual instruction tuning dataset for thinkable lvlms.https://github. com/UCSC-VLAA/VL-Thinking, 2025

work page 2025
[6]

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models.arXiv preprint arXiv:2504.11468, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning.arXiv preprint arXiv:2105.14517, 2021

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning.arXiv preprint arXiv:2105.14517, 2021

work page arXiv 2021
[8]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

An empirical study on eliciting and improving r1-like reasoning models, 2025

Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models, 2025. URLhttps://arxiv.org/ abs/2503.04548

work page arXiv 2025
[11]

Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in Neural Information Processing Systems, 36, 2024

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[12]

RLHF Workflow: From Reward Modeling to Online RLHF

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024. 13 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904, 2025

Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904, 2025

work page arXiv 2025
[14]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint arXiv:2304.15010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Sphinx-x: Scaling data and parameters for a family of multi-modal large language models

Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, and Yu Qiao. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. InInternational Conference on Machine Lea...

work page 2024
[17]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach, 2025. URLhttps://arxiv.org/abs/2502.05171

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Gemini 2.5 pro, May 2025

Google. Gemini 2.5 pro, May 2025. URL https://deepmind.google/technologies/ gemini/

work page 2025
[19]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...

work page 2024
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024

work page arXiv 2024
[22]

Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

work page arXiv 2025
[23]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024. 14 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

work page arXiv 2024
[25]

Open- reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open- reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

work page 2025
[26]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.arXiv preprint arXiv:2406.09403, 2024

[27]

Vision-r1: Incentivizing reasoning capability in multimodal large language models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. URL https://arxiv.org/abs/2503.06749

[29]

O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?

Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024. URL https://arxiv.org/abs/2411.16489

[30]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

[31]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

[32]

FigureQA: An Annotated Figure Dataset for Visual Reasoning

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017

[33]

The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning

Seungone Kim, Se Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 126... URL https://aclanthology.org/2023.emnlp-main.782/

[35]

Mmr1: Advancing the frontiers of multimodal reasoning

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Deli Zhao, Fan Wang, Yu Rong, Aixin Sun†, and Shijian Lu†. Mmr1: Advancing the frontiers of multimodal reasoning. https://github.com/LengSicong/MMR1, 2025

[36]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

[37]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022

[38]

Symbolic chain-of-thought distillation: Small models can also "think" step-by-step

Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. In ACL, 2023

[39]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

[40]

Llava-next: Improved reasoning, ocr, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

[41]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling, 2025. URL https://arxiv.org/abs/2502.06703

[42]

Noisyrollout: Reinforcing visual reasoning with data augmentation

Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. Noisyrollout: Reinforcing visual reasoning with data augmentation. arXiv preprint arXiv:2504.13055, 2025

[43]

There may not be aha moment in r1-zero-like training — a pilot study

Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study, 2025. Notion Blog

[44]

There may not be aha moment in r1-zero-like training — a pilot study

Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study. https://oatllm.notion.site/oat-zero

[45]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

[46]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

[47]

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

[48]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

[49]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems, 2024. URL https://arxiv.org/abs/2412.09413

[51]

Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024

[52]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

[54]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

[55]

Iterative reasoning preference optimization

Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 37:116617–116637, 2024

[56]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

[57]

O1 replication journey: A strategic progress report – part 1

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report – part 1, 2024. URL https://arxiv.org/abs/2410.18982

[58]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

[59]

Scaling test-time compute without verification or rl is suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal, 2025. URL https://arxiv.org/abs/2502.12118

[60]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[61]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025

[62]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024

[63]

Beyond human data: Scaling self-training for problem-solving with language models

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023

[64]

When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning

Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, and Anna Rohrbach. When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning. arXiv preprint arXiv:2504.01005, 2025

[65]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[66]

Open Thoughts

OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

[67]

Qwq-32b: Embracing the power of reinforcement learning

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

[68]

Llamav-o1: Rethinking step-by-step visual reasoning in llms

Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025

[70]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[71]

Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models. arXiv preprint arXiv:2401.13311, 2024

[72]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

[73]

Measuring multimodal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

[74]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[75]

Visualprm: An effective process reward model for multimodal reasoning

Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025

[76]

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025

[77]

Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

[78]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

[79]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[80]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025. URL https://arxiv.org/abs/2503.10615

[81]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024


[21] [21]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024

work page arXiv 2024

[22] [22]

Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

work page arXiv 2025

[23] [23]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024. 14 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

work page arXiv 2024

[25] [25]

Open- reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open- reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

work page 2025

[26] [26]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.arXiv preprint arXiv:2406.09403, 2024

work page arXiv 2024

[27] [27]

Vision-r1: Incentivizing reasoning capability in multimodal large language models,

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models,

work page

[28] [28]

URLhttps://arxiv.org/abs/2503.06749

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024

Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024. URLhttps://arxiv.org/abs/2411. 16489

work page 2024

[30] [30]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

FigureQA: An Annotated Figure Dataset for Visual Reasoning

SamiraEbrahimi Kahou, Vincent Michalski, Adam Atkinson, ÁkosKádár, Adam Trischler, and Yoshua Bengio. Figureqa: Anannotatedfiguredatasetforvisualreasoning.arXivpreprintarXiv:1710.07300, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[33] [33]

The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of- thought fine-tuning

Seungone Kim, Se Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of- thought fine-tuning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 126...

work page doi:10.18653/v1/2023.emnlp-main 2023

[34] [34]

URLhttps://aclanthology.org/2023.emnlp-main.782/

work page 2023

[35] [35]

Mmr1: Advancing the frontiers of multimodal reasoning.https://github.com/LengSicong/MMR1, 2025

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Deli Zhao, Fan Wang, Yu Rong, Aixin Sun†, and Shijian Lu†. Mmr1: Advancing the frontiers of multimodal reasoning.https://github.com/LengSicong/MMR1, 2025

work page 2025

[36] [36]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965– 10975, 2022. 15 OpenVLThinker: Complex Vision-Language Reasoning via Itera...

work page 2022

[38] [38]

Symbolic chain-of-thought distillation: Small models can also "think" step-by-step

Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. InACL, 2023

work page 2023

[39] [39]

Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

work page 2023

[40] [40]

Llava- next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava- next: Improved reasoning, ocr, and world knowledge, January 2024. URLhttps://llava-vl. github.io/blog/2024-01-30-llava-next/

work page 2024

[41] [41]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling, 2025

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling, 2025. URL https://arxiv.org/abs/2502.06703

work page arXiv 2025

[42] [42]

Noisyrollout: Reinforcing visual reasoning with data augmentation. arXiv preprint arXiv:2504.13055, 2025

Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. Noisyrollout: Reinforcing visual reasoning with data augmentation. arXiv preprint arXiv:2504.13055, 2025

work page arXiv 2025

[43] [43]

There may not be aha moment in r1-zero-like training — a pilot study, 2025

Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study, 2025. Notion Blog

work page 2025

[44] [44]

There may not be aha moment in r1-zero-like training — a pilot study. https://oatllm.notion.site/oat-zero

Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study. https://oatllm.notion.site/oat-zero

work page

[45] [45]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

work page 2025

[48] [48]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. URL https://arxiv.org/abs/2412.09413

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18798–18806, 2024.

work page 2024

[52] [52]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. URL https://arxiv.org/abs/2501.19393

work page internal anchor Pith review Pith/arXiv arXiv

[54] [54]

Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

work page 2022

[55] [55]

Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 37:116617–116637, 2024

Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 37:116617–116637, 2024

work page 2024

[56] [56]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

O1 replication journey: A strategic progress report - part 1. CoRR, abs/2410.18982, 2024

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report – part 1, 2024. URL https://arxiv.org/abs/2410.18982

work page arXiv 2024

[58] [58]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

work page 2021

[59] [59]

Scaling test-time compute without verification or rl is suboptimal. arXiv preprint arXiv:2502.12118, 2025

Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal, 2025. URL https://arxiv.org/abs/2502.12118

work page arXiv 2025

[60] [60]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[61] [61]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [62]

Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294, 2024

work page arXiv 2024

[63] [63]

Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023

work page arXiv 2023

[64] [64]

When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning. arXiv preprint arXiv:2504.01005, 2025

Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, and Anna Rohrbach. When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning. arXiv preprint arXiv:2504.01005, 2025.

work page arXiv 2025

[65] [65]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[66] [66]

Open Thoughts

OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

work page 2025

[67] [67]

Qwq-32b: Embracing the power of reinforcement learning, March 2025

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

work page 2025

[68] [68]

Llamav-o1: Rethinking step-by-step visual reasoning in llms

Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025

work page arXiv 2025

[69] [70]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[70] [71]

Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models. arXiv preprint arXiv:2401.13311, 2024

Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models. arXiv preprint arXiv:2401.13311, 2024

work page arXiv 2024

[71] [72]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[72] [73]

Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

work page 2024

[73] [74]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[74] [75]

Visualprm: An effective process reward model for multimodal reasoning

Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025

work page arXiv 2025

[75] [76]

Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025

work page arXiv 2025

[76] [77]

Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

work page 2025

[77] [78]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024.

work page internal anchor Pith review Pith/arXiv arXiv 2024

[78] [79]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[79] [80]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025. URL https://arxiv.org/abs/2503.10615

work page internal anchor Pith review Pith/arXiv arXiv 2025

[80] [81]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024

work page arXiv 2024