pith. sign in

arxiv: 2503.17352 · v3 · pith:DDVCFKBVnew · submitted 2025-03-21 · 💻 cs.CV · cs.CL

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Pith reviewed 2026-05-19 06:54 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language modelschain-of-thought reasoningsupervised fine-tuningreinforcement learningiterative trainingmultimodal reasoningvisual reasoning benchmarks
0
0 comments X p. Extension
pith:DDVCFKBV Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{DDVCFKBV}

Prints a linked pith:DDVCFKBV badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Alternating SFT and RL cycles enable 7B vision-language models to develop complex chain-of-thought reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cycling between supervised fine-tuning and reinforcement learning can bootstrap sophisticated reasoning in open-source large vision-language models. Starting from models that show little reasoning, SFT draws out latent behaviors while RL narrows the vast search space that otherwise prevents progress in smaller models. This alternation creates a self-improving loop where each RL stage generates better data for the next SFT round. A sympathetic reader would care because pure distillation from text reasoning models fails on visual grounding and standalone RL struggles with exploration in multimodal settings. The result is measurable gains across benchmarks that test mathematical and general visual reasoning.

Core claim

By alternating supervised fine-tuning with reinforcement learning over several iterations, OpenVLThinker-7B develops chain-of-thought reasoning capabilities that the base model initially lacks. The process begins with SFT to surface reasoning actions and reduce the RL search space, followed by RL to refine those skills and produce higher-quality training data for subsequent cycles, ultimately delivering performance improvements on demanding visual reasoning benchmarks.

What carries the argument

The iterative SFT-RL cycle, in which supervised fine-tuning surfaces latent reasoning behaviors to make the reinforcement learning search space tractable and each RL stage then refines the model to generate improved data for the next fine-tuning step.

If this is right

  • The 7B model shows a 3.8% gain on MathVista, a 2.4% gain on EMMA, and a 1.6% gain on HallusionBench.
  • Each RL stage produces higher-quality reasoning traces that improve the next round of supervised fine-tuning.
  • The method supplies early evidence that R1-style reflective reasoning can be achieved in multimodal models.
  • The cycle progressively narrows the search space so that reflective behaviors emerge in smaller models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alternation might accelerate reasoning on other multimodal tasks such as complex visual question answering.
  • Adjusting cycle length or reward design could let the approach work with even smaller base models.
  • The loop may lower the total amount of human-annotated reasoning data needed to reach a given capability level.

Load-bearing premise

The base model possesses latent reasoning behaviors that supervised fine-tuning can surface and amplify to make reinforcement learning effective.

What would settle it

Training the base 7B model through one or more SFT-RL cycles and finding no emergence of chain-of-thought traces or no gains on visual reasoning benchmarks would falsify the claim.

read the original abstract

We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OpenVLThinker, an open-source 7B LVLM that develops sophisticated chain-of-thought reasoning for visual tasks through iterative cycles alternating between supervised fine-tuning (SFT) and reinforcement learning (RL). It claims that pure SFT from text models degrades due to poor visual grounding while pure RL suffers from large search spaces in smaller models; the alternation surfaces latent reasoning behaviors, narrows the RL search space, and yields self-improving data, producing benchmark gains such as +3.8% on MathVista, +2.4% on EMMA, and +1.6% on HallusionBench. Code, model, and data are released.

Significance. If the iterative SFT-RL synergy and its mechanistic explanation hold under controlled experiments, the work would provide a practical, reproducible recipe for eliciting R1-style reasoning in multimodal models, addressing a key gap between text-only advances and vision-language settings. The open release of code, model, and data is a clear strength that facilitates verification and extension.

major comments (2)
  1. [Analysis / Results] The central explanatory claim that 'SFT effectively surfaces these latent actions and narrows the RL search space' (abstract and analysis) is load-bearing for attributing gains to the alternation rather than extra gradient steps or data volume, yet no direct supporting metrics are provided such as the fraction of reasoning traces, average reward curves, or search-space statistics before versus after each SFT stage.
  2. [Experimental results] Table or figure reporting benchmark results: the improvements (MathVista +3.8%, EMMA +2.4%, HallusionBench +1.6%) are presented without error bars, multiple random seeds, or statistical significance tests, and no ablation comparing iterative SFT-RL to continued SFT, continued RL, or non-alternating schedules is described, making it difficult to isolate the contribution of the proposed cycle.
minor comments (2)
  1. [Abstract] The abstract mentions gains 'across six benchmarks' but details only three; listing all six with their respective deltas would improve completeness.
  2. [Method] Notation for the iterative procedure (e.g., how SFT data is generated from RL outputs and vice versa) could be clarified with a concise algorithm box or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. The comments highlight important areas for strengthening the attribution of gains to the iterative SFT-RL process and for improving experimental rigor. We address each major comment below and have incorporated revisions to the manuscript accordingly.

read point-by-point responses
  1. Referee: [Analysis / Results] The central explanatory claim that 'SFT effectively surfaces these latent actions and narrows the RL search space' (abstract and analysis) is load-bearing for attributing gains to the alternation rather than extra gradient steps or data volume, yet no direct supporting metrics are provided such as the fraction of reasoning traces, average reward curves, or search-space statistics before versus after each SFT stage.

    Authors: We agree that quantitative metrics would provide stronger, more direct support for the mechanistic claim. In the revised manuscript we have added a new subsection (4.3) and accompanying Figure 4 that reports (i) the fraction of generated traces containing explicit chain-of-thought reasoning before and after each SFT stage, (ii) average reward curves across RL iterations, and (iii) search-space statistics approximated by the variance and average length of reasoning paths. These metrics show a consistent increase in reasoning-trace frequency and a reduction in path variance immediately after SFT, supporting the claim that SFT narrows the effective search space for subsequent RL. We also include a brief discussion of how these quantities evolve over the full iterative cycle. revision: yes

  2. Referee: [Experimental results] Table or figure reporting benchmark results: the improvements (MathVista +3.8%, EMMA +2.4%, HallusionBench +1.6%) are presented without error bars, multiple random seeds, or statistical significance tests, and no ablation comparing iterative SFT-RL to continued SFT, continued RL, or non-alternating schedules is described, making it difficult to isolate the contribution of the proposed cycle.

    Authors: We acknowledge that the current presentation lacks statistical robustness and direct ablations. We have rerun all final evaluations with three independent random seeds and added standard-deviation error bars to Table 1. We also report paired t-test p-values for the main benchmark improvements. In addition, we have inserted a new ablation subsection (5.4) and Table 3 that compares the full iterative schedule against (a) continued SFT for an equivalent total number of gradient steps, (b) continued RL without SFT interleaving, and (c) a non-alternating mixed SFT+RL schedule. The iterative approach outperforms these baselines by 1.4–2.1 points on MathVista, consistent with the value of alternation. These results are now included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

Empirical iterative training procedure with external benchmark evaluation

full rationale

The manuscript describes an empirical training loop of alternating supervised fine-tuning and reinforcement learning on vision-language models, with final performance measured on independent external benchmarks (MathVista, EMMA, HallusionBench). No mathematical derivation, equations, or fitted parameters are presented whose outputs are defined in terms of the inputs. The interpretive claim that SFT narrows the RL search space is supported by end-to-end results rather than by any self-referential construction or load-bearing self-citation. The work is therefore self-contained against external evaluation and exhibits no circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on standard machine-learning assumptions about the existence of latent reasoning behaviors in base LVLMs and the ability of SFT to surface them without introducing new free parameters beyond ordinary training hyperparameters.

free parameters (1)
  • SFT and RL training hyperparameters
    Learning rates, batch sizes, and iteration counts for each SFT and RL stage are chosen to make the cycle work.
axioms (1)
  • domain assumption Base LVLM possesses latent reasoning behaviors that SFT can surface and that RL can then refine
    Invoked in the analysis paragraph explaining why the first SFT stage narrows the RL search space.

pith-pipeline@v0.9.0 · 5825 in / 1291 out tokens · 37990 ms · 2026-05-19T06:54:31.810757+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  2. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  3. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  4. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  5. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  6. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  7. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  8. Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

    cs.CV 2026-05 unverdicted novelty 6.0

    Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.

  9. DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

  10. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  11. Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

    cs.CV 2025-09 unverdicted novelty 6.0

    DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.

  12. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    cs.CV 2025-06 unverdicted novelty 6.0

    VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.

  13. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  14. ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

    cs.CV 2026-04 unverdicted novelty 5.0

    A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.

  15. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  16. NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

    cs.CV 2025-10 unverdicted novelty 5.0

    NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal ...

  17. UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

    cs.CV 2025-08 unverdicted novelty 5.0

    UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and...

  18. RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

    cs.CV 2025-06 unverdicted novelty 5.0

    RealSR-R1 introduces VLCoT-GRPO with four rewards to add understanding and reasoning to real-world image super-resolution models.

  19. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV 2025-04 unverdicted novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

  20. Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 4.0

    A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on aver...

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 20 Pith papers · 36 internal anchors

  1. [1]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740, 2024

  2. [2]

    The claude 3 model family: Opus, sonnet, haiku

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. claude-3 model card. 2024

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Mapqa: A dataset for question answering on choropleth maps.arXiv preprint arXiv:2211.08545, 2022

    Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps.arXiv preprint arXiv:2211.08545, 2022

  5. [5]

    Vl- thinking: An r1-derived visual instruction tuning dataset for thinkable lvlms.https://github

    Hardy Chen, Haoqin Tu, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Vl- thinking: An r1-derived visual instruction tuning dataset for thinkable lvlms.https://github. com/UCSC-VLAA/VL-Thinking, 2025

  6. [6]

    SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models.arXiv preprint arXiv:2504.11468, 2025

  7. [7]

    Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning.arXiv preprint arXiv:2105.14517, 2021

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning.arXiv preprint arXiv:2105.14517, 2021

  8. [8]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024

  9. [9]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  10. [10]

    An empirical study on eliciting and improving r1-like reasoning models, 2025

    Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. An empirical study on eliciting and improving r1-like reasoning models, 2025. URLhttps://arxiv.org/ abs/2503.04548

  11. [11]

    Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in Neural Information Processing Systems, 36, 2024

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in Neural Information Processing Systems, 36, 2024

  12. [12]

    RLHF Workflow: From Reward Modeling to Online RLHF

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024. 13 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

  13. [13]

    Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904, 2025

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like mllm.arXiv preprint arXiv:2501.01904, 2025

  14. [14]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  15. [15]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint arXiv:2304.15010, 2023

  16. [16]

    Sphinx-x: Scaling data and parameters for a family of multi-modal large language models

    Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, and Yu Qiao. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. InInternational Conference on Machine Lea...

  17. [17]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach, 2025. URLhttps://arxiv.org/abs/2502.05171

  18. [18]

    Gemini 2.5 pro, May 2025

    Google. Gemini 2.5 pro, May 2025. URL https://deepmind.google/technologies/ gemini/

  19. [19]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...

  20. [20]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  21. [21]

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237, 2024

  22. [22]

    Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

    Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444, 2025

  23. [23]

    DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

  24. [24]

    V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024

    Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024. 14 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

  25. [25]

    Open- reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open- reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

  26. [26]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.arXiv preprint arXiv:2406.09403, 2024

  27. [27]

    Vision-r1: Incentivizing reasoning capability in multimodal large language models,

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models,

  28. [28]

    URLhttps://arxiv.org/abs/2503.06749

  29. [29]

    O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024

    Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey – part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?, 2024. URLhttps://arxiv.org/abs/2411. 16489

  30. [30]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  31. [31]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  32. [32]

    FigureQA: An Annotated Figure Dataset for Visual Reasoning

    SamiraEbrahimi Kahou, Vincent Michalski, Adam Atkinson, ÁkosKádár, Adam Trischler, and Yoshua Bengio. Figureqa: Anannotatedfiguredatasetforvisualreasoning.arXivpreprintarXiv:1710.07300, 2017

  33. [33]

    The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of- thought fine-tuning

    Seungone Kim, Se Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. The CoT collection: Improving zero-shot and few-shot learning of language models via chain-of- thought fine-tuning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 126...

  34. [34]

    URLhttps://aclanthology.org/2023.emnlp-main.782/

  35. [35]

    Mmr1: Advancing the frontiers of multimodal reasoning.https://github.com/LengSicong/MMR1, 2025

    Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Deli Zhao, Fan Wang, Yu Rong, Aixin Sun†, and Shijian Lu†. Mmr1: Advancing the frontiers of multimodal reasoning.https://github.com/LengSicong/MMR1, 2025

  36. [36]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  37. [37]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965– 10975, 2022. 15 OpenVLThinker: Complex Vision-Language Reasoning via Itera...

  38. [38]

    Symbolic chain-of-thought distillation: Small models can also "think" step-by-step

    Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. InACL, 2023

  39. [39]

    Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

  40. [40]

    Llava- next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava- next: Improved reasoning, ocr, and world knowledge, January 2024. URLhttps://llava-vl. github.io/blog/2024-01-30-llava-next/

  41. [41]

    Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling, 2025

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling, 2025. URL https://arxiv.org/abs/2502.06703

  42. [42]

    Noisyrollout: Reinforcing visual reasoning with data augmentation.arXiv preprint arXiv:2504.13055, 2025

    Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. Noisyrollout: Reinforcing visual reasoning with data augmentation.arXiv preprint arXiv:2504.13055, 2025

  43. [43]

    There may not be aha moment in r1-zero-like training — a pilot study, 2025

    Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study, 2025. Notion Blog

  44. [44]

    There may not be aha moment in r1-zero-like training — a pilot study.https://oatllm.notion.site/oat-zero,

    Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study.https://oatllm.notion.site/oat-zero,

  45. [45]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

  46. [46]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  47. [47]

    Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

  48. [48]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning.arXiv preprint arXiv:2503.07365, 2025

  49. [49]

    Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems,

    Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems,

  50. [50]

    URLhttps://arxiv.org/abs/2412.09413

  51. [51]

    Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

    Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024. 16 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

  52. [52]

    s1: Simple test-time scaling,

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling,

  53. [53]

    URLhttps://arxiv.org/abs/2501.19393

  54. [54]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730– 27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730– 27744, 2022

  55. [55]

    Iterative reasoning preference optimization.Advances in Neural Information Processing Systems, 37:116617–116637, 2024

    Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization.Advances in Neural Information Processing Systems, 37:116617–116637, 2024

  56. [56]

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    RunqiQiao, QiunaTan, GuantingDong, MinhuiWu, ChongSun, XiaoshuaiSong, ZhuomaGongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

  57. [57]

    O1 replication journey: A strategic progress report - part 1.CoRR, abs/2410.18982, 2024

    Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report – part 1, 2024. URLhttps://arxiv.org/abs/2410.18982

  58. [58]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

  59. [59]

    Scaling test-time compute without verification or rl is suboptimal.arXiv preprint arXiv:2502.12118, 2025

    Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal, 2025. URLhttps://arxiv.org/abs/2502.12118

  60. [60]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  61. [61]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  62. [62]

    Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

    Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models.arXiv preprint arXiv:2406.17294, 2024

  63. [63]

    Beyond human data: Scaling self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585, 2023

    Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585, 2023

  64. [64]

    When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning.arXiv preprint arXiv:2504.01005, 2025

    Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, and Anna Rohrbach. When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning.arXiv preprint arXiv:2504.01005, 2025. 17 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

  65. [65]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  66. [66]

    Open Thoughts

    OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

  67. [67]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  68. [68]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms.arXiv preprint arXiv:2501.06186, 2025

  69. [70]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  70. [71]

    Contextual: Evaluating context- sensitive text-rich visual reasoning in large multimodal models.arXiv preprint arXiv:2401.13311, 2024

    Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context- sensitive text-rich visual reasoning in large multimodal models.arXiv preprint arXiv:2401.13311, 2024

  71. [72]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

  72. [73]

    Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems, 37:95095–95169, 2024

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems, 37:95095–95169, 2024

  73. [74]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  74. [75]

    Visualprm: An effective process reward model for multimodal reasoning

    Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning.arXiv preprint arXiv:2503.10291, 2025

  75. [76]

    Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

  76. [77]

    Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025

  77. [78]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step.arXiv preprint arXiv:2411.10440, 2024. 18 OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

  78. [79]

    AnYang,BaosongYang,BeichenZhang,BinyuanHui,BoZheng,BowenYu,ChengyuanLi,Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

  79. [80]

    R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025. URLhttps://arxiv.org/abs/ 2503.10615

  80. [81]

    Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search.arXiv preprint arXiv:2412.18319, 2024

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search.arXiv preprint arXiv:2412.18319, 2024

Showing first 80 references.