R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Pith reviewed 2026-05-19 07:09 UTC · model grok-4.3
pith:I5D2TXP4
The pith
Reinforcement learning on a non-SFT 2B vision-language model produces self-reflective visual reasoning and large accuracy gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying reinforcement learning with rule-based rewards directly to the base Qwen2-VL-2B model on the SAT dataset elicits emergent self-reflection, longer responses, and a jump to 59.47 percent accuracy on CVBench, beating the base model by about 30 percent and supervised fine-tuning runs by about 2 percent.
What carries the argument
Reinforcement learning with simple rule-based incentives applied straight to the non-SFT Qwen2-VL-2B checkpoint on the SAT dataset.
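To make "simple rule-based incentives" concrete, here is a minimal sketch of what such a reward might look like. It assumes an R1-style output template with `<think>` and `<answer>` tags and equal weights of 1.0 for format and correctness. The tag names, the weights, and the function name `rule_based_reward` are illustrative assumptions, not details confirmed on this page.

```python
import re

# Assumed R1-style output template; this page does not spell out the actual format.
_TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def _normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Hypothetical reward: 1.0 for following the template, plus 1.0 if the answer matches."""
    match = _TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0  # malformed output earns nothing
    format_reward = 1.0
    is_correct = _normalize(match.group(2)) == _normalize(gold_answer)
    accuracy_reward = 1.0 if is_correct else 0.0
    return format_reward + accuracy_reward
```

Because both terms are checked by rules rather than by a learned reward model, the signal is cheap to compute and hard to misread, which is the property the R1-style recipe relies on.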
If this is right
- CVBench accuracy reaches 59.47 percent, an improvement of roughly 30 percent over the base model.
- Self-reflective reasoning trajectories and longer responses appear during the RL training run.
- RL applied to already-instruct-tuned models produces only trivial reasoning chains.
- Adding a naive length reward does not reliably elicit genuine reasoning.
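One plausible reason a naive length reward falls short can be shown with a small sketch. The page does not say how the length reward was defined, so the linear, capped form below, the name `naive_length_reward`, and the 200-word target are assumptions chosen only for illustration.

```python
def naive_length_reward(completion: str, target_words: int = 200) -> float:
    """Hypothetical bonus that grows linearly with word count up to a cap."""
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    return min(len(completion.split()), target_words) / target_words


# Filler text with no reasoning content at all.
padded = "wait " * 200
# A short but meaningful spatial judgement.
reasoned = "The cup sits in front of the laptop, so it is closer to the camera."

# The filler out-scores the real reasoning, so the reward can be satisfied
# without the model reasoning any better.
assert naive_length_reward(padded) > naive_length_reward(reasoned)
```

Under this assumed form, length is rewarded regardless of content, so a policy can raise its score by padding. Whether that is the failure mode the authors observed is not stated here.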
Where Pith is reading between the lines
- Base models that have never seen supervised fine-tuning may keep more capacity for emergent complex behavior under RL.
- The same direct-RL recipe could be tested on other small vision-language models to see whether the aha moment generalizes.
- Dataset choice may interact with the absence of SFT in ways that make self-reflection easier to surface.
Load-bearing premise
The observed accuracy rise and self-reflective trajectories are produced by the reinforcement learning process rather than by the particular SAT dataset, random seeds, or other implementation choices.
What would settle it
Re-running the identical RL procedure on the same base model but with a fresh random seed or a different training set that yields neither self-reflection nor the reported accuracy increase would falsify the claim.
Original abstract
Recently DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately ~30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims the first successful replication of DeepSeek R1's emergent 'aha moment' (self-reflection and increased response length) in multimodal visual reasoning. Starting from the non-SFT Qwen2-VL-2B model and applying RL with rule-based rewards directly on the SAT dataset, the resulting model reaches 59.47% accuracy on CVBench (approximately 30% above the base model and about 2% above both SFT settings). The authors also document failed attempts at applying RL to instruct-tuned models and release the training code.
Significance. If the central empirical findings hold, the work shows that R1-style reasoning behaviors can emerge in small (2B) multimodal models without prior SFT, lowering the resource barrier for visual reasoning systems. The explicit release of code at https://github.com/turningpoint-ai/VisualThinker-R1-Zero is a clear strength that enables direct reproduction and extension.
major comments (2)
- [Results] Results section: The manuscript reports final accuracy and selected self-reflective trajectories after RL on the SAT dataset, yet provides no ablation that holds the SAT data fixed while changing the training objective (e.g., SFT on the identical SAT examples or RL with a non-reasoning reward). This omission leaves open the possibility that longer responses and self-reflection are driven by dataset statistics rather than the RL dynamics, directly weakening the attribution required by the central claim.
- [Experiments] Experiments / Training procedure: No quantitative tracking of the 'aha moment' is supplied (e.g., plots of average response length or frequency of reflective phrases across training steps, with error bars or multiple seeds). The claim of emergent behavior therefore rests on qualitative trajectory examples whose reproducibility cannot be assessed from the reported data.
minor comments (2)
- [Abstract] Abstract: The phrase 'approximately ~30%' contains a redundant symbol; 'approximately 30%' is sufficient.
- [Method] The manuscript would benefit from a short appendix table listing the exact reward formulation, learning-rate schedule, and number of training steps so that the RL setup can be reproduced without inspecting the GitHub repository.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the attribution of emergent reasoning behaviors to the RL procedure. We address each major point below and will incorporate the suggested analyses in the revised manuscript to strengthen the central claims.
Point-by-point responses
- Referee: [Results] Results section: The manuscript reports final accuracy and selected self-reflective trajectories after RL on the SAT dataset, yet provides no ablation that holds the SAT data fixed while changing the training objective (e.g., SFT on the identical SAT examples or RL with a non-reasoning reward). This omission leaves open the possibility that longer responses and self-reflection are driven by dataset statistics rather than the RL dynamics, directly weakening the attribution required by the central claim.
  Authors: We agree that additional controlled ablations are necessary to isolate the contribution of RL dynamics from dataset effects. In the revised manuscript we will add (i) an SFT baseline trained on the identical SAT examples used for RL and (ii) an RL run that employs a reward based solely on final-answer correctness without length or reflection incentives. These results will be presented alongside the original findings to better support the attribution of the observed 'aha moment' to the RL objective.
  revision: yes
- Referee: [Experiments] Experiments / Training procedure: No quantitative tracking of the 'aha moment' is supplied (e.g., plots of average response length or frequency of reflective phrases across training steps, with error bars or multiple seeds). The claim of emergent behavior therefore rests on qualitative trajectory examples whose reproducibility cannot be assessed from the reported data.
  Authors: We acknowledge the value of quantitative evidence for reproducibility. The revised manuscript will include plots tracking average response length and the frequency of reflective phrases (such as 'wait' or 'reconsider') over training steps. Where computational resources permit, we will report results across multiple random seeds with error bars to allow assessment of the stability of the emergent behavior.
  revision: yes
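The quantitative tracking requested in the second point could be computed along the following lines. This is a sketch under stated assumptions: completions are assumed to be logged per seed and per training step as plain strings, the marker list contains only the two phrases named in the rebuttal, and the function names and log layout are hypothetical rather than taken from the released code.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev

# Marker list is illustrative; 'wait' and 'reconsider' are the examples from the rebuttal.
REFLECTIVE = re.compile(r"\b(wait|reconsider)\b", re.IGNORECASE)


def summarize_step(completions):
    """Return (mean word count, fraction of completions containing a reflective marker)."""
    lengths = [len(c.split()) for c in completions]
    reflective = [1.0 if REFLECTIVE.search(c) else 0.0 for c in completions]
    return mean(lengths), mean(reflective)


def aggregate_over_seeds(logs):
    """logs: {seed: {step: [completion, ...]}} -> {step: {metric: (mean, std) across seeds}}."""
    per_step = defaultdict(list)
    for steps in logs.values():
        for step, completions in steps.items():
            per_step[step].append(summarize_step(completions))
    summary = {}
    for step, rows in sorted(per_step.items()):
        lengths = [row[0] for row in rows]
        rates = [row[1] for row in rows]
        summary[step] = {
            "length": (mean(lengths), pstdev(lengths)),
            "reflection_rate": (mean(rates), pstdev(rates)),
        }
    return summary
```

The per-step means give the curves for a plot, and the across-seed standard deviations give the error bars. Keyword matching is a coarse proxy for self-reflection, so a spot check of matched completions would still be needed.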
Circularity Check
Empirical replication study with no definitional or self-referential derivations
Full rationale
The paper reports direct experimental outcomes from applying reinforcement learning to the Qwen2-VL-2B base model on the SAT dataset, including measured accuracy of 59.47% on CVBench, response length increases, and selected trajectories. No equations, parameter fits presented as predictions, ansatzes, or mathematical derivations appear in the text. Central claims rest on observable benchmark results and code release rather than self-citations or reductions to inputs by construction. The work is self-contained as an empirical replication attempt, with any attribution questions addressable via external controls rather than internal definitional circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The SAT dataset supplies training signals that elicit genuine visual reasoning rather than superficial patterns.
Forward citations
Cited by 22 Pith papers
- Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
  A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
- Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
  Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
- Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
  A masking-based think-answer distillation method for VLMs that selectively hides reasoning prefixes and uses self-paced scheduling to improve visual anchoring and benchmark performance.
- Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
  A new distillation method uses token-wise salient reasoning-prefix masking and self-paced scheduling to anchor student VLM thinking on visual inputs, outperforming prior distillation approaches on multimodal reasoning...
- Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
  A reasoning-prefix masking strategy during VLM distillation encourages students to anchor their thinking on visual evidence, yielding better multimodal reasoning than prior distillation baselines.
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
  CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
- Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
  ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...
- One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
  Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
- Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
  RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
  Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
- Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
  Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
- UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
  UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
  Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
- Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
  Geo-R1 uses reasoning-centric reinforcement fine-tuning to improve few-shot performance and generalization in geospatial referring expression understanding over supervised baselines.
- Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
  Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
  Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
- From System 1 to System 2: A Survey of Reasoning Large Language Models
  The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.