SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Pith reviewed 2026-05-17 15:39 UTC · model grok-4.3
The pith
SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that supervised fine-tuning on expert-generated reasoning traces induces pseudo reasoning paths. These paths may appear similar to native reasoning but consist of prolonged, hesitant, less informative steps and incorrect reasoning. This imitation helps models learn output formats yet locks them into rigid modes that reduce the gains from later reinforcement learning. In contrast, reinforcement learning that directly optimizes with combined perception and cognition signals produces more adaptive reasoning behavior without the same imitative constraints.
What carries the argument
Pseudo reasoning paths: imitative step-by-step traces copied from expert models during supervised fine-tuning that resemble but fail to match the flexible, informative reasoning that reinforcement learning can develop.
If this is right
- SFT teaches basic reasoning formats but restricts models from improving beyond imitative patterns in subsequent RL stages.
- RL with rewards that combine perception accuracy and cognitive step quality supports the emergence of genuine adaptive reasoning.
- Models trained without prior SFT or with careful RL setups reach higher accuracy on visual reasoning benchmarks than those following the standard SFT-first sequence.
- The order of training methods matters because early imitation can embed structural habits that later optimization struggles to overwrite.
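The second point above, a reward that combines perception and cognition signals, can be illustrated with a minimal sketch. It is not the paper's reward module: the exact-match perception check, the externally supplied `cognition_score` in [0, 1], the `RewardWeights` class, and the equal default weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not state how the signals are weighted.
    perception: float = 0.5
    cognition: float = 0.5


def perception_reward(predicted: str, reference: str) -> float:
    """Rule-based signal: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def mixed_reward(
    predicted: str,
    reference: str,
    cognition_score: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted average of the perception signal and a cognition score.

    `cognition_score` stands in for any judgment of reasoning quality
    (for example, from a reward model) and must already lie in [0, 1].
    """
    if not 0.0 <= cognition_score <= 1.0:
        raise ValueError("cognition_score must be in [0, 1]")
    total = weights.perception + weights.cognition
    weighted = (
        weights.perception * perception_reward(predicted, reference)
        + weights.cognition * cognition_score
    )
    return weighted / total
```

Normalizing by the sum of the weights keeps the result in [0, 1] regardless of how the two signals are balanced.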
Where Pith is reading between the lines
- Curating SFT data to remove or shorten hesitant steps might reduce the negative carry-over effect into RL without discarding the format-learning benefit.
- The same imitation problem could appear when training language-only models on chain-of-thought data, suggesting the finding is not limited to vision inputs.
- Reusing the six-step dataset pipeline for new domains would allow direct tests of whether the pseudo-path effect generalizes beyond the original visual tasks.
Load-bearing premise
The performance differences between SFT-then-RL and RL-only training arise specifically from the induction of pseudo reasoning paths rather than from differences in data difficulty, reward design, or training details.
What would settle it
Run SFT-then-RL and RL-only training on the exact same data splits, reward functions, and hyperparameters, then inspect the generated reasoning traces for hesitation length and error rates; if the performance gap vanishes or hesitation does not appear after SFT, the central claim would not hold.
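One way to make the trace inspection concrete is a simple hesitation metric. The sketch below is an assumption-laden illustration: the marker list `HESITATION_MARKERS` and the function `hesitation_stats` are hypothetical and do not come from the paper, which may measure hesitation differently.

```python
import re

# Hypothetical marker phrases; a real study would need a validated list.
HESITATION_MARKERS = (
    "wait",
    "hmm",
    "actually",
    "on second thought",
    "let me double-check",
)


def hesitation_stats(trace: str) -> dict:
    """Count hesitation markers and report their rate per whitespace-separated word."""
    lowered = trace.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", lowered))
        for marker in HESITATION_MARKERS
    )
    n_words = len(lowered.split())
    rate = hits / n_words if n_words else 0.0
    return {"words": n_words, "hesitations": hits, "rate": rate}
```

Comparing the distribution of `rate` and `words` between SFT-then-RL traces and RL-only traces, on the same prompts, would give a rough quantitative handle on "prolonged, hesitant" reasoning.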
Original abstract
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on the Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area.
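The abstract builds its RL approach on GRPO. As a rough illustration of the group-relative idea, the sketch below normalizes the rewards of several responses sampled for the same prompt by the group's mean and standard deviation. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions; implementations differ on these details, and this is not the paper's training code.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    Responses that beat the group average get a positive advantage,
    those below it a negative one; no learned value function is needed.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.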
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SFT can significantly undermine subsequent RL for training reasoning LVLMs by inducing 'pseudo reasoning paths' (prolonged, hesitant, less informative, or incorrect traces imitated from experts). It introduces the VLAA-Thinking dataset, constructed via a six-step pipeline of captioning, reasoning distillation, answer rewrite, and verification, with an SFT split and a more challenging RL split. Experiments compare SFT, RL, and combinations using GRPO with a novel mixed perception+cognition reward; the resulting VLAA-Thinker (Qwen2.5VL 3B) achieves top-1 on the Open LMM Reasoning Leaderboard among 4B-scale models, surpassing prior SOTA by 1.8%.
Significance. If substantiated, the result would challenge the standard SFT-then-RL pipeline for multimodal reasoning models and emphasize the value of direct RL with carefully designed rewards for fostering adaptive rather than imitative behavior. The introduction of VLAA-Thinking and the leaderboard result constitute concrete contributions to the empirical study of training order effects in LVLMs.
major comments (1)
- The central attribution of the SFT-then-RL vs. RL-only performance gap to induction of pseudo-reasoning paths is load-bearing for the main claim, yet the manuscript provides no explicit statement that the two regimes were matched on data difficulty distribution, exact formulation of the mixed reward, GRPO hyperparameters, learning-rate schedules, or batching. Without such controls, alternative explanations (data difficulty, reward design, or hyperparameter differences) cannot be ruled out.
minor comments (1)
- The abstract and experimental description mention 'extensive experiments' and leaderboard gains but omit details on baseline controls, statistical significance testing, or ablations isolating the mixed reward; these should be added for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concern regarding experimental controls is important for strengthening the attribution of results to pseudo-reasoning paths, and we address it directly below with plans for revision.
- Referee: The central attribution of the SFT-then-RL vs. RL-only performance gap to induction of pseudo-reasoning paths is load-bearing for the main claim, yet the manuscript provides no explicit statement that the two regimes were matched on data difficulty distribution, exact formulation of the mixed reward, GRPO hyperparameters, learning-rate schedules, or batching. Without such controls, alternative explanations (data difficulty, reward design, or hyperparameter differences) cannot be ruled out.
  Authors: We agree that explicit documentation of controls is essential to support the central claim. In our experiments, the SFT-then-RL and RL-only regimes were matched on the following: both drew from the identical VLAA-Thinking dataset with the same difficulty distribution (using the SFT split for format learning where applicable and the more challenging RL split for the reinforcement phase in both settings); the mixed perception+cognition reward was formulated and weighted identically; and GRPO was executed with the same hyperparameters, learning-rate schedules, and batch sizes. We will revise the manuscript by adding a dedicated subsection (and accompanying table) in the Experiments section that explicitly lists these matched settings. This addition will directly address the referee's concern and help rule out alternative explanations.
  Revision: yes
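A matched-settings claim of this kind can be checked mechanically. The sketch below compares two configuration records field by field and reports any that differ. The `RLConfig` fields and the values used with it are hypothetical placeholders, not settings reported in the paper.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RLConfig:
    # Hypothetical fields chosen to mirror the controls the referee asks about.
    rl_split: str
    reward_name: str
    learning_rate: float
    batch_size: int
    group_size: int


def mismatched_fields(a: RLConfig, b: RLConfig) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}
```

An empty result means the RL phase of the two regimes is matched on every recorded field; any non-empty result names a candidate confound to rule out before attributing the gap to pseudo reasoning paths.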
Circularity Check
No significant circularity in empirical comparison study
This paper is an empirical investigation that introduces the VLAA-Thinking dataset via a six-step pipeline and compares SFT, RL, and combined training regimes on LVLMs using GRPO with a mixed perception-cognition reward. The central claims rest on experimental results, qualitative inspection of reasoning traces, and leaderboard performance rather than any mathematical derivation chain, equations, or fitted parameters that reduce to inputs by construction. No self-definitional steps, uniqueness theorems, or ansatzes imported via self-citation are present in the provided text. The work is self-contained against external benchmarks such as the Open LMM Reasoning Leaderboard and is replicable given the dataset and model details.
Lean theorems connected to this paper
- Cost.FunctionalEquation, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Learning Agentic Policy from Action Guidance
  ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
  CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
- Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
  RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
  DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
- Asking like Socrates: Socrates helps VLMs understand remote sensing images
  RS-EoT uses a SocraticAgent self-play system and two-stage RL to train VLMs for genuine iterative reasoning and visual inspection on remote sensing VQA and grounding tasks, achieving SOTA results.
- Latent Visual Reasoning
  Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
- Teacher-Guided Policy Optimization for LLM Distillation
  TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
- One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
  Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- Generalization in LLM Problem Solving: The Case of the Shortest Path
  LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
- Watch Before You Answer: Learning from Visually Grounded Post-Training
  Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
- Teaching an Agent to Sketch One Part at a Time
  A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- WebSailor: Navigating Super-human Reasoning for Web Agent
  WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
- How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
  IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
  HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
- TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
  TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.
- Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
  MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...
- Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
  Structured reflection makes error diagnosis and repair an explicit trainable step that improves reliability and reduces redundant calls in tool-using LLM agents.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.