pith. machine review for the scientific record.

arxiv: 2504.11468 · v1 · pith:O672LEEWnew · submitted 2025-04-10 · 💻 cs.CL

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Pith reviewed 2026-05-17 15:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords supervised fine-tuning · reinforcement learning · vision-language models · reasoning paths · pseudo reasoning · multimodal reasoning · training order

The pith

SFT induces pseudo reasoning paths that undermine subsequent RL in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the common practice of using supervised fine-tuning before reinforcement learning to train large vision-language models for step-by-step reasoning. It argues that SFT causes models to imitate expert traces in ways that produce long, hesitant, and sometimes inaccurate steps. These imitative patterns then interfere with the model's ability to develop more flexible and effective reasoning during the RL phase. The authors support this by creating a new dataset of visual reasoning examples and running controlled comparisons of training orders and methods. Their findings indicate that RL approaches can foster more natural reasoning when not preceded by standard SFT.

Core claim

The paper establishes that supervised fine-tuning on expert-generated reasoning traces induces pseudo reasoning paths. These paths may appear similar to native reasoning but consist of prolonged, hesitant, less informative steps and incorrect reasoning. This imitation helps models learn output formats yet locks them into rigid modes that reduce the gains from later reinforcement learning. In contrast, reinforcement learning that optimizes directly against a reward combining perception and cognition signals produces more adaptive reasoning behavior without the same imitative constraints.
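
The RL side of this contrast can be illustrated with a minimal sketch of a group-relative advantage computed from a mixed reward. Standardizing rewards within a group of responses to the same prompt follows the usual GRPO formulation; the function names, the equal weighting of the perception and cognition terms, the use of the population standard deviation, and the example scores are all assumptions for illustration, not the paper's reward module.

```python
from statistics import mean, pstdev

def mixed_reward(perception: float, cognition: float, w_perception: float = 0.5) -> float:
    """Hypothetical convex combination of two reward signals in [0, 1]."""
    return w_perception * perception + (1.0 - w_perception) * cognition

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored with made-up (perception, cognition) pairs.
scores = [(1.0, 0.8), (0.0, 0.6), (1.0, 0.2), (0.0, 0.0)]
rewards = [mixed_reward(p, c) for p, c in scores]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and the rest negative ones, so the policy is pushed toward whatever the reward favors rather than toward imitating a fixed expert trace.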

What carries the argument

Pseudo reasoning paths: imitative step-by-step traces copied from expert models during supervised fine-tuning that resemble but fail to match the flexible, informative reasoning that reinforcement learning can develop.

If this is right

  • SFT teaches basic reasoning formats but restricts models from improving beyond imitative patterns in subsequent RL stages.
  • RL with rewards that combine perception accuracy and cognitive step quality supports the emergence of genuine adaptive reasoning.
  • Models trained without prior SFT or with careful RL setups reach higher accuracy on visual reasoning benchmarks than those following the standard SFT-first sequence.
  • The order of training methods matters because early imitation can embed structural habits that later optimization struggles to overwrite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curating SFT data to remove or shorten hesitant steps might reduce the negative carry-over effect into RL without discarding the format-learning benefit.
  • The same imitation problem could appear when training language-only models on chain-of-thought data, suggesting the finding is not limited to vision inputs.
  • Reusing the six-step dataset pipeline for new domains would allow direct tests of whether the pseudo-path effect generalizes beyond the original visual tasks.

Load-bearing premise

The performance differences between SFT-then-RL and RL-only training arise specifically from the induction of pseudo reasoning paths rather than from differences in data difficulty, reward design, or training details.

What would settle it

Run SFT-then-RL and RL-only training on the exact same data splits, reward functions, and hyperparameters, then inspect the generated reasoning traces for hesitation length and error rates; if the performance gap vanishes or hesitation does not appear after SFT, the central claim would not hold.
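
The trace-inspection part of that check could be sketched as follows. The marker list, the whitespace word count, and the two example traces are assumptions chosen for illustration; the paper does not define hesitation in this way.

```python
import re

# Hypothetical hesitation markers; not taken from the paper.
HESITATION_MARKERS = ("wait", "hmm", "let me check", "actually", "alternatively")

def hesitation_stats(trace: str) -> dict:
    """Count hesitation markers and words in one reasoning trace."""
    text = trace.lower()
    n_words = len(text.split())
    n_markers = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", text))
        for marker in HESITATION_MARKERS
    )
    rate = n_markers / n_words if n_words else 0.0
    return {"words": n_words, "markers": n_markers, "rate": rate}

def mean_rate(traces: list[str]) -> float:
    """Average per-trace hesitation rate for one training regime."""
    if not traces:
        return 0.0
    return sum(hesitation_stats(t)["rate"] for t in traces) / len(traces)

# Made-up traces standing in for outputs of the two regimes.
sft_then_rl = ["Wait, let me check the angle again. Hmm, actually it is 54 degrees."]
rl_only = ["The radius is perpendicular to the tangent, so the angle is 54 degrees."]
```

Comparing `mean_rate(sft_then_rl)` with `mean_rate(rl_only)` on matched prompts gives a crude hesitation signal; a real analysis would also need error annotation of the reasoning steps.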

Original abstract

This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on the Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that SFT can significantly undermine subsequent RL for training reasoning LVLMs by inducing 'pseudo reasoning paths' (prolonged, hesitant, less informative, or incorrect traces imitated from experts). It introduces the VLAA-Thinking dataset, constructed via a six-step pipeline of captioning, reasoning distillation, answer rewrite, and verification, with an SFT split and a more challenging RL split. Experiments compare SFT, RL, and combinations using GRPO with a novel mixed perception+cognition reward; the resulting VLAA-Thinker (Qwen2.5VL 3B) achieves top-1 on the Open LMM Reasoning Leaderboard among 4B-scale models, surpassing prior SOTA by 1.8%.

Significance. If substantiated, the result would challenge the standard SFT-then-RL pipeline for multimodal reasoning models and emphasize the value of direct RL with carefully designed rewards for fostering adaptive rather than imitative behavior. The introduction of VLAA-Thinking and the leaderboard result constitute concrete contributions to the empirical study of training order effects in LVLMs.

major comments (1)
  1. The central attribution of the SFT-then-RL vs. RL-only performance gap to induction of pseudo-reasoning paths is load-bearing for the main claim, yet the manuscript provides no explicit statement that the two regimes were matched on data difficulty distribution, exact formulation of the mixed reward, GRPO hyperparameters, learning-rate schedules, or batching. Without such controls, alternative explanations (data difficulty, reward design, or hyperparameter differences) cannot be ruled out.
minor comments (1)
  1. The abstract and experimental description mention 'extensive experiments' and leaderboard gains but omit details on baseline controls, statistical significance testing, or ablations isolating the mixed reward; these should be added for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concern regarding experimental controls is important for strengthening the attribution of results to pseudo-reasoning paths, and we address it directly below with plans for revision.

Point-by-point responses
  1. Referee: The central attribution of the SFT-then-RL vs. RL-only performance gap to induction of pseudo-reasoning paths is load-bearing for the main claim, yet the manuscript provides no explicit statement that the two regimes were matched on data difficulty distribution, exact formulation of the mixed reward, GRPO hyperparameters, learning-rate schedules, or batching. Without such controls, alternative explanations (data difficulty, reward design, or hyperparameter differences) cannot be ruled out.

    Authors: We agree that explicit documentation of controls is essential to support the central claim. In our experiments, the SFT-then-RL and RL-only regimes were matched on the following: both drew from the identical VLAA-Thinking dataset with the same difficulty distribution (using the SFT split for format learning where applicable and the more challenging RL split for the reinforcement phase in both settings); the mixed perception+cognition reward was formulated and weighted identically; and GRPO was executed with the same hyperparameters, learning-rate schedules, and batch sizes. We will revise the manuscript by adding a dedicated subsection (and accompanying table) in the Experiments section that explicitly lists these matched settings. This addition will directly address the referee's concern and help rule out alternative explanations.

    revision: yes
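
A matched-settings claim of this kind can be checked mechanically. The sketch below compares two run configurations and reports every field that differs outside an explicitly allowed set; all field names and values are hypothetical placeholders, not settings reported in the paper.

```python
def config_mismatches(run_a: dict, run_b: dict, allowed: frozenset = frozenset()) -> dict:
    """Return {key: (value_a, value_b)} for keys that differ and are not allowed to."""
    keys = (run_a.keys() | run_b.keys()) - allowed
    return {
        k: (run_a.get(k), run_b.get(k))
        for k in sorted(keys)
        if run_a.get(k) != run_b.get(k)
    }

# Placeholder settings for illustration only.
sft_then_rl = {"init": "sft_checkpoint", "rl_split": "rl", "reward": "mixed",
               "lr": 1e-6, "batch_size": 128}
rl_only = {"init": "base_model", "rl_split": "rl", "reward": "mixed",
           "lr": 1e-6, "batch_size": 128}

# The initial checkpoint is the one intended difference between the regimes.
unexpected = config_mismatches(sft_then_rl, rl_only, allowed=frozenset({"init"}))
```

An empty `unexpected` dictionary means the two regimes differ only in the fields declared as intentional, which is the property the referee asks the manuscript to document.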

Circularity Check

0 steps flagged

No significant circularity in empirical comparison study

full rationale

This paper is an empirical investigation that introduces the VLAA-Thinking dataset via a six-step pipeline and compares SFT, RL, and combined training regimes on LVLMs using GRPO with a mixed perception-cognition reward. The central claims rest on experimental results, qualitative inspection of reasoning traces, and leaderboard performance rather than any mathematical derivation chain, equations, or fitted parameters that reduce to inputs by construction. No self-definitional steps, uniqueness theorems, or ansatzes imported via self-citation are present in the provided text. The work is self-contained against external benchmarks such as the Open LMM Reasoning Leaderboard and is replicable given the dataset and model details.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study on training paradigms; no mathematical axioms, free parameters, or invented entities beyond standard machine-learning assumptions about reward signals and policy optimization.

pith-pipeline@v0.9.0 · 5644 in / 1115 out tokens · 92163 ms · 2026-05-17T15:39:06.693023+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  2. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  3. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  4. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    cs.CV 2025-12 unverdicted novelty 7.0

    DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

  5. Asking like Socrates: Socrates helps VLMs understand remote sensing images

    cs.CV 2025-11 unverdicted novelty 7.0

    RS-EoT uses a SocraticAgent self-play system and two-stage RL to train VLMs for genuine iterative reasoning and visual inspection on remote sensing VQA and grounding tasks, achieving SOTA results.

  6. Latent Visual Reasoning

    cs.CV 2025-09 unverdicted novelty 7.0

    Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.

  7. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  8. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  9. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  10. Generalization in LLM Problem Solving: The Case of the Shortest Path

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

  11. Watch Before You Answer: Learning from Visually Grounded Post-Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

  12. Teaching an Agent to Sketch One Part at a Time

    cs.AI 2026-03 unverdicted novelty 6.0

    A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.

  13. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  14. WebSailor: Navigating Super-human Reasoning for Web Agent

    cs.CL 2025-07 conditional novelty 6.0

    WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.

  15. How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

    cs.AI 2026-05 unverdicted novelty 5.0

    IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.

  16. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  17. TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

    q-bio.QM 2025-11 unverdicted novelty 5.0

    TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.

  18. Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

    cs.AI 2025-09 unverdicted novelty 5.0

    MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...

  19. Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

    cs.CV 2025-09 unverdicted novelty 5.0

    Structured reflection makes error diagnosis and repair an explicit trainable step that improves reliability and reduces redundant calls in tool-using LLM agents.

  20. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
