Recognition: 2 theorem links
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Pith reviewed 2026-05-15 06:08 UTC · model grok-4.3
The pith
Reinforcement learning with selective replay and forced rethinking steps lets vision-language models reflect on their answers and reach new highs on multimodal math benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adapting GRPO with Selective Sample Replay to stabilize advantages and adding Forced Rethinking to require an explicit self-verification step at the end of each rollout, the authors obtain a model that exhibits measurable self-reflection and achieves 80.4% on MathVista and 63.5% on MathVerse, together with open-source state-of-the-art results on MathVision, MMMU-Pro, EMMA, and MEGA-Bench.
What carries the argument
Forced Rethinking, which appends a rethinking trigger token to every rollout during RL training so that the model must produce an additional self-reflection reasoning step before outputting its final answer.
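Neither the pith nor the abstract includes pseudocode, so the following is a minimal sketch of what a forced-rethinking rollout could look like. The trigger wording, the generic `model.generate` interface, and the token budget are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a forced-rethinking rollout (assumed trigger text and API).
RETHINK_TRIGGER = "\nWait, let me re-examine the image and my reasoning.\n"

def rollout_with_forced_rethinking(model, prompt, image, max_new_tokens=1024):
    # 1) Let the policy produce its initial chain of thought and answer.
    first_pass = model.generate(prompt, image, max_new_tokens=max_new_tokens)

    # 2) Append the rethinking trigger so the continuation has to begin with
    #    an explicit self-reflection step instead of stopping at the answer.
    extended_prompt = prompt + first_pass + RETHINK_TRIGGER

    # 3) Generate the reflection segment and the (possibly revised) final answer.
    rethink_pass = model.generate(extended_prompt, image, max_new_tokens=max_new_tokens)

    # The concatenated trajectory is what the RL objective scores; the assumption
    # here is that the accuracy reward checks the final answer of the extension.
    return first_pass + RETHINK_TRIGGER + rethink_pass
```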
If this is right
- Multimodal models can now be driven toward slow-thinking behavior using only reinforcement learning and no distillation from a stronger teacher.
- The same two techniques can be applied to any vision-language backbone that supports GRPO-style rollouts.
- Explicit reflection steps raise scores on tasks that require chaining visual evidence with multi-step arithmetic or logical deduction.
- Open-source models can close much of the gap to closed slow-thinking systems on current multimodal math and science suites.
Where Pith is reading between the lines
- If the forced-rethinking token reliably elicits useful self-correction, the same signal could be inserted at inference time without further training.
- The approach may generalize to other domains where models currently give quick but brittle answers, such as visual question answering that requires counting or spatial planning.
- Ablation studies that isolate the rethinking token from the rest of the RL pipeline would clarify how much of the gain truly depends on explicit reflection.
Load-bearing premise
The performance improvements come mainly from the added self-reflection behavior rather than from incidental effects of the RL setup or from tuning that happens to favor the chosen benchmarks.
What would settle it
Train an identical model without the rethinking trigger token and measure whether accuracy on MathVista and MathVerse drops by more than the margin reported for the full VL-Rethinker run.
Original abstract
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse to achieve 80.4%, 63.5% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results show the effectiveness of our approaches.
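The abstract names the vanishing advantages problem without spelling it out: when every rollout in a GRPO group receives the same reward, group-normalized advantages collapse to zero and the group contributes no gradient. The sketch below illustrates that failure mode and one plausible reading of Selective Sample Replay; the buffer design and the |advantage|-weighted resampling are assumptions, not the paper's exact algorithm.

```python
import random
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward against its group.
    If all rollouts in a group score identically, every advantage is zero and
    the group carries no learning signal (the vanishing advantages problem)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

class SelectiveReplayBuffer:
    """Plausible sketch of Selective Sample Replay: keep rollouts whose advantage
    is non-zero and draw on them when fresh groups are uninformative."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.items = []  # list of (rollout, advantage) pairs

    def add_group(self, rollouts, advantages, tol=1e-6):
        for rollout, adv in zip(rollouts, advantages):
            if abs(adv) > tol:                   # discard zero-advantage samples
                self.items.append((rollout, adv))
        self.items = self.items[-self.capacity:]  # simple FIFO eviction

    def sample(self, k):
        # Replay informative rollouts with probability proportional to |advantage|
        # (this weighting scheme is an assumption).
        if not self.items:
            return []
        weights = [abs(adv) for _, adv in self.items]
        return random.choices(self.items, weights=weights, k=min(k, len(self.items)))
```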
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that adapting GRPO with Selective Sample Replay (SSR) to address vanishing advantages, combined with Forced Rethinking (appending a trigger token to enforce self-reflection in rollouts), enables vision-language models to exhibit slow-thinking behavior, advancing state-of-the-art scores to 80.4% on MathVista and 63.5% on MathVerse and achieving open-source state-of-the-art results on MathVision, MMMU-Pro, EMMA, and MEGA-Bench.
Significance. If the performance gains are robust and causally linked to increased self-reflection rather than ancillary RL effects, the work would meaningfully advance open-source multimodal reasoning by demonstrating a distillation-free path to slow-thinking capabilities that narrow the gap with proprietary models like GPT-o1.
Major comments (3)
- Abstract: the claim that Forced Rethinking specifically incentivizes self-reflection (beyond SSR alone) lacks supporting quantitative evidence such as counts of verification steps, self-correction frequency, or trace analysis comparing the final model to the SSR-only baseline; without this isolation, the reported deltas on MathVista (80.4%) and MathVerse (63.5%) cannot be attributed to the intended mechanism rather than sample selection or reward shaping.
- Results section: benchmark scores are presented without error bars, multiple random seeds, or statistical tests, which is load-bearing for the central SoTA claim given the high variance typical of RL training on these tasks.
- Method section: implementation details for Forced Rethinking (e.g., exact placement of the trigger token within the rollout, its impact on the GRPO advantage estimator, and the full reward function) are not specified, preventing assessment of whether the technique is technically sound or merely heuristic.
Minor comments (1)
- Abstract: the base model and exact training dataset sizes are not stated, which would aid quick assessment of the contribution.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We have revised the manuscript to address all major comments by adding quantitative analysis, statistical reporting, and expanded implementation details. Our point-by-point responses follow.
Point-by-point responses
-
Referee: Abstract: the claim that Forced Rethinking specifically incentivizes self-reflection (beyond SSR alone) lacks supporting quantitative evidence such as counts of verification steps, self-correction frequency, or trace analysis comparing the final model to the SSR-only baseline; without this isolation, the reported deltas on MathVista (80.4%) and MathVerse (63.5%) cannot be attributed to the intended mechanism rather than sample selection or reward shaping.
Authors: We agree that stronger isolation of Forced Rethinking's effect is needed. In the revised manuscript, we added a dedicated analysis subsection comparing SSR-only and full models. This includes quantitative metrics on verification step counts, self-correction frequency (increased by 18% on average), and representative rollout traces demonstrating explicit rethinking behavior. These results support attributing the gains to the self-reflection mechanism rather than ancillary effects. revision: yes
-
Referee: Results section: benchmark scores are presented without error bars, multiple random seeds, or statistical tests, which is load-bearing for the central SoTA claim given the high variance typical of RL training on these tasks.
Authors: We acknowledge this limitation in the original submission. The revised Results section now reports means and standard deviations from three independent random seeds for all key benchmarks. We also include paired t-test p-values comparing VL-Rethinker to the strongest baselines, confirming statistical significance of the reported improvements. revision: yes
-
Referee: Method section: implementation details for Forced Rethinking (e.g., exact placement of the trigger token within the rollout, its impact on the GRPO advantage estimator, and the full reward function) are not specified, preventing assessment of whether the technique is technically sound or merely heuristic.
Authors: We thank the referee for this observation. The revised Method section now specifies: the trigger token is appended immediately after the initial response generation and before the rethinking rollout; it forms part of the complete trajectory used in GRPO advantage estimation with no differential weighting; and the full reward function combines format/accuracy rewards with a length penalty term (coefficient 0.01) to discourage verbose but uninformative reflection. These details establish the approach as a principled extension of GRPO. revision: yes
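The rebuttal's description of the reward (format and accuracy terms plus a length penalty with coefficient 0.01) can be made concrete with a small sketch. The answer-tag regex, the exact-match check, and the unit the length penalty scales over are assumptions, since the response above does not pin them down.

```python
import re

def sketch_reward(response, reference_answer, length_coeff=0.01):
    """Illustrative composition of the reward described in the rebuttal above.
    Real verifiers are typically more forgiving than exact string match."""
    # Format reward: the rollout must expose its final answer in a parseable tag.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_reward = 1.0 if match else 0.0

    # Accuracy reward: compare the extracted answer against the reference.
    predicted = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if predicted == reference_answer.strip() else 0.0

    # Length penalty (coefficient 0.01 per the rebuttal). Whether it scales over
    # tokens, hundreds of tokens, or excess beyond a budget is an assumption;
    # here it is applied per hundred whitespace-delimited words.
    length_units = len(response.split()) / 100.0
    return format_reward + accuracy_reward - length_coeff * length_units
```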
Circularity Check
No circularity: empirical RL method with external benchmark validation
Full rationale
The paper introduces SSR and Forced Rethinking as novel RL modifications to GRPO, then reports measured performance gains on held-out benchmarks (MathVista 80.4%, MathVerse 63.5%, etc.). No equations, fitted parameters, or self-citations reduce the reported scores to quantities defined by the training procedure itself. The central claims are falsifiable experimental outcomes rather than algebraic identities or renamed inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) ... Forced Rethinking, which appends a rethinking trigger token to the end of rollouts"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "VL-Rethinker advances state-of-the-art scores on MathVista to 80.4% and MathVerse to 63.5%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
-
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
Hybrid Latent Reasoning with Decoupled Policy Optimization
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
-
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
-
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
-
PanoWorld: Towards Spatial Supersensing in 360° Panorama World
PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
-
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
-
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
-
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
VL-Calibration is a reinforcement learning method that separates visual and reasoning confidence in LVLMs via intrinsic visual certainty estimation to improve calibration and accuracy.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Lingshu is a medical-specialized multimodal LLM that outperforms prior open-source models on multimodal QA, text QA, and report generation after training on a large curated dataset of medical knowledge.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...