pith. machine review for the scientific record.

arxiv: 2503.06749 · v4 · submitted 2025-03-09 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 3 theorem links

· Lean Theorem

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Bohan Jia, Fei Zhao, Shaohui Lin, Shaosheng Cao, Wenxuan Huang, Xu Tang, Yao Hu, Zhe Xu, Zheyu Ye, Zijie Zhai

Pith reviewed 2026-05-11 08:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords multimodal reasoning · reinforcement learning · chain of thought · vision language models · MathVista benchmark · cold start training · visual math problems

The pith

Automatically built multimodal reasoning data followed by targeted RL training activates complex visual math reasoning in MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning can elicit advanced reasoning behaviors in multimodal large language models, but only after an initial cold-start phase supplies suitable examples. The authors generate 200,000 chain-of-thought traces by bridging an existing vision-language model with a text reasoning model and applying filters to remove low-quality outputs. This dataset initializes the model, after which a progressive suppression strategy combined with group-relative policy optimization refines reasoning on a smaller set of math problems. If the method works, vision-language models gain the capacity to question, reflect, and solve image-based mathematical tasks without requiring extensive human-annotated reasoning data.
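The modality-bridging step described above can be sketched in a few lines. This is a minimal sketch under assumptions: the function names, prompt format, and return shape are illustrative placeholders, not the paper's released implementation.

```python
def build_cot_example(image, question, caption_model, reasoner):
    """Modality-bridging sketch: a vision-language model renders the image
    as text, then a text-only reasoning model produces the chain of thought
    over that description. Both models are passed in as plain callables."""
    description = caption_model(image, question)   # image -> detailed text
    prompt = f"Image description: {description}\nQuestion: {question}"
    cot, answer = reasoner(prompt)                 # text-only reasoning step
    return {"image": image, "question": question, "cot": cot, "answer": answer}
```

Each trace built this way would then pass through the filtering stage before entering the 200K cold-start set.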

Core claim

We introduce Vision-R1, a multimodal large language model trained first on a 200K automatically constructed multimodal CoT dataset called Vision-R1-cold for initialization, followed by Progressive Thinking Suppression Training using Group Relative Policy Optimization with a hard formatting reward on a 10K multimodal math dataset. This process incentivizes the emergence of complex reasoning capabilities such as questioning and reflection, leading to an average improvement of approximately 6% across multimodal math reasoning benchmarks, with the 7B version achieving 73.5% on MathVista.
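The "hard formatting reward" in the claim can be read as all-or-nothing gating on the output template: a malformed completion earns nothing, and correctness is rewarded only on top of a valid format. A minimal sketch, assuming a DeepSeek-R1-style `<think>`/`<answer>` template and hypothetical reward weights:

```python
import re

# Require the full completion to be <think>...</think><answer>...</answer>
THINK_ANSWER = re.compile(
    r"\A<think>.*?</think>\s*<answer>(.*?)</answer>\s*\Z", re.DOTALL)

def hard_format_reward(completion, gold_answer):
    """Hard formatting reward sketch: zero unless the template is followed
    exactly; the correctness bonus is granted only on valid format.
    Template and weights are assumptions, not taken from the paper."""
    m = THINK_ANSWER.match(completion)
    if not m:
        return 0.0                                  # malformed output gets nothing
    answer = m.group(1).strip()
    return 1.0 if answer == gold_answer else 0.5    # hypothetical weights
```

The "hard" gating matters for RL stability: the policy cannot collect partial credit while drifting away from the parseable reasoning format.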

What carries the argument

The two-stage pipeline of cold-start initialization on the automatically generated 200K Vision-R1-cold multimodal CoT dataset followed by Progressive Thinking Suppression Training with GRPO.
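GRPO's defining step is computing advantages relative to a group of responses sampled for the same prompt, standardizing rewards within the group instead of training a separate value model. A minimal sketch of that group-relative computation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize the rewards of the responses
    sampled for one prompt. With a learned value model removed, the group
    mean serves as the baseline."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:                      # all responses scored identically
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Responses scoring above the group mean get positive advantage and are reinforced; a prompt where every sample earns the same reward contributes no gradient signal.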

If this is right

  • Larger-scale RL training with additional multimodal math data produces further accuracy gains, as demonstrated by the 32B and 72B variants reaching 76.4% and 78.2% on MathVista.
  • Direct application of RL without the preceding cold-start dataset fails to activate complex reasoning patterns in MLLMs.
  • The method enables performance within 0.4% of leading proprietary reasoning models on standard multimodal math benchmarks while using only automatically generated data.
  • Progressive suppression during RL mitigates overthinking and supports learning of correct reasoning paths on visual math problems.
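The progressive-suppression idea in the last bullet can be sketched as a staged cap on thinking length: tight early in RL, relaxed in later stages, with over-length responses rejected. The schedule values and the rejection rule here are illustrative assumptions, not the paper's exact hyperparameters.

```python
def ptst_cap(stage, schedule=(4096, 8192, 16384)):
    """PTST sketch: the cap on thinking tokens grows across RL stages, so
    the model first learns concise correct reasoning before being allowed
    longer chains. Schedule values are hypothetical."""
    return schedule[min(stage, len(schedule) - 1)]

def length_gate(n_think_tokens, cap):
    """Responses whose thinking exceeds the current cap are rejected
    (reward multiplier forced to zero) under this sketch."""
    return 0.0 if n_think_tokens > cap else 1.0
```

Multiplying the format/accuracy reward by `length_gate(...)` at each stage is one simple way such suppression could plug into the GRPO loop.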

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same automatic dataset construction approach could be tested on non-math visual reasoning domains such as science diagrams or spatial puzzles.
  • Biases potentially introduced during automatic filtering might limit generalization to problem types underrepresented in the source models.
  • Combining the pipeline with longer context windows or additional modalities could extend the range of solvable multimodal tasks.

Load-bearing premise

The 200K multimodal CoT dataset constructed automatically via modality bridging and filtering must be of high enough quality to serve as effective cold-start data without introducing systematic errors or biases that would undermine later RL refinement.
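Whether that premise holds depends on what the automatic filter actually checks. A minimal rule-based sketch of such a filter; the criteria (format integrity, length bound via a whitespace-token proxy, answer verification) are illustrative assumptions, not the paper's actual rules:

```python
def passes_filter(trace, max_tokens=4096):
    """Rule-based filter sketch for a synthetic CoT trace. Token count is
    approximated by whitespace splitting; real pipelines would use the
    model tokenizer and likely additional checks."""
    text = trace["cot"]
    checks = [
        "<think>" in text and "</think>" in text,   # reasoning format intact
        len(text.split()) <= max_tokens,            # no runaway-length trace
        trace.get("answer_matches_gold", False),    # final answer verified
    ]
    return all(checks)
```

Note what such a filter cannot catch: a trace can be well-formed, short, and end in the right answer while the intermediate reasoning is wrong, which is exactly the systematic-bias risk the premise turns on.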

What would settle it

Retraining the base model with RL directly on the 10K math dataset without the 200K cold-start dataset and observing no activation of questioning or reflection behaviors on MathVista would show the initialization step is not required for the reported gains.
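Operationalizing "no activation of questioning or reflection" needs a measurable probe. One crude option is counting completions that contain reflection markers; the marker list below is an illustrative assumption, not a validated instrument.

```python
REFLECTION_MARKERS = ("wait", "let me re-check", "on second thought", "hmm")

def reflection_rate(completions):
    """Crude probe for the questioning/reflection behaviors the ablation
    would look for: the fraction of completions containing any marker."""
    def reflects(text):
        t = text.lower()
        return any(m in t for m in REFLECTION_MARKERS)
    return sum(reflects(c) for c in completions) / len(completions)
```

Comparing this rate (alongside accuracy) between the RL-only run and the full cold-start + RL run would make the settling experiment concrete.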

read the original abstract

DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. Scaling up the amount of multimodal math data in the RL training, Vision-R1-32B and Vison-R1-72B achieves 76.4% and 78.2% MathVista benchmark scores, respectively. The datasets and code will be released in: https://github.com/Osilly/Vision-R1 .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Vision-R1, a multimodal large language model that first constructs a 200K synthetic multimodal chain-of-thought dataset (Vision-R1-cold) via modality bridging between an existing MLLM and DeepSeek-R1 followed by filtering, uses it for cold-start supervised fine-tuning, then applies Progressive Thinking Suppression Training (PTST) and Group Relative Policy Optimization (GRPO) with a hard formatting reward on a 10K multimodal math dataset. It reports an average ~6% improvement across multimodal math reasoning benchmarks, with the 7B variant reaching 73.5% on MathVista (0.4% below OpenAI o1) and larger 32B/72B variants reaching 76.4% and 78.2%.

Significance. If the synthetic cold-start data is shown to be high-quality and the gains are attributable to the RL stage rather than data artifacts, the work would demonstrate a practical, annotation-free route to eliciting complex multimodal reasoning via RL, with clear scaling behavior to larger models. This could meaningfully advance the field by reducing reliance on human-curated reasoning traces for MLLMs.

major comments (2)
  1. [Vision-R1-cold dataset construction] Vision-R1-cold dataset construction (described in the method section following the abstract): No quantitative quality metrics, human validation error rates, or checks for factual accuracy, reasoning depth, or modality mismatches are reported for the 200K traces produced by modality bridging and filtering. This is load-bearing because the central claim attributes the ~6% benchmark gains and near-parity with o1 to the subsequent PTST+GRPO stage; without evidence that the cold-start data is free of systematic biases, downstream numbers alone cannot isolate the contribution of the proposed RL components.
  2. [Experiments] Experiments section (ablation and training details): The manuscript contains no ablation that removes the cold-start phase on the 200K dataset or trains directly with GRPO on the 10K math set from a non-reasoning base model. Such an ablation is required to test whether the reported improvements arise from PTST/GRPO or from artifacts already present in the synthetic CoT data.
minor comments (2)
  1. [Abstract] Abstract: 'Vison-R1-72B' is a typographical error and should read 'Vision-R1-72B'.
  2. [Method] The filtering criteria and exact prompts used for modality bridging are described at a high level but lack sufficient detail (e.g., specific thresholds or example traces) for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our methodology and experimental design that require clarification and strengthening. We address each major comment below and commit to revisions that improve the manuscript's rigor without altering its core contributions.

read point-by-point responses
  1. Referee: [Vision-R1-cold dataset construction] Vision-R1-cold dataset construction (described in the method section following the abstract): No quantitative quality metrics, human validation error rates, or checks for factual accuracy, reasoning depth, or modality mismatches are reported for the 200K traces produced by modality bridging and filtering. This is load-bearing because the central claim attributes the ~6% benchmark gains and near-parity with o1 to the subsequent PTST+GRPO stage; without evidence that the cold-start data is free of systematic biases, downstream numbers alone cannot isolate the contribution of the proposed RL components.

    Authors: We acknowledge that the original manuscript does not report quantitative quality metrics or human validation results for the Vision-R1-cold dataset. The construction process, detailed in the methods, uses modality bridging between an existing MLLM and DeepSeek-R1 followed by automated filtering for coherence, relevance, and format consistency. While downstream benchmark improvements and scaling behavior provide indirect support for data quality, we agree that explicit validation is necessary to isolate the RL stage's contribution. In the revised manuscript, we will add a dedicated subsection with human evaluation on a 500-sample subset, reporting error rates for factual accuracy, reasoning depth, and modality mismatches, along with inter-annotator agreement. This addition will directly address potential systematic biases. revision: yes
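The promised inter-annotator agreement is typically reported as Cohen's kappa, which corrects raw agreement for chance. A self-contained sketch for two annotators labeling the same items (the promised 500-sample protocol is the authors'; this metric choice is an assumption):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance),
    where chance agreement comes from each annotator's label frequencies."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if expected == 1.0:                 # degenerate: one category only
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance, which would undercut the claimed validation.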

  2. Referee: [Experiments] Experiments section (ablation and training details): The manuscript contains no ablation that removes the cold-start phase on the 200K dataset or trains directly with GRPO on the 10K math set from a non-reasoning base model. Such an ablation is required to test whether the reported improvements arise from PTST/GRPO or from artifacts already present in the synthetic CoT data.

    Authors: We agree that an ablation isolating the cold-start phase is valuable for attributing gains specifically to PTST and GRPO. The introduction notes that direct RL on MLLMs without reasoning initialization struggles to activate complex behaviors such as reflection. All reported RL results start from the cold-start model. To address this, the revised manuscript will include a new ablation attempting GRPO directly from the base non-reasoning MLLM on the 10K dataset, with results on training stability and final benchmark performance. This will demonstrate the practical necessity of the cold-start and confirm that the observed ~6% gains stem from the proposed RL components rather than data artifacts alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training pipeline relies on external models and benchmarks

full rationale

The paper constructs its 200K Vision-R1-cold dataset by applying an existing MLLM plus DeepSeek-R1 via modality bridging and filtering, then performs cold-start followed by PTST + GRPO on a separate 10K math set, and reports accuracy numbers on standard external benchmarks such as MathVista. No equation, prediction, or central claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the performance deltas are measured outcomes rather than tautological outputs of the input construction. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the unverified quality of the automatically generated 200K CoT dataset and the effectiveness of the newly proposed PTST schedule; no independent evidence for either is supplied beyond final benchmark numbers.

axioms (2)
  • ad hoc to paper Existing MLLMs and DeepSeek-R1 can generate high-quality multimodal chain-of-thought traces via modality bridging and filtering
    Invoked to justify the 200K Vision-R1-cold dataset used for cold-start initialization
  • ad hoc to paper Progressive Thinking Suppression Training can mitigate overthinking while preserving reasoning accuracy
    Central to the RL phase on the 10K math dataset
invented entities (2)
  • Vision-R1-cold dataset no independent evidence
    purpose: Cold-start initialization data for RL
    Synthetically constructed 200K multimodal CoT examples
  • Progressive Thinking Suppression Training (PTST) no independent evidence
    purpose: Gradual refinement of reasoning length and correctness
    New training schedule paired with GRPO

pith-pipeline@v0.9.0 · 5669 in / 1528 out tokens · 75953 ms · 2026-05-11T08:53:51.349360+00:00 · methodology


Forward citations

Cited by 53 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  2. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  3. Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

    cs.CV 2026-05 conditional novelty 7.0

    MLLMs exhibit a large perception-reasoning gap on perspective-conditioned spatial reasoning in omnidirectional images, with accuracy falling from 57% on basic direction tasks to under 1% on compositional reasoning, th...

  4. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  5. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  6. MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.

  7. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  8. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  9. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  10. Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...

  11. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  12. Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.

  13. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  14. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  15. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  16. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  17. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  18. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  19. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  20. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  21. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  22. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  23. Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

    cs.CV 2026-05 unverdicted novelty 6.0

    Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.

  24. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  25. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  26. CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.

  27. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  28. Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

    cs.CV 2026-04 unverdicted novelty 6.0

    ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...

  29. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  30. AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

    cs.CV 2026-04 unverdicted novelty 6.0

    AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.

  31. Generalization in LLM Problem Solving: The Case of the Shortest Path

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

  32. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  33. GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

    cs.CL 2026-04 unverdicted novelty 6.0

    GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.

  34. AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

  35. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  36. ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

    cs.CL 2026-04 unverdicted novelty 6.0

    ChemVLR prioritizes reasoning in perception for chemical VLMs by identifying descriptors such as functional groups before generating answers, using a 760k curated dataset and three-stage training to reach SOTA performance.

  37. Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    SciTikZer-8B uses a new dataset, benchmark, and dual self-consistency RL to generate TikZ code for scientific graphics, outperforming much larger models like Gemini-2.5-Pro.

  38. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  39. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  40. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  41. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  42. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  43. SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.

  44. Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.

  45. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  46. Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.

  47. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  48. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  49. SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

    cs.CV 2026-04 unverdicted novelty 5.0

    SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

  50. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  51. Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpr...

  52. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  53. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.