hub Canonical reference

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu · 2025 · cs.CV · arXiv 2503.07365

Canonical reference. 83% of citing Pith papers cite this work as background.

69 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 69 citing papers arXiv PDF

abstract

DeepSeek R1, and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been limited by the limited difficulty of selected tasks and relatively small training scales, making it challenging to demonstrate strong multimodal reasoning abilities. To address this gap, we introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model employing rule-based reinforcement learning on MMK12, utilizing online filtering and two-stage training strategy to enhance training stability. MM-EUREKA demonstrates remarkable performance gains in multimodal mathematical reasoning, outperforming previous powerful models like InternVL2.5-78B or InternVL2.5-38B-MPO. In particular, MM-EUREKA achieves competitive or superior performance compared to both open-source and closed-source models, and trails slightly behind o1 in multidisciplinary reasoning tasks. We open-source our complete pipeline to foster further research in this area. We release all our codes, models, data, etc. at https://github.com/ModalMinds/MM-EUREKA

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 baseline 1 dataset 1 method 1

citation-polarity summary

background 10 baseline 1 use dataset 1

representative citing papers

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

cs.AI · 2026-05-12 · unverdicted · novelty 8.0

MM-OptBench is a solver-grounded benchmark showing current multimodal LLMs reach at most 52% pass@1 on generating correct optimization models from text-plus-visual problem specifications.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

Reasoning VLMs show lower robustness to semantic visual distractions than to perceptual corruptions, with distractions entering their reasoning chains and causing errors.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

CurveBench is a new benchmark for recovering rooted containment trees from images of nested Jordan curves, where the strongest model reaches only 19.1% accuracy on hard cases and fine-tuning lifts an open model to 33.3% on easy cases.

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.

Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

cs.AI · 2026-01-14 · unverdicted · novelty 7.0

Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

cs.CV · 2026-01-07 · unverdicted · novelty 7.0

GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

cs.AI · 2025-03-17 · conditional · novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0 · 2 refs

SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

cs.CV · 2025-05-20 · unverdicted · novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

MER-R1 uses dual-objective RL to optimize fast-thinking recall and slow-thinking precision separately in multimodal emotion recognition, with calibration to align them, yielding SOTA results on two benchmarks.

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

VEPO improves RL for visual reasoning by multiplicatively coupling visual sensitivity with token entropy, outperforming entropy-only baselines by 2.28 points at 7B and 3.15 points at 3B scale.

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Stopping large reasoning models at the first correct reasoning prefix improves accuracy up to 21% by avoiding harmful overthinking that destabilizes correct trajectories.

citing papers explorer

Showing 50 of 69 citing papers.

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling cs.AI · 2026-05-12 · unverdicted · none · ref 2
MM-OptBench is a solver-grounded benchmark showing current multimodal LLMs reach at most 52% pass@1 on generating correct optimization models from text-plus-visual problem specifications.
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV · 2026-04-23 · unverdicted · none · ref 12
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients cs.CL · 2026-06-16 · unverdicted · none · ref 101 · internal anchor
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text cs.AI · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions? cs.CV · 2026-06-08 · unverdicted · none · ref 8 · internal anchor
Reasoning VLMs show lower robustness to semantic visual distractions than to perceptual corruptions, with distractions entering their reasoning chains and causing errors.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era cs.LG · 2026-05-17 · unverdicted · none · ref 33 · internal anchor
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both cs.CV · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves cs.CV · 2026-05-13 · unverdicted · none · ref 43 · 2 links · internal anchor
CurveBench is a new benchmark for recovering rooted containment trees from images of nested Jordan curves, where the strongest model reaches only 19.1% accuracy on hard cases and fine-tuning lifts an open model to 33.3% on easy cases.
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning cs.CL · 2026-05-13 · unverdicted · none · ref 32 · internal anchor
Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning cs.AI · 2026-01-14 · unverdicted · none · ref 1 · internal anchor
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization cs.CV · 2026-01-07 · unverdicted · none · ref 8 · internal anchor
GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 27 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents cs.CV · 2025-04-14 · unverdicted · none · ref 20 · internal anchor
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 27 · internal anchor
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 9
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 18 · 2 links
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 32
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning cs.CV · 2025-05-20 · unverdicted · none · ref 14
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning cs.CL · 2026-06-29 · unverdicted · none · ref 34 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy cs.AI · 2026-06-26 · unverdicted · none · ref 22 · internal anchor
MER-R1 uses dual-objective RL to optimize fast-thinking recall and slow-thinking precision separately in multimodal emotion recognition, with calibration to align them, yielding SOTA results on two benchmarks.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models cs.CV · 2026-06-16 · unverdicted · none · ref 93 · internal anchor
SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization cs.AI · 2026-06-05 · unverdicted · none · ref 12 · internal anchor
PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.
Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection cs.AI · 2026-06-02 · unverdicted · none · ref 5 · internal anchor
VEPO improves RL for visual reasoning by multiplicatively coupling visual sensitivity with token entropy, outperforming entropy-only baselines by 2.28 points at 7B and 3.15 points at 3B scale.
Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models cs.AI · 2026-06-01 · unverdicted · none · ref 23 · internal anchor
Stopping large reasoning models at the first correct reasoning prefix improves accuracy up to 21% by avoiding harmful overthinking that destabilizes correct trajectories.
Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains cs.CV · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
Tool use adds minimal new problem-solving ability in the tested multimodal agents, as 93-96% of tool-solved cases are also solved without tools.
TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL cs.AI · 2026-06-01 · unverdicted · none · ref 25 · internal anchor
TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.
Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization cs.CV · 2026-05-28 · unverdicted · none · ref 17 · internal anchor
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution cs.CV · 2026-05-25 · unverdicted · none · ref 28 · internal anchor
AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 19 · 2 links · internal anchor
ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV · 2026-05-15 · unverdicted · none · ref 22 · internal anchor
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone cs.LG · 2026-05-12 · conditional · none · ref 26 · 2 links · internal anchor
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models cs.CV · 2026-03-15 · unverdicted · none · ref 18 · internal anchor
Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling cs.CV · 2025-11-25 · unverdicted · none · ref 31 · internal anchor
LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding cs.CV · 2025-11-17 · unverdicted · none · ref 30 · internal anchor
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
Perception-Aware Policy Optimization for Multimodal Reasoning cs.CL · 2025-07-08 · unverdicted · none · ref 18 · internal anchor
PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.
v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning cs.CL · 2025-05-24 · unverdicted · none · ref 4 · internal anchor
v1 adds a point-and-copy mechanism for dynamic visual token referencing in multimodal reasoning, trained on a new 300K dataset with grounding annotations, and outperforms baselines on multimodal math tasks.
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning cs.AI · 2025-03-27 · accept · none · ref 8 · internal anchor
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 48 · internal anchor
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning cs.CV · 2025-03-10 · conditional · none · ref 27 · internal anchor
MM-Eureka models trained via rule-based RL on the new MMK12 dataset achieve competitive or superior multimodal mathematical reasoning performance compared to models like InternVL2.5-78B while trailing slightly behind o1.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology cs.CV · 2026-05-11 · unverdicted · none · ref 82
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV · 2026-05-01 · unverdicted · none · ref 56 · 2 links
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 44
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 2
VL-Calibration is a reinforcement learning method that separates visual and reasoning confidence in LVLMs via intrinsic visual certainty estimation to improve calibration and accuracy.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward cs.CV · 2026-04-06 · unverdicted · none · ref 49
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 90
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 38
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression cs.CV · 2026-07-02 · unverdicted · none · ref 41 · internal anchor
LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.
MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning cs.AI · 2026-06-16 · unverdicted · none · ref 50 · internal anchor
MathVis-Fine proposes a dataset with fine-grained visual annotations and dependency ratings plus a progressive two-stage training paradigm to align visual supervision with sample-specific necessity in multimodal mathematical reasoning.
From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning cs.LG · 2026-06-08 · unverdicted · none · ref 73 · internal anchor
Thinking-RFT improves Theory of Mind accuracy by 6% over SFT on shortcut-free datasets, with 10% gains on higher-order reasoning and better generalization to new domains.

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer