MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu · 2025 · cs.CV · arXiv 2503.07365

23 Pith papers cite this work. Polarity classification is still indexing.

abstract

DeepSeek R1 and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been constrained by the limited difficulty of the selected tasks and by relatively small training scales, making it challenging to demonstrate strong multimodal reasoning abilities. To address this gap, we introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model employing rule-based reinforcement learning on MMK12, utilizing online filtering and a two-stage training strategy to enhance training stability. MM-EUREKA demonstrates remarkable performance gains in multimodal mathematical reasoning, outperforming previous powerful models like InternVL2.5-78B or InternVL2.5-38B-MPO. In particular, MM-EUREKA achieves competitive or superior performance compared to both open-source and closed-source models, and trails slightly behind o1 in multidisciplinary reasoning tasks. We open-source our complete pipeline to foster further research in this area, releasing all code, models, and data at https://github.com/ModalMinds/MM-EUREKA

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

cs.AI · 2026-05-12 · unverdicted · novelty 8.0

MM-OptBench is a solver-grounded benchmark showing current multimodal LLMs reach at most 52% pass@1 on generating correct optimization models from text-plus-visual problem specifications.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0 · 2 refs

SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

cs.CV · 2025-05-20 · unverdicted · novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

cs.CV · 2025-03-10 · conditional · novelty 6.0

MM-Eureka models trained via rule-based RL on the new MMK12 dataset achieve competitive or superior multimodal mathematical reasoning performance compared to models like InternVL2.5-78B while trailing slightly behind o1.

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

VL-Calibration is a reinforcement learning method that separates visual and reasoning confidence in LVLMs via intrinsic visual certainty estimation to improve calibration and accuracy.

Visually-Guided Policy Optimization for Multimodal Reasoning

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

cs.CV · 2025-04-10 · unverdicted · novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

cs.AI · 2026-04-21 · unverdicted · novelty 5.0

DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

cs.AI · 2026-04-14 · unverdicted · novelty 5.0 · 2 refs

DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.

SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

cs.CV · 2026-04-12 · unverdicted · novelty 5.0

SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11 · unverdicted · novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

cs.LG · 2026-05-12

citing papers explorer

Showing 23 of 23 citing papers.

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling · cs.AI · 2026-05-12 · unverdicted · none · ref 2
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images · cs.CV · 2026-04-23 · unverdicted · none · ref 12
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning · cs.CV · 2026-05-10 · unverdicted · none · ref 9
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning · cs.AI · 2026-05-10 · unverdicted · none · ref 18 · 2 links
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding · cs.CV · 2026-04-24 · unverdicted · none · ref 32
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning · cs.CV · 2025-05-20 · unverdicted · none · ref 14
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning · cs.CV · 2025-03-10 · conditional · none · ref 27 · internal anchor
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology · cs.CV · 2026-05-11 · unverdicted · none · ref 82
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs · cs.CV · 2026-05-01 · unverdicted · none · ref 56 · 2 links
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution · cs.CV · 2026-04-24 · unverdicted · none · ref 22
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models · cs.CV · 2026-04-22 · unverdicted · none · ref 44
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning · cs.CV · 2026-04-10 · unverdicted · none · ref 2
Visually-Guided Policy Optimization for Multimodal Reasoning · cs.CV · 2026-04-10 · unverdicted · none · ref 4
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward · cs.CV · 2026-04-06 · unverdicted · none · ref 49
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency · cs.CV · 2025-08-25 · unverdicted · none · ref 90
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model · cs.CV · 2025-04-10 · unverdicted · none · ref 38
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding · cs.LG · 2026-04-23 · unverdicted · none · ref 57
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling · cs.AI · 2026-04-21 · unverdicted · none · ref 33
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding · cs.AI · 2026-04-14 · unverdicted · none · ref 42 · 2 links
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units · cs.CV · 2026-04-12 · unverdicted · none · ref 19
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models · cs.AI · 2026-04-11 · unverdicted · none · ref 101
From System 1 to System 2: A Survey of Reasoning Large Language Models · cs.AI · 2025-02-24 · accept · none · ref 257
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone · cs.LG · 2026-05-12 · unreviewed · ref 26
