hub

org/abs/2501.06186

· 2025 · arXiv 2501.06186

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

cs.AI · 2025-03-17 · conditional · novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

cs.AI · 2025-03-07 · conditional · novelty 7.0

RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

cs.CL · 2026-05-19 · conditional · novelty 6.0

Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

cs.MM · 2026-04-25 · unverdicted · novelty 6.0

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

cs.CV · 2026-02-07 · unverdicted · novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

cs.CV · 2025-09-27 · unverdicted · novelty 6.0

DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

cs.CV · 2025-03-21 · conditional · novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

cs.AI · 2025-09-26 · unverdicted · novelty 5.0

MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.

Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

cs.CV · 2025-06-07 · unverdicted · novelty 4.0

Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

cs.CV · 2025-03-13 · unverdicted · novelty 4.0

R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 36
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 37
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model cs.AI · 2025-03-07 · conditional · none · ref 14
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models cs.CL · 2026-05-19 · conditional · none · ref 23
Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 68
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

org/abs/2501.06186

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer