Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
Multimodal Large Language Models (MLLMs) perform well in single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multi-modal instructions. To address this, we adopt a reinforcement learning (RL) based post-training strategy for MLLMs in multi-image grounding tasks. We first synthesize high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). Subsequently, we apply rejection sampling with the merged SFT model to curate reliable RL data and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, achieving +9.04% on MIG-Bench and +4.41% on average across seven out-of-domain benchmarks.
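The "rule-based RL" step in the abstract typically means scoring sampled completions with simple, verifiable rewards rather than a learned reward model. A minimal sketch of such a reward for grounding, assuming a GRPO-style format reward (a `<think>…</think><answer>…</answer>` template) plus an IoU-based accuracy reward — the tag template, the `[x1, y1, x2, y2]` box format, and the 0.5 threshold are illustrative assumptions, not details from the paper:

```python
import re

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rule_based_reward(completion, gt_box, iou_threshold=0.5):
    """Hypothetical two-part reward:
    - format reward (1.0) if the completion follows the assumed
      <think>...</think><answer>...</answer> template;
    - accuracy reward (1.0) if the box parsed from <answer>
      overlaps the ground truth above the IoU threshold.
    """
    fmt_ok = re.fullmatch(r"<think>.*?</think>\s*<answer>.*?</answer>",
                          completion.strip(), flags=re.S) is not None
    fmt_reward = 1.0 if fmt_ok else 0.0

    acc_reward = 0.0
    m = re.search(r"<answer>.*?\[([\d.,\s]+)\].*?</answer>", completion, flags=re.S)
    if m:
        try:
            pred = [float(v) for v in m.group(1).split(",")]
            if len(pred) == 4 and iou(pred, gt_box) >= iou_threshold:
                acc_reward = 1.0
        except ValueError:
            pass  # unparseable box -> no accuracy reward
    return fmt_reward + acc_reward
```

The same rule can double as the rejection-sampling filter mentioned in the abstract: keep only SFT-model rollouts whose reward hits the maximum, then train RL on the curated set.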
Forward citations
Cited by 2 Pith papers
- Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
  Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.
- RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
  RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT cold-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.