Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
hub
Srpo: Enhancing multimodal llm reasoning via reflection-aware rein- forcement learning.arXiv preprint arXiv:2506.01713
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
RLVR on synthetic mazes enables VLMs to solve spatial reasoning tasks unreachable by the base model and generalizes to real-world navigation benchmarks.
AutoRubric generates rubric-based process rewards from self-aggregated successful trajectories to improve faithful multimodal reasoning in MLLMs under RLVR without human annotation or teacher models.
DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.
PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.
citing papers explorer
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.