{"total":10,"items":[{"citing_arxiv_id":"2605.12034","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","primary_cat":"cs.MM","submitted_at":"2026-05-12T12:16:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Our mixed bi-modal SFT baseline starts from Qwen2.5-Omni-3B [1] and is output-token balanced across four sources: audio-text (1B output tokens), image-text (1B), video-text (1B), and pure text (1B). The audio-text, image-text, and pure-text portions are drawn from internal datasets. The video-text portion combines four open-source video corpora: Video-R1-data [37], VideoAuto-R1-Data [38], ShareGPT4Video [39], and LLaVA-Video-178K [15]. Because these corpora partially overlap, we deduplicate exact matches at the video-query level while retaining multiple distinct queries for the same video when appropriate. 
We then rewrite the video CoTs with Qwen2.5-VL-235B [40], add dense full-video captions derived from 30-second segments, and discard examples"},{"citing_arxiv_id":"2605.09614","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-10T15:53:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"long-chain reasoning, such as GRPO [24] and DAPO [25], and extends them to multimodal reasoning through methods such as Vision-R1 [8] and MM-Eureka [9]. Beyond final-answer rewards, VL-Rethinker [26], SRPO [27], and Mulberry [28] improve intermediate trajectories through reflection or refinement. More recent perception-aware methods, including PAPO [21], VPPO [22], PeRL [29], and VAPO [30], push the policy toward stronger perceptual grounding and visual utilization. This yields a clean test-time interface: normal decoding, no visual re-injection. Yet the mechanism remains implicit: rewards promote grounding, but do not locate visual forgetting or trace how a local correction propagates through later reasoning. 
3 Problem Formulation: Selective Policy Reshaping for Visual Retention"},{"citing_arxiv_id":"2604.20806","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model","primary_cat":"cs.CV","submitted_at":"2026-04-22T17:37:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[53] Zaharov Timur, Konstantin Korolev, and Aleksandr Nikolich. Physics big, 2024. URL https://huggingface.co/datasets/Vikhrmodels/physics_big. [54] Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, et al. Thinking with video: Video generation as a promising multimodal reasoning paradigm. arXiv preprint arXiv:2511.04570, 2025. [55] Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning. arXiv preprint arXiv:2506.01713, 2025. [56] Fei Wang, Xingyu Fu, James Y. 
Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chun-"},{"citing_arxiv_id":"2511.19972","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boosting Reasoning in Large Multimodal Models via Activation Replay","primary_cat":"cs.CV","submitted_at":"2025-11-25T06:31:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.13026","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-11-17T06:25:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.11113","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VIDEOP2R: Video Understanding from Perception to Reasoning","primary_cat":"cs.CV","submitted_at":"2025-11-14T09:42:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of 
seven video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00710","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models","primary_cat":"cs.AI","submitted_at":"2025-11-01T21:19:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLVR on synthetic mazes enables VLMs to solve spatial reasoning tasks unreachable by the base model and generalizes to real-world navigation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.14738","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning","primary_cat":"cs.CL","submitted_at":"2025-10-16T14:40:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoRubric generates rubric-based process rewards from self-aggregated successful trajectories to improve faithful multimodal reasoning in MLLMs under RLVR without human annotation or teacher models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25454","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search","primary_cat":"cs.AI","submitted_at":"2025-09-29T20:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% 
average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.06448","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perception-Aware Policy Optimization for Multimodal Reasoning","primary_cat":"cs.CL","submitted_at":"2025-07-08T23:22:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}