{"total":20,"items":[{"citing_arxiv_id":"2605.25571","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution","primary_cat":"cs.CV","submitted_at":"2026-05-25T08:26:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09614","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-10T15:53:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"does RAPO increase downstream visual-dependence signals along generated trajectories? RQ3:Ablation and Robustness Analysis.How sensitive is RAPO to design choices such as KL win- dow length and KL coefficient, and how robust is it across reasoning length and model architecture? 5.1 Experimental Setup All experimental details are provided in Appendix C.1. Models and Baselines.We adopt LVLMs from the Qwen family [ 31, 32, 7] as our base models. 
To ensure fair comparison, we group baselines by base model and restrict comparisons to methods built on the same base model.(i) Qwen3-VL-Instruct.We evaluate our method on the Qwen3-VL- 2B-Instruct and 8B-Instruct models [7], and compare it with representative RL-based optimization methods, including GRPO [24], PAPO [21], and VPPO [22]."},{"citing_arxiv_id":"2604.24339","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection","primary_cat":"cs.CV","submitted_at":"2026-04-27T11:31:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing them through step-by-step reasoning. This paradigm has been successfully extended to VLMs, giving rise to the research framework of Multimodal Chain-of-Thought (MCoT). Early MCoT methods adopted a \"perceive-then- reason\" paradigm, but decoupling vision and language of- ten led to loss of critical visual details. Recent works like LLaV A-CoT [54], Virgo [14], and Mulberry [55] intro- duced \"Slow Thinking\" mechanisms to improve systematic reasoning, yet they are commonly task-specific and gener- alize poorly in open-domain settings. GThinker [59] pro- posed a more flexible, prompt-driven approach with vision- guided reflection to improve cross-task generalization and interpretability. 
Recent MCoT [25, 44, 48] methods have"},{"citing_arxiv_id":"2605.04064","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving Medical VQA through Trajectory-Aware Process Supervision","primary_cat":"cs.LG","submitted_at":"2026-04-10T21:13:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04500","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward","primary_cat":"cs.CV","submitted_at":"2026-04-06T07:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"MiniCPM-Llama-V-2.5-8B [83] 19.6 77.2 86.7 2025 45.6 51.8 - - 89.2 - MiniCPM-V-2.6-8B [83] 27.2 78.0 83.2 2348 - 57.5 82.4 - 96.7 - IXC-2.5 [90] - 82.2 87.9 2229 50.0 59.9 82.2 - 96.6 - Reasoning Models LLaV A-CoT-11B [79] - 75.0 - - - 54.9 - - - - LLaV A-Reasoner-8B [92] - - - - - - 83.0 - - - Insight-V-8B [18] 24.9 82.3 - 2312 - 61.5 81.5 - - - Mulberry-7B [82] - - - 2396 - 61.3 83.9 - - - Vision-R1-LlamaV-CI-11B [31] - - - 2190 - 61.4 83.9 - - - Saliency-R1 Qwen2.5-VL-3B [6] 30.8 77.8 87.5 2206 52.0 56.1 83.6 35.7 81.2 40.4 + SFT 29.9 75.8 87.1 2351 56.0 55.7 84.0 38.1 90.5 51.7 + Saliency-R1 31.1 76.7 87.5 2235 57.7 
56.0 84.2 39.9 91.2 59.6 Qwen2.5-VL-7B [6] 36.2 82.8 86.7 2302 58.7 62.4 84.0 37.5 88.2 49."},{"citing_arxiv_id":"2604.03179","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models","primary_cat":"cs.LG","submitted_at":"2026-04-03T16:56:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, these mod- els still struggle with complex tasks such as visual mathe- matical reasoning [31, 41], which demand multi-step and reasoning capabilities. To address this limitation, recent studies have explored various approaches to enhance the reasoning capabilities of MLLMs, including long chain-of- thought generation [28, 42], Monte Carlo tree search [39], and latent-space reasoning [38]. Inspired by the success of reasoning LLMs such as DeepSeek-R1 [7], numerous re- cent works [3, 8, 19, 20, 25, 29, 33, 36, 37] have investigated reinforcement learning (RL)-based post-training strategies for MLLMs through rule-based reward optimization. 
While the adoption of reinforcement learning has demonstrated"},{"citing_arxiv_id":"2604.03318","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-01T15:28:13+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"multi-stage alignment [3, 20, 29, 40] and instruction tun- ing [5, 61], together with large-scale visual instruction datasets [20, 37], have been proposed to develop stronger open-source multimodal models [2, 3, 20, 29, 36, 40, 53]. To further enhance multimodal reasoning, LLaV A-CoT [48] structures reasoning into four stages for step-by-step in- ference, while Mulberry [51] leverages a collective Monte Carlo tree search to learn from explicit reasoning trees. In addition, reinforcement learning methods inspired by DeepSeek-R1 [14] have been introduced to strengthen gen- eral reasoning [12, 16, 18, 28, 33, 50, 56], further pushing the reasoning capabilities of MLLMs. 
Despite these advancements, current MLLMs still strug-"},{"citing_arxiv_id":"2512.14044","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2025-12-16T03:19:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.14738","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning","primary_cat":"cs.CL","submitted_at":"2025-10-16T14:40:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoRubric generates rubric-based process rewards from self-aggregated successful trajectories to improve faithful multimodal reasoning in MLLMs under RLVR without human annotation or teacher models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22746","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","primary_cat":"cs.AI","submitted_at":"2025-09-26T04:33:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework 
with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23678","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded Reinforcement Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.09925","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","primary_cat":"cs.CV","submitted_at":"2025-04-14T06:33:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.17352","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL 
Cycles","primary_cat":"cs.CV","submitted_at":"2025-03-21T17:52:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12937","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","primary_cat":"cs.AI","submitted_at":"2025-03-17T08:51:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":94,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IPVR [51]; Multimodal-CoT [29]; CoT-PT [52]; PromptCoT [53]; VCoT [54];PCoT [55]; MM-CoT [56]; HoT [57]; CoTDet [58]; DDCoT [59]; CPSeg [60];Gen2Sim [61]; CoI [62]; MC-CoT [63]; CCoT [64]; LoT [65]; DPMM-CoT 
[66];GCoT [23]; CoCoT [67]; KAM-CoT [68]; PKRD-CoT [69]; CoS [70]; CoA [71];Det-CoT [72]; BDoG [73]; TextCoT [74]; CoRAG [75]; Cantor [76];Visual Sketchpad [77]; IoT [78]; PS-CoT [79]; G-CoT [80]; STIC [81];SNSE-CoT [82]; CoE [83]; DCoT [84]; Layoutllm-t2i [85]; Creatilayout [86];visual-o1 [87]; R-CoT [88]; LLaV A-CoT [9]; VIC [89]; RelationLMM [90];Insight-V [91]; LLaV A-Aurora [92]; AR-MCTS [93]; Mulberry [94]; Virgo [95];Socratic [96]; LlamaV-o1 [97]; MV oT [30]; PARM++ [34]; URSA [98];Multimodal Open R1 [99]; AStar [100]; R1-OneVision [101]; SoT [102] Video CaVIR [103]; VideoAgent [104]; Track-LongVideo [105]; CaRDiff [106];V oT [31]; R3CoT [107]; Antgpt [108]; Grounding-Prompter [109];VIP [110]; DreamFactory [111]; Chain-of-Shot [112]; TI-PREGO [113]"},{"citing_arxiv_id":"2503.10615","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization","primary_cat":"cs.CV","submitted_at":"2025-03-13T17:56:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.07536","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","primary_cat":"cs.CL","submitted_at":"2025-03-10T17:04:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 
4.5-4.8%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024. 3 [84] Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative prefer- ence learning. arXiv preprint arXiv:2405.00451, 2024. 1 [85] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319 , 2024. 1 [86] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou."},{"citing_arxiv_id":"2502.17419","ref_index":234,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-02-24T18:50:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AtomThink [231] Synthetic Data/Exploration In-Context Learning>100 SFT & RL ✓ I T AD, IPR, EB RedStar [54] Distill Human-Designed 2 SFT ✓ I T AD, VC Auto-CoT [13] Exploration In-Context Learning 2 ICL ✗ T AD, IPR, GO PoT [232] Verification In-Context Learning1 ICL ✗ T AD, IPR, GO PAL [233] Verification In-Context Learning 1 ICL ✗ T AD, IPR, GO Decomposed Prompt [234]Exploration Human-Designed 3 ICL ✗ T AD, IPR Least-to-Most [235] Exploration Human-Designed 2 
ICL ✗ T AD, IPR CoR-Math [236] Synthetic Data Human-Designed 3 SFT ✓ T AS, SR, NLR 3.2.5 Reinforcement Fine-T uning Reinforcement Fine-Tuning (RFT) [241] is an innovative technique recently introduced by OpenAI, designed to en- able developers and engineers to fine-tune existing models"},{"citing_arxiv_id":"2502.02871","ref_index":235,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T04:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.05366","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Search-o1: Agentic Search-Enhanced Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2025-01-09T16:48:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding, and QA tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"expert model via self-improvement. CoRR, abs/2409.12122, 2024. [68] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop ques- tion answering. In EMNLP, pages 2369-2380, Brussels, Belgium, October-November 2018. 
Association for Computational Linguistics. [69] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319, 2024. [70] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao."}],"limit":50,"offset":0}