{"total":14,"items":[{"citing_arxiv_id":"2605.20177","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:58:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00877","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models","primary_cat":"cs.MM","submitted_at":"2026-04-25T14:53:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04500","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward","primary_cat":"cs.CV","submitted_at":"2026-04-06T07:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and 
interpretability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In International conference on machine learning, pages 3319-3328. PMLR, 2017. 2 [66] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 6 [67] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025. 2 [68] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula,"},{"citing_arxiv_id":"2602.07605","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning","primary_cat":"cs.CV","submitted_at":"2026-02-07T16:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23322","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic 
Framework","primary_cat":"cs.CV","submitted_at":"2025-09-27T14:13:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22746","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","primary_cat":"cs.AI","submitted_at":"2025-09-26T04:33:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06856","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-07T16:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.21374","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","primary_cat":"cs.CV","submitted_at":"2025-05-27T16:05:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.17352","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","primary_cat":"cs.CV","submitted_at":"2025-03-21T17:52:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12937","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","primary_cat":"cs.AI","submitted_at":"2025-03-17T08:51:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive 
paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IPVR [51]; Multimodal-CoT [29]; CoT-PT [52]; PromptCoT [53]; VCoT [54]; PCoT [55]; MM-CoT [56]; HoT [57]; CoTDet [58]; DDCoT [59]; CPSeg [60]; Gen2Sim [61]; CoI [62]; MC-CoT [63]; CCoT [64]; LoT [65]; DPMM-CoT [66]; GCoT [23]; CoCoT [67]; KAM-CoT [68]; PKRD-CoT [69]; CoS [70]; CoA [71]; Det-CoT [72]; BDoG [73]; TextCoT [74]; CoRAG [75]; Cantor [76]; Visual Sketchpad [77]; IoT [78]; PS-CoT [79]; G-CoT [80]; STIC [81]; SNSE-CoT [82]; CoE [83]; DCoT [84]; Layoutllm-t2i [85]; Creatilayout [86]; visual-o1 [87]; R-CoT [88]; LLaVA-CoT [9]; VIC [89]; RelationLMM [90]; Insight-V [91]; LLaVA-Aurora [92]; AR-MCTS [93]; Mulberry [94]; Virgo [95]; Socratic [96]; LlamaV-o1 [97]; MVoT [30]; PARM++ [34]; URSA [98]; Multimodal Open R1 [99]; AStar [100]; R1-OneVision [101]; SoT [102] Video CaVIR [103]; VideoAgent [104]; Track-LongVideo [105]; CaRDiff [106]; VoT [31]; R3CoT [107]; Antgpt [108]; Grounding-Prompter [109]; VIP [110]; DreamFactory [111]; Chain-of-Shot [112]; TI-PREGO [113] Audio/Speech SpeechGPT-Gen [114]; CoT-ST [115]; LPE [116]; SpatialSonic [117]; Audio-CoT [118]; Audio-Reasoner 
[119]"},{"citing_arxiv_id":"2503.10615","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization","primary_cat":"cs.CV","submitted_at":"2025-03-13T17:56:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.05132","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-Zero's \"Aha Moment\" in Visual Reasoning on a 2B Non-SFT Model","primary_cat":"cs.AI","submitted_at":"2025-03-07T04:21:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.17419","ref_index":236,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-02-24T18:50:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RedStar [54] Distill Human-Designed 2 SFT ✓ I T AD, VC 
Auto-CoT [13] Exploration In-Context Learning 2 ICL ✗ T AD, IPR, GO PoT [232] Verification In-Context Learning 1 ICL ✗ T AD, IPR, GO PAL [233] Verification In-Context Learning 1 ICL ✗ T AD, IPR, GO Decomposed Prompt [234] Exploration Human-Designed 3 ICL ✗ T AD, IPR Least-to-Most [235] Exploration Human-Designed 2 ICL ✗ T AD, IPR CoR-Math [236] Synthetic Data Human-Designed 3 SFT ✓ T AS, SR, NLR 3.2.5 Reinforcement Fine-Tuning Reinforcement Fine-Tuning (RFT) [241] is an innovative technique recently introduced by OpenAI, designed to enable developers and engineers to fine-tune existing models for specific domains or complex tasks. Unlike general SFT, RFT focuses on optimizing the model's reasoning process by"}],"limit":50,"offset":0}