{"total":11,"items":[{"citing_arxiv_id":"2605.19852","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T13:44:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07348","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition","primary_cat":"cs.CV","submitted_at":"2025-12-08T09:40:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MICo-150K is a new 150K-image dataset with 7 tasks, a De&Re real-image subset, MICo-Bench, and Weighted-Ref-VIEScore metric that improves AI models for generating consistent composites from arbitrary numbers of reference images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.21122","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation","primary_cat":"cs.CV","submitted_at":"2025-10-24T03:23:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to 
improve multimodal CoT generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06856","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-07T16:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":148,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"3 OCR, Chart, and Document Understanding To assess the model's integrated vision-language understanding in tasks involving text, document, and chart comprehension, we perform a comprehensive evaluation over nine benchmarks, including AI2D [57], ChartQA [91], TextVQA [107], DocVQA [93], InfoVQA [92], OCRBench [76], SEED-2-Plus [61], CharXiv [128], and VCR [148]. 
As illustrated in Table 3, the InternVL3 series not only maintains robust performance across these benchmarks but also demonstrates competitive or superior results when compared to other open-source and closed-source counterparts. At the 1B scale, InternVL3-1B achieves performance that is roughly on par with previous lower-scale models. At the 2B scale, InternVL3-2B not only improves its absolute scores-for instance, reaching 78."},{"citing_arxiv_id":"2503.16549","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","primary_cat":"cs.CV","submitted_at":"2025-03-19T11:46:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":161,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Healthcare and MedicalMM-PEAR-CoT [129]; StressSelfRefine [151]; TI-PREGO [113];Chain-of-Look [152]; MedCoT [153]; MedVLM-R1 [154] Social and HumanChain-of-Sentiment [155]; Chain-of-Exemplar [83]; 
Chain-of-Empathetic [128] Multimodal Generation PARM++ [34]; RPG-DiffusionMaster [156]; L3GO [125]; 3D-PreMise [124] Benchmark MLLM Finetuning with Rationale ScienceQA [157]; A-OKVQA [158]; T-SciQ [55]; VideoCoT [159]; VideoEspresso [160]; EgoCoT [39]; EMMA-X [140]; M3CoT [28]; MAVIS [161]; CoT-CAP3D [126] Downstream Capability Assessment With Rationale VMMMU [162]; SEED [163]; MathVista [164]; MathVerse [165]; Math-Vision [166]; Emma [167]; Migician [135]; RefAVS [168]; VSIBench [169]; MeViS [170]; HallusionBench [171]; AVTrustBench [172]; AVHBench [173] Without Rationale CoMT [174]; WorldQA [175]; MiCEval [176]; OlympiadBench [177]; MME-CoT [178]; OmniBench [179]"},{"citing_arxiv_id":"2412.10302","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding","primary_cat":"cs.CV","submitted_at":"2024-12-13T17:37:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chang, Z. Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models. arXiv preprint arXiv:2404.07973, 2024. [109] R. Zhang, X. Wei, D. Jiang, Y. Zhang, Z. Guo, C. Tong, J. Liu, A. Zhou, B. Wei, S. Zhang, P. Gao, and H. Li. Mavis: Mathematical visual instruction tuning, 2024. URL https://arxiv.org/abs/2407.08739. [110] B. Zheng, B. Gou, J. Kil, H. Sun, and Y. Su. Gpt-4v(ision) is a generalist web agent, if grounded. 2024. URL https://openreview.net/forum?id=piecKJ2DlB. [111] X. Zheng, D. Burdick, L. Popa, P. Zhong, and N. X. R. Wang. 
Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. Winter Conference for Applications in Computer Vision (WACV), 2021."},{"citing_arxiv_id":"2411.10442","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024. 5, 6 [114] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024. 5, 6 [115] Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, et al. Mavis: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739, 2024. 3 [116] Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 
Evaluating the performance of large language models on gaokao benchmark."},{"citing_arxiv_id":"2407.07895","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2024-07-10T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"large language models with zero-initialized attention. In The Twelfth International Conference on Learning Representations, 2024. [65] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024. [66] Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, et al. Mavis: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739, 2024. [67] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. 
Llava-next: A strong zero-shot video under-"},{"citing_arxiv_id":"2403.14624","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","primary_cat":"cs.CV","submitted_at":"2024-03-21T17:59:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}