{"total":13,"items":[{"citing_arxiv_id":"2605.20177","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:58:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10765","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:59:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"61 94.16 61.37 54.43 67.92 66.53 65.11 67.48 - w/o Cross-Modal Attn. 69.39 59.44 94.14 61.11 47.77 67.87 66.55 64.22 66.31 -1.17 w/o Null-Space Proj. 69.94 55.78 83.6261.4849.83 65.4367.43 65.3864.86 -2.62 Grounding (RefCOCO) [19, 30], VQAv2 [9], and OCR-VQA [31]. The second is UCIT [10], which contains six sequential tasks: ArxivQA [ 21], CLEVR-Math [22], IconQA [28], ImageNet-R [13], VizWiz-caption [12], and Flickr30k [32]. Together, these two benchmarks let us evaluate our method in both a widely used MCIT setting and a cleaner setting with reduced data-overlap concerns. Comparison Methods.We compare DRAPEwith classic prompt-based continual learning ap- proaches, including CODA-Prompt [36], DualPrompt [41], and L2P [54], as well as recent MCIT"},{"citing_arxiv_id":"2604.14646","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Targeted Exploration via Unified Entropy Control for Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-16T05:52:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14016","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAny: Merge Anything for Multimodal Continual Instruction Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-15T15:57:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13395","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quantifying and Understanding Uncertainty in Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-15T01:53:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Conformal prediction adapted to reasoning-answer pairs in LRMs yields distribution-free uncertainty sets with finite-sample guarantees, paired with a Shapley explanation method that isolates provably sufficient training subsets and steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20856","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06856","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-07T16:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As a result, the number of training samples grows from 16.3M in InternVL2.5 to 21.7M in InternVL3. For MPO data, we construct preference pairs based on the data pipeline and samples proposed in MMPR v1.2 [ 124], which cover a wide range of domains, including general visual question answering (VQA) [43, 50, 90, 83, 127, 126], science [57, 16, 82], chart [91, 54, 11], mathematics [72, 104, 10, 81, 55, 40, 147, 106], OCR [92, 107, 9, 49, 96], and document [24]. We use the SFT versions of InternVL3-8B, 38B, and 78B to generate rollouts. During the MPO phase, all models are trained on the same dataset, which comprises about 300K samples. 6 2.4 Test-Time Scaling Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and"},{"citing_arxiv_id":"2503.07536","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","primary_cat":"cs.CL","submitted_at":"2025-03-10T17:04:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[36] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Ed- wards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In ICLR, 2023. 1, 6, 13 [37] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. VILA: On pre-training for vi- sual language models. In CVPR, 2024. 2 [38] Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 13 [39] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 3 [40] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning."},{"citing_arxiv_id":"2412.05271","ref_index":144,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"GQA [98], OKVQA [178], A-OKVQA [205], Visual7W [317], VisText [226], VSR [147], TallyQA [2],General QA Objects365-YorN [208], IconQA [167], Stanford40 [273], VisDial [51], VQAv2 [74], Hateful-Memes [111] MA VIS [300], GeomVerse [107], MetaMath-Rendered [281], MapQA [23], GeoQA+ [20], Geometry3K [164],Mathematics UniGeo [26], GEOS [206], CLEVR-Math [144] ChartQA [181], PlotQA [187], FigureQA [105], LRV-Instruction [148], ArxivQA [132], MMC-Inst [149], TabMWP [166], DVQA [104], UniChart [182], SimChart9K [263], Chart2Text [191], FinTabNet [312],Chart SciTSR [39], Synthetic Chart2Markdown LaionCOCO-OCR [204], Wukong-OCR [75], ParsynthOCR [89], SynthDoG-EN [112], SynthDoG-ZH [112], SynthDoG-RU [112], SynthDoG-JP [112], SynthDoG-KO [112], IAM [180], EST-VQA [253], ST-VQA [17],"},{"citing_arxiv_id":"2411.10442","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Thus, the cost of our pipeline is only 57.5% of that of RLAIF-V . Additionally, a comparison with other recent data pipelines [25, 71, 120] is also presented in Section 5.2.2. 3 Task Dataset General VQA VQAv2 [30], GQA [35], OKVQA [64], IconQA [60] Science AI2D [40], ScienceQA [61], M3CoT [16] Chart ChartQA [65], DVQA [38], MapQA [13] Mathematics GeoQA+ [12], CLEVR-Math [52], Geometry3K [59], GEOS [85], GeomVerse [39], Geo170K [28] OCR OCRVQA [69], InfoVQA [67], TextVQA [86], STVQA [8], SROIE [34] Document DocVQA [66] Table 1. Datasets used to build our preference dataset. 3.2. Multimodal Preference Dataset Dataset Statistics. Using this pipeline, we build a large- scale multimodal preference dataset, MMPR. Data exam-"},{"citing_arxiv_id":"2407.03320","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vision Capability Enhancement WanJuan [46], Flicker[160], MMC-Inst[82], RCTW-17[130], CTW[165], LSVT[137], ReCTs[175], ArT[28] Table 1. Datasets used for Pre-Training. The data are collected from diverse sources for the three objectives. Task Dataset Caption ShareGPT4V [17], COCO [21], Nocaps [1] General QA VQAv2 [4], GQA [53], OK-VQA [105] VD [32], RD [16], VSR [81], ALLaV A-QA [15] Multi-Turn QA MMDU [92] Science QA AI2D [61], SQA [98], TQA [62], IconQA [97] Chart QA DVQA [58], ChartQA [106], ChartQA-AUG [106] Math QA MathQA [161], Geometry3K [96], TabMWP [99], CLEVR-MATH [80], Super [75] World Knowledge QA A-OKVQA [127], KVQA [128], ViQuAE [65] OCR QA TextVQA [133], OCR-VQA [109], ST-VQA [11] HD-OCR QA InfoVQA[108], DocVQA [107], TabFact [20],"},{"citing_arxiv_id":"2404.16821","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites","primary_cat":"cs.CV","submitted_at":"2024-04-25T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024. 3 [57] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 3 [58] Adam Dahlgren Lindstr ¨om and Savitha Sam Abra- ham. Clevr-math: A dataset for compositional lan- guage, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 5 [59] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. TACL, 11:635-651, 2023. 5 [60] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang."}],"limit":50,"offset":0}