{"total":19,"items":[{"citing_arxiv_id":"2605.11405","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone","primary_cat":"cs.LG","submitted_at":"2026-05-12T01:51:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05045","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise","primary_cat":"cs.CV","submitted_at":"2026-05-06T15:41:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02487","ref_index":57,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visibility-Aware Mobile Grasping in Dynamic Environments","primary_cat":"cs.RO","submitted_at":"2026-05-04T11:41:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A visibility-aware mobile grasping system with iterative whole-body planning and behavior-tree subgoal generation achieves 68.8% success in unknown static and 58% in dynamic environments, outperforming a baseline by 22.8% and 18%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27472","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations","primary_cat":"cs.AI","submitted_at":"2026-04-30T06:14:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17472","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniMesh: Unifying 3D Mesh Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2026-04-19T14:53:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11490","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Anthropogenic Regional Adaptation in Multimodal Vision-Language 
Model","primary_cat":"cs.AI","submitted_at":"2026-04-13T13:56:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08456","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-09T16:51:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08050","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning","primary_cat":"cs.CV","submitted_at":"2026-04-09T09:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Transformer MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05672","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-07T10:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.13778","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","primary_cat":"cs.RO","submitted_at":"2025-10-15T17:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and 
Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"0 31.6 / 65.0 90.1 / 96.4 76.7 MiniCPM-V-4-4B [164] 80.9 / 91.4 73.0 81.4 94.0 67.0 862 67.0 31.9 / 56.4 80.9 / 90.1 75.0 InternVL3.5-4B 82.6 / 92.3 86.0 77.9 92.4 78.0 822 69.4 39.6 / 71.1 91.6 / 97.0 80.0 Ovis1.6-Gemma2-9B [77] 84.4 / - - - - - 830 - - - - MiniCPM-V2.6-8B [164] 82.1 / - 82.4 80.1 90.8 - 852 65.7 31.0 / 57.1 73.9 / 85.7 - Molmo-7B-D [30] - / 93.2 84.1 81.7 92.2 72.6 694 - - - - Qwen2-VL-7B [138] 83.0 / 92.1 83.0 84.3 94.5 76.5 866 69.0 - 89.7 / 93.8 - Qwen2.5-VL-7B [5] 83.9 / - 87.3 84.9 95.7 82.6 864 70.4 42.5 / 73.9 - - Keye-VL-8B [126] 85.8 / 88.5 72.5 75.7 87.0 63.0 853 67.8 36.8 / 75.2 - - GLM-4.1V-9B [46] 82.2 / 87.0 70.0 79.6 93.3 80.3 823 71.8 53.4 / 82.4 32.7 / 55.2 72.5"},{"citing_arxiv_id":"2505.07062","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-05-11T17:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"and extract noun phrases and entities from captions, and then adopt Grounding DINO [14, 80] to annotate diverse open-vocabulary objects in web images. We filter out low-quality annotations with CLIP [106] and heuristic metrics,e.g., non-maximum suppression. The automatic annotation pipeline brings about 200 million samples and 200 billion tokens. Point Data. Initially, we utilized the public data provided by PixMo-Points [21]. Recognizing limitations in the diversity and quantity of the available PixMo data, we developed a dedicated pipeline for generating additional pointing data. This pipeline employs Molmo [21] and CountGD [3] to annotate the center points of objects within a large collection of web images. 
Notably, CountGD proved particularly effective in annotating"},{"citing_arxiv_id":"2504.13181","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perception Encoder: The best visual embeddings are not at the output of the network","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"9 784 60.0 21.0 / 40.6 32.9 / 59.2 62.0 InternVL2.5-2B [18] 74.9 / 83.5 79.2 74.3 88.7 60.9 804 60.9 21.3 / 49.7 93.2 / 97.6 72.1 InternVL3-2B 78.7 / 87.4 80.2 77.0 88.3 66.1 835 64.6 28.3 / 54.7 91.2 / 96.9 74.7 Ovis1.6-Gemma2-9B [84] 84.4 / - - - - - 830 - - - - MiniCPM-V2.6 [135] 82.1 / - 82.4 80.1 90.8 - 852 65.7 31.0 / 57.1 73.9 / 85.7 - Molmo-7B-D [31] - / 93.2 84.1 81.7 92.2 72.6 694 - - - - Qwen2-VL-7B [121] 83.0 / 92.1 83.0 84.3 94.5 76.5 866 69.0 - 89.7 / 93.8 - Qwen2.5-VL-7B [7] 83.9 / - 87.3 84.9 95.7 82.6 864 70.4 42.5/73.9 - - InternVL2-8B [19] 83.8 / 91.7 83.3 77.4 91.6 74.8 794 67.5 31.2 / 56.1 37.9 / 61.5 69.7 InternVL2.5-8B [18] 84.5 / 92.8 84.8 79.1 93.0 77.6 822 69.7 32.9 / 68.6 92."},{"citing_arxiv_id":"2504.05299","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SmolVLM: Redefining small and efficient multimodal models","primary_cat":"cs.AI","submitted_at":"2025-04-07T17:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.01743","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs","primary_cat":"cs.CL","submitted_at":"2025-03-03T17:05:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition 
leaderboards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13106","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Both types of data enhance the model's understanding of images, supporting more accurate object localization and recognition in complex scenes. 3.2.3 Multi-task Fine-tuning Table 3 Data mixture in massive multi-task fine-tuning stage. Task Dataset Amount Image & Text Data General LLaVA-SFT-665K [38], LLaVA-OV-SI [29], Cambrian-cleaned [39], Pixmo (docs, cap, points, cap-qa, ask-model-anything) [35] 9.87M Document DocVQA [40], Docmatix [41] 1.31M Chart/Figure ChartQA [42], MMC_Instruction [83], DVQA [84], LRV_Instruction [85], Chart- Gemma [86], InfoVQA [87], PlotQA [88] 1.00M OCR MultiUI [89], in-house data 0.83M Grounding RefCoco [90], VCR [91], in-house data 0.50M Multi-Image Demon-Full [92], Contrastive_Caption [93] 0.41M Text-only Magpie [94], Magpie-Pro [94], Synthia [95], Infinity-Instruct-subjective [82], Numina-"},{"citing_arxiv_id":"2412.10302","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding","primary_cat":"cs.CV","submitted_at":"2024-12-13T17:37:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"GPT-4V [192] 63.1 - - 58.1 - / 24.0 32.8 18.0 GPT-4o-20240513 [192] 69.1 - 54.0 / 49.7 / 51.9 63.8 - / 30.4 50.2 25.9 Claude-3.5-Sonnet [8] 68.3 - 55.0 / 48.0 / 51.5 67.7 - - - Gemini-1.5-Pro [200] 62.2 - 49.4 / 44.4 / 46.9 63.9 - / 19.2 - - LLaV A-OneVision-72B [124] 56.8 - 38.0 / 24.0 / 31.0 67.5 - 39.1 - NVLM-D-72B [50] 59.7 54.6 - 66.6 - - - Molmo-72B [54] 54.1 - - 58.6 - - - Qwen2-VL-72B [246] 64.5 - 49.2 / 43.3 / 46.2 70.5 - / 25.9 - 11.2 InternVL2-Llama3-76B [35] 62.7 55.1 41.9 / 38.0 / 40.0 65.5 23.7 / 23.6 42.8 5.5 InternVL2.5-78B 70.1 61.8 51.4 / 
45.9 / 48.6 72.3 34.9 / 32.2 51.7 11.6 Table 6:Comparison of multimodal reasoning and mathematical performance.MMMU [ 289] and MMMU-Pro [290] are multidisciplinary reasoning benchmarks, while MathVista [163], MATH-Vision [245],"}],"limit":50,"offset":0}
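For consuming this payload programmatically, here is a minimal sketch using only Python's standard library. The field names (items, verdict, novelty_score, context_count, top_context_role, submitted_at, etc.) come from the payload above; the local filename citations.json and the particular summaries computed (a verdict tally and a listing of entries that carry quoted context) are illustrative assumptions, not part of the API.

```python
import json
from collections import Counter

# Load the payload shown above; "citations.json" is an assumed local filename.
with open("citations.json", encoding="utf-8") as f:
    payload = json.load(f)

items = payload["items"]

# Tally verdicts across the citing papers (e.g., ACCEPT / CONDITIONAL / UNVERDICTED).
verdicts = Counter(item["verdict"] for item in items)
print("verdicts:", dict(verdicts))

# List entries that quote the cited work (context_count > 0), newest first.
# ISO-8601 timestamps with a fixed UTC offset sort correctly as strings.
with_context = sorted(
    (item for item in items if item["context_count"] > 0),
    key=lambda item: item["submitted_at"],
    reverse=True,
)
for item in with_context:
    print(f'{item["citing_arxiv_id"]}  role={item["top_context_role"]}  '
          f'novelty={item["novelty_score"]}  {item["paper_title"][:60]}')
```

If the collection ever exceeds one page, total, limit, and offset suggest standard offset pagination (here total is 19 with limit 50, so this single response is complete).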