{"total":15,"items":[{"citing_arxiv_id":"2512.10362","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-11T07:22:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.10479","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804-4814, 2022. 1, 3 [67] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The Conference on Empirical Methods in Natural Language Processing, pages 292-305, 2023. 12, 13 [68] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 
1 [69] Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al."},{"citing_arxiv_id":"2504.09925","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","primary_cat":"cs.CV","submitted_at":"2025-04-14T06:33:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":140,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024. 25 [139] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The Conference on Empirical Methods in Natural Language Processing, pages 292-305, 2023. 
21 [140] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 1 [141] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L"},{"citing_arxiv_id":"2410.05970","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling","primary_cat":"cs.CV","submitted_at":"2024-10-08T12:17:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.13257","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","primary_cat":"cs.CV","submitted_at":"2024-08-23T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.03320","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and 
Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"main robustness in visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 8 [76] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 2 [77] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2 [78] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language"},{"citing_arxiv_id":"2404.16821","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are We to GPT-4V? 
Closing the Gap to Commercial Multimodal Models with Open-Source Suites","primary_cat":"cs.CV","submitted_at":"2024-04-25T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024. 3, 6 [54] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. In CVPR, pages 14963-14973, 2023. 5 [55] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 3 [56] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 
Moe-llava:"},{"citing_arxiv_id":"2403.20330","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Are We on the Right Way for Evaluating Large Vision-Language Models?","primary_cat":"cs.CV","submitted_at":"2024-03-29T17:59:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"1 48.7 37.5 20.3 36.7LLaV A-1.5[24] (Vicuna-v1.5-7B[8]) 7B LVLM 34.4 65.0 68.7 55.6 65.6 23.6 52.2 LLM 32.8 8.9 64.0 48.3 31.9 18.9 34.1 LVLM-text 34.2 26.2 71.9 63.3 38.1 29.4 43.9InternLM2-XC2[12] (InternLM2-7B[42]) 7B LVLM 41.7 79.6 96.7 81.4 74.9 57.4 72.0 LLM 19.8 8.4 52.7 42.6 7.6 20.5 25.3 LVLM-text 32.4 15.6 71.1 56.8 36.1 25.0 39.5Monkey-Chat[23] (Qwen-7B[1]) 10B LVLM 37.1 71.0 82.4 68.5 69.1 34.0 60.4 LLM 29.9 10.3 58.9 42.5 32.6 22.0 32.7 LVLM-text 30.1 15.5 54.6 52.5 36.7 25.0 35.7CogVLM-Chat[45] (Vicuna-v1.5-7B[8]) 17B LVLM 34.2 63.4 66.3 63.3 68.7 34.7 55.1 LLM 37.1 10.5 53.6 57.3 37.3 21.7 36.3 LVLM-text 37.3 23.2 68.6 59.9 41.0 22.7 42.1Yi-VL[49] (Yi-34B[49]) 34B LVLM 43.2 71.5 75.3 65."},{"citing_arxiv_id":"2403.09611","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","primary_cat":"cs.CV","submitted_at":"2024-03-14T17:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, 
interleaved, and text-only data with optimized image encoders.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"backbone is adapted to the new resolution during fine-tuning. Through this method, we have fine-tuned our model to support image resolutions ranging from 448×448, 560×560, to 672×672. Note that, for a resolution of 672×672, with a patch size of 14×14, an image is represented with 2,304 tokens. Sub-image decomposition, recently introduced by SPHINX [73], Monkey [69], and LLaVA-NeXT [75]. Computing self-attention among more than 2,000 image tokens is computationally challenging, limiting further scaling to even higher image resolutions. Following SPHINX [73], as shown in Figure 7a, for a high-resolution input image, e.g., 1344 × 1344, we construct five images of 672 × 672, and feed them as independent images into our visual encoder."},{"citing_arxiv_id":"2402.00253","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Hallucination in Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-02-01T00:33:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.16420","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model","primary_cat":"cs.CV","submitted_at":"2024-01-29T18:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B 
to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.14238","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"filtering, thus retaining the vast majority of the data. We considered six factors: CLIP similarity, watermark probability, unsafe probability, aesthetic score, image resolution, and caption length, to remove extreme data points and avoid disrupting training stability. Additionally, we removed data that was duplicated with ImageNet-1K/22K [38], Flickr30K [116], and COCO [89] to ensure the reliability of our zero-shot evaluations. Due to download failures and the use of our data filtering pipeline, the total amount of data retained in the first stage was 4.98 billion. (2) Stage 2: In the second stage, we implemented a more stringent data filtering strategy. 
With generative supervision included, we deleted most of the low-quality data based on"},{"citing_arxiv_id":"2307.06281","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MMBench: Is Your Multi-modal Model an All-around Player?","primary_cat":"cs.CV","submitted_at":"2023-07-12T16:23:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.13549","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T15:21:52+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CogAgent [44] uses a dual-encoder mechanism, where two encoders process high and low-resolution images, respectively. High-resolution features are injected into the low-resolution branch through cross-attention. Patch-division methods cut a high-resolution image into patches and reuse the low-resolution encoder. For example, Monkey [51] and SPHINX [53] divide a large image into smaller patches and send sub-images together with a downsampled high-resolution image to the image encoder, where the sub-images and the low-resolution image capture local and global features, respectively. 
In contrast, parameter size and training data composition are of less importance compared with input resolution, found by empirical studies [52]."}],"limit":50,"offset":0}