{"total":24,"items":[{"citing_arxiv_id":"2605.27959","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-27T04:52:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19846","ref_index":26,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-19T13:40:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18160","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vision Inference 
Former: Sustaining Visual Consistency in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T10:04:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05848","ref_index":40,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-07T08:23:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"mark our VideoRouter adaptations; the controlled InternVL3 comparison is provided in Table 2. 
6 Model Video-MME LongVideoBench MLVU LVBench EgoSchema Long Overall Average Duration 2386s 1010s 473s 651s 4101s 180s Proprietary MLLMs GPT-4V [36] 53.5 59.9 59.1 49.2 - 55.6 GPT-4o [37] 65.3 71.9 66.7 64.6 30.8 72.2 Open-source MLLMs Oryx-1.5 [38] - 58.8 56.3 67.5 - - MiniCPM-v2.6 [39] 51.8 60.9 54.9 37.3 - - mPLUG-Owl3 [40] 50.1 59.3 52.1 63.7 - - NVILA [41] 54.8 64.2 - 70.1 - - LLaVA-Video [6] - 63.3 58.2 70.8 41.5 57.3 Video-XL [42] 49.2 55.5 49.5 64.9 - - VideoLLaMA2 [43] - 47.9 - 48.5 - 51.7 Video-CCAM [44] 46.7 53.2 - 58.5 - - Kangaroo [45] 46.7 56.0 54.8 61.0 39.4 62.7 LongVA [46] 46.2 52.6 - 56.3 - - LongVILA [47] - 60.1 57.1 - - - LongVU [48] - 60.6 - 65.4 - -"},{"citing_arxiv_id":"2604.22498","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-24T12:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"guage generation problem by serializing spatial coordinates into text tokens. This line of work establishes grounding as a practical interface between language generation and object localization. To further improve grounding fidelity and reduce hallucination, subsequent approaches incorporate stronger supervision or optimization strategies, including preference learning [51] and reinforcement learning for single-image perception [27, 28]. 
More recently, several works have begun to extend grounding to multi-image tasks, including benchmark construction and post-training methods such as Migician [23], MIRG-RL [58], and GeM-VG [59]. While these efforts demonstrate the importance of grounding in multi-image scenarios, they typically rely on manually curated multi-image an-"},{"citing_arxiv_id":"2604.17087","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling","primary_cat":"cs.CV","submitted_at":"2026-04-18T17:52:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14149","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Note that the performance of proprietary models and 3∼9B-scale VLMs are provided for reference. 
Size LongVideoBench MLVU VideoMME (Long) LVBench Average Duration 473s 651s 2386s 4101s Proprietary Models GPT-4V [48] - 59.1 49.2 53.5 - GPT-4o [49] - 66.7 64.6 65.3 30.8 Gemini-1.5-Pro [58] - 64.0 - 67.4 33.1 3∼9B-Scale VLMs Qwen2.5VL-3B [4] 3B 43.3 68.2 - - mPLUG-Owl3 [73] 7B 52.1 - 50.1 43.5 VideoChat-Flash-7B [35] 7B 64.7 74.7 55.4 48.2 Eagle2.5-8B [8] 8B 66.4 77.6 - - Kangaroo [41] 8B 54.8 61.0 46.7 39.4 TimeMarker [10] 8B 56.3 - 46.4 41.3 InternVL3-9B [81] 9B 62.5 70.8 - - 2B-Scale VLMs InternVL3-2B [81] 2B 55.4 64.2 - - VideoChat-Flash-2B [35] 2B 58.3 65.7 44.9 42.9 XComp 2B 59.7 66.7 45.6 46.2 Discussion. Despite the complexity of duplicating the question token, it is viable to mitigate the"},{"citing_arxiv_id":"2604.07914","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction","primary_cat":"cs.CV","submitted_at":"2026-04-09T07:31:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05418","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG","primary_cat":"cs.CV","submitted_at":"2026-04-07T04:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K 
dataset.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"3 75.4 + PE 58.1 79.8 w/o Weighted Expectation 54.2 71.6 w/o Event-Boundary Detector 62.3 88.2 w/o Spatio-Temporal Graph 56.4 74.8 w/o Spatial Edges 57.2 79.3 w/o Temporal Edges 59.8 83.4 Full 64.5 92.2 ing the same experimental setup as Video-RAG [2] and TV-RAG [9]. Baseline MLLMs include GPT-4o [79], LLaVA-Video (7B/72B) [80], mPLUG-Owl3 (8B) [81], Aria (25B) [82], and InternVL-1.5 (26B) [83]. Results are compiled from the benchmarks' leaderboards and the original papers, with the remaining results obtained from our reproductions based on their open-sourced codes and original settings. In most cases, VideoStir outperforms both the native model and prior long-video RAG baselines, demonstrating highly competitive per-"},{"citing_arxiv_id":"2604.04379","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-06T03:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"2 67.9 - 57.5 62.1 30.8 66.7 Open-Source LMMs LLaVA-OneVision-7B [19] 32.4 33.8 58.2 - 56.7 - - - ShareGPT4Video-8B [5] - - - - - - - 39.7 LLaVA-Video-7B [56] 35.7 36.1 63.7 65.5 62.1 53.4 - 59.5 VILA-1.5-8B [24] 28.9 20.9 - 58.8 - - - - VideoLLaMA 3-7B [51] - 46.0 61.0 - - - 45.3 - MiniCPM-V 2.6-8B [47] - - 59.7 59.6 44.7 46.4 43.5 - mPLUG-Owl3-8B [48] - - 53.5 - 54.5 - - 59.8 InternVL2.5-8B [6] 41.6 - 63.7 68.7 70.5 - - - Qwen2.5-VL-7B [2] 
37.4 47.4 65.1 69.2 67.5 51.3 42.0 56.0 RL-based LMMs Video-R1 [9] 35.8 52.3 59.3 73.2 63.9 - - - STAR-R1 [23] 34.1 49.2 56.6 72.4 67.8 - - - TinyLLaVA-Video-R1 [55] - - 46.6 49.5 - - - - VideoChat-R1 [22] - 50.0 58.8 73.9 67.9 - - - VideoChat-R1.5 [43] - 51."},{"citing_arxiv_id":"2603.27259","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark","primary_cat":"cs.CV","submitted_at":"2026-03-28T12:44:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.20093","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback","primary_cat":"cs.CV","submitted_at":"2025-10-23T00:27:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StableSketcher improves text-to-sketch generation by fine-tuning a diffusion VAE and adding a VQA-based RL reward, while releasing the SketchDUO dataset of sketches with captions and QA pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.15602","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?","primary_cat":"cs.CV","submitted_at":"2025-09-19T05:08:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated 
pipelines and human verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00748","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-07-01T13:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.05831","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding","primary_cat":"cs.LG","submitted_at":"2025-06-06T07:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.05425","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-05T05:51:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that 
evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16416","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2025-05-22T09:05:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.14362","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepEyes: Incentivizing \"Thinking with Images\" via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-05-20T13:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.04326","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal 
LLMs","primary_cat":"cs.CV","submitted_at":"2025-02-06T18:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.12386","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling","primary_cat":"cs.CV","submitted_at":"2025-01-21T18:59:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities like object","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.00321","ref_index":143,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","primary_cat":"cs.CV","submitted_at":"2024-12-31T07:32:35+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific 
weaknesses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"Generative multimodal models are in-context learners,\" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 398-14 409. [142] J. Ye, H. Xu, H. Liu, A. Hu, M. Yan, Q. Qian, J. Zhang, F. Huang, and J. Zhou, \"mplug-owl3: Towards long image-sequence understanding in multi-modal large language models,\" arXiv preprint arXiv:2408.04840, 2024. [143] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song et al., \"Cogvlm: Visual expert for pretrained language models,\" arXiv preprint arXiv:2311.03079, 2023. [144] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., \"Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,\""},{"citing_arxiv_id":"2411.16771","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VidHal: Benchmarking Temporal Hallucinations in Vision LLMs","primary_cat":"cs.CV","submitted_at":"2024-11-25T06:17:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.08035","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LVBench: An Extreme Long Video Understanding Benchmark","primary_cat":"cs.CV","submitted_at":"2024-06-12T09:36:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using 
tasks designed to probe extended memory and comprehension.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}