{"total":24,"items":[{"citing_arxiv_id":"2605.22269","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering","primary_cat":"cs.CV","submitted_at":"2026-05-21T10:13:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17283","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","primary_cat":"cs.CL","submitted_at":"2026-05-17T06:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17260","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-17T05:02:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14310","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoRDS: Coreset-based Representative and 
Diverse Selection for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T03:22:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09904","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:47:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958- 22967, 2025. [38] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. InEuropean conference on computer vision, pages 396-416. Springer, 2024. [39] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025. [40] Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. 
Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models."},{"citing_arxiv_id":"2605.09874","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-11T01:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07897","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:40:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To see is to believe: Prompting gpt-4v for better visual instruction tuning.arXiv preprint arXiv:2311.07574, 2023. [33] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. [34] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. 
Internvideo2.5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025. [35] Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, and Yiwei Wang."},{"citing_arxiv_id":"2605.06094","ref_index":42,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025. [41] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. [42] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025. [43] Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. 
Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and"},{"citing_arxiv_id":"2605.04515","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Priors to Perception: Grounding Video-LLMs in Physical Reality","primary_cat":"cs.CV","submitted_at":"2026-05-06T05:48:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. [35] Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025. [36] Qwen Team. Qwen 3.5 technical report.https://qwen.ai/blog?id=qwen3.5, 2026. [37] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025. [38] Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. 
Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models."},{"citing_arxiv_id":"2605.00496","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions","primary_cat":"cs.CV","submitted_at":"2026-05-01T08:12:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21873","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grounding Video Reasoning in Physical Signals","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:17:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12335","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-14T06:17:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video 
understanding benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Following the proposed synthetic data generation pipeline, we generate approximately 5K training videos and 1K validation videos using images from the MSCOCO dataset. These generated videos (see figure 4), along with their associated multimodal annotations, form a synthetic dataset that is used to fine-tune a multimodal video foundation model. In this work, we adopt InternVideo2.5 [91] as the base video foundation model. To investigate the effectiveness of different supervision signals, we fine-tune the model using three types of training configurations: (1) caption-only supervision, following the conventional fine-tuning paradigm; (2) captions augmented with visual question-answer pairs; and (3) VQA-only supervision. Furthermore, we evaluate the fine-tuned model on two"},{"citing_arxiv_id":"2604.11240","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-13T09:44:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08966","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Should Video LLMs Output Time? 
An Analysis of Efficient Temporal Grounding Paradigms","primary_cat":"cs.CV","submitted_at":"2026-04-10T05:10:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08337","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-09T15:10:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04372","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-06T02:43:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"GPT-4o(2024-05-13)[21] - 57.5 67.9 1.87 69.5 34 62.1 61.2 Open-Source LMMs 
(5B<) Qwen2.5-VL [5] 3B 67.0 61.5 1.63 64.4 27.9 38.6 42.3 Qwen3-VL [4] 4B 68.9 69.3 - - 58.4 - 56.2 InternVL3.5 [46] 4B 71.2 65.4 1.59 65.4 54.9 45.2 58.3 + G2F-RAG (Ours)4B 74.8+3.6 70.1+4.7 1.66+0.07 68.7+3.3 59.0+4.1 47.1+1.9 62.4+3.9 Open-Source LMMs (>5B, 10B<) InternVideo2.5 [47] 7B 75.7 65.1 - - - - 43.0 VideoLLaMA 3 [58] 7B 69.7 61.0 - - - - 46.0 VILA-1.5 [30] 8B - - - 58.8 28.9 - 20.9 MiniCPM-V 2.6 [55] 8B 44.7 59.7 1.70 59.6 - 46.4 - MiniCPM-V 4.5 [57] 8B - 67.9 - - - - - InternVL2.5 [9] 8B 70.5 63.7 1.68 68.7 41.6 - - InternVL3 [65] 8B 73.2 66.0 1.69 70.4 41.6 - 48.9 Qwen3-VL [4] 8B 68.7 71.4 - - 59.4 - 65.3 LLaV A-Video [62] 7B 62."},{"citing_arxiv_id":"2512.21334","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Streaming Video Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2025-12-24T18:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13511","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adapting MLLMs for Nuanced Video Retrieval","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03043","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OneThinker: All-in-one Reasoning Model for Image and Video","primary_cat":"cs.CV","submitted_at":"2025-12-02T18:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24943","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents","primary_cat":"cs.CV","submitted_at":"2025-09-29T15:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.15602","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?","primary_cat":"cs.CV","submitted_at":"2025-09-19T05:08:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with 
automated pipelines and human verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01844","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","primary_cat":"cs.LG","submitted_at":"2025-06-02T16:30:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16933","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning","primary_cat":"cs.LG","submitted_at":"2025-05-22T17:23:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Advancing universal audio understanding via unified large-scale audio-language models,\" arXiv preprint arXiv:2311.07919, 2023. [9] S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, \"Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,\"arXiv preprint arXiv:2406.11768, 2024. [10] Y . Wang, X. Li, Z. Yan, Y . He, J. Yu, X. Zeng, C. Wang, C. Ma, H. Huang, J. Gao et al., \"Internvideo2. 
5: Empowering video mllms with long and rich context modeling,\" arXiv preprint arXiv:2501.12386, 2025. [11] L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, Z. Tang, L. Yuan et al., \"Sharegpt4video: Improving video understanding and generation with better captions,\""},{"citing_arxiv_id":"2504.06958","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-04-09T15:09:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}