{"total":11,"items":[{"citing_arxiv_id":"2606.29445","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction","primary_cat":"cs.CV","submitted_at":"2026-06-28T15:11:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17065","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning","primary_cat":"cs.MA","submitted_at":"2026-05-16T16:15:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PyraVid is a hierarchical multimodal memory system that structures long videos into pyramids to improve long-horizon reasoning and evidence aggregation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.27259","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark","primary_cat":"cs.CV","submitted_at":"2026-03-28T12:44:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.05299","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SmolVLM: Redefining small and efficient multimodal models","primary_cat":"cs.AI","submitted_at":"2025-04-07T17:58:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13826","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos","primary_cat":"cs.CV","submitted_at":"2025-01-23T16:51:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022. 6 [30] Marija Sabli'c, Ana Mirosavljevi'c, and Alma Škugor. Video- based learning (vbl)-past, present and future: An overview of the research published from 2008 to 2019. Technology, Knowledge and Learning, 26(4):1061-1077, 2021. 2 [31] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 2 [32] Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context."},{"citing_arxiv_id":"2501.05067","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-09T08:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.03320","ref_index":135,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2, 8, 9 [134] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understand- ing. arXiv preprint arXiv:2307.16449, 2023. 2 [135] Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question-aware sparse memory for long video question answering. arXiv preprint arXiv:2404.17176, 2024. 2 [136] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiy- ing Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Genera-"},{"citing_arxiv_id":"2406.08035","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LVBench: An Extreme Long Video Understanding Benchmark","primary_cat":"cs.CV","submitted_at":"2024-06-12T09:36:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.07476","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2024-06-11T17:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.04264","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLVU: Benchmarking Multi-task Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2024-06-06T17:09:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"focus on one specific task, like captioning [52]. These limi- tations hinder comprehensive evaluation of LVU capabilities. Last but not least, many previous evaluation tasks are not properly designed for LVU, as they can be solved without using the complex information from long videos. For ex- ample, many questions are simply about one single frame in the long videos [ 41, 60]. Besides, numerous others are about popular movies and celebrities [13, 27], which can be answered directly by MLLMs based on the textual prompts. Conceptually, MLLMs are expected to handle any type of long video and accomplish any related tasks. Therefore, the evaluation of LVU should emphasize two important prop- erties: length and diversity."},{"citing_arxiv_id":"2404.16994","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","primary_cat":"cs.CV","submitted_at":"2024-04-25T19:29:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}