{"total":17,"items":[{"citing_arxiv_id":"2605.22570","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-21T14:48:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[41] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. [42] Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, and Yin Zhang. Videocogqa: A controllable benchmark for evaluating cognitive abilities in video-language models. arXiv preprint arXiv:2411.09105, 2024. [43] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195-22206, 2024. 
[44] Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay"},{"citing_arxiv_id":"2605.22109","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?","primary_cat":"cs.AI","submitted_at":"2026-05-21T07:42:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":59,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"LLaMA-Adapter [146] 7B 23.0 28.0 51.0 30.0 33.0 53.5 32.5 33.5 25.5 21.5 30.5 29.0 22.5 41.5 39.5 31.5 22.5 28.0 32.0 31.7 Video-ChatGPT [82] 7B 23.5 26.0 62.0 22.5 26.5 54.0 28.0 40.0 23.0 20.0 31.0 30.5 25.5 39.5 48.5 33.0 29.5 26.0 35.5 32.7 VideoChat [61] 7B 33.5 26.5 56.0 33.5 40.5 53.0 40.5 30.0 25.5 27.0 48.5 35.0 20.5 42.5 46.0 41.0 23.5 23.5 36.0 35.5 VideoChat2 [60] 7B 66.0 47.5 83.5 49.5 60.0 58.0 71.5 42.5 23.0 23.0 88.5 39.0 42.0 58.5 44.0 36.5 35.0 40.5 65.5 51.1 ST-LLM [76] 7B 66.0 
53.5 84.0 44.0 58.5 80.5 73.5 38.5 42.5 31.0 86.5 36.5 56.5 78.5 43.0 46.5 34.5 41.5 58.5 54.9 GPT-4V [87] - 55.5 63.5 72.0 46.5 73.5 18.5 59.0 29.5 12.0 40.5 83.5 39.0 12.0 22.5 45.0 52.0 31.0 59.0 11.0 43.5 PLLaVA [137] 34B 67."},{"citing_arxiv_id":"2605.17360","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-17T09:57:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14607","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ViMU: Benchmarking Video Metaphorical Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T09:23:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13228","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-13T09:19:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to 
translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"DVD [50] - - - 67.3 NVILA [23] 8B 70.1 68.1 64.2 STAR [6] - - - 70.0 VideoSeek [19] - - - 70.1 InternVL3.5-30B-A3B [37] 30B 73.0 72.1 68.7 RETOOL-VIDEO (Ours) 9B 81.5 72.9 76.6 5 Experiments 5.1 Experiment Settings Benchmarks and Evaluation Metrics. We evaluate RETOOL-VIDEO on three representative general-purpose video understanding benchmarks: MVBench [15], MLVU [52], and Video-MME w/o sub. [8]. MVBench focuses on short-video temporal understanding, while MLVU emphasizes long-video reasoning with tasks covering both global video comprehension and local evidence reasoning. Video-MME provides a comprehensive open-domain evaluation across diverse video durations and scenarios; we use its standard no-subtitle setting, denoted as Video-MME w/o sub."},{"citing_arxiv_id":"2605.11803","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:58:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We use this difficulty to distribute the compression budget non-uniformly across frame pairs, and execute compression within each pair according to its transport plan and allocated budget. 
In this way, OTT-Vid unifies token-level compression decisions with pair-level budget allocation in a single transport formulation. We evaluate OTT-Vid on four video question answering benchmarks (MVBench [23], VideoMME [13], LongVideoBench [28], MLVU [34]) and two video temporal grounding benchmarks (Charades-STA [15], ActivityNet-Captions [3], with refined annotations from TimeLens [31]). The latter directly tests whether compression preserves temporal evidence, which prior work rarely evaluates. On Qwen2.5-VL-7B [2], OTT-Vid consistently outperforms strong training-free baselines"},{"citing_arxiv_id":"2605.09904","ref_index":22,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T02:47:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"NeXT-Video established effective paradigms for video-language instruction tuning [27, 25, 21, 7, 57]. More recent multimodal systems further improve long-context video perception and general multimodal reasoning [36, 6, 39, 34, 2]. Meanwhile, video understanding benchmarks have expanded from short-video QA and activity understanding [47, 46, 52] to broad temporal reasoning, long-video comprehension, and shortcut-aware evaluation [22, 13, 11, 42, 8, 14]. These advances raise a finer-grained evaluation question: can Video-LLMs maintain object-level consistency? ∗junzhec@tju.edu.cn †xj.max.guo@gmail.com Preprint. arXiv:2605.09904v2 [cs.CV] 12 May 2026 Figure 1: Representative TOC-Bench QA examples. 
The benchmark supports multiple deterministic task formats, including four-way multiple choice, statement-pair judgment, event ordering, and"},{"citing_arxiv_id":"2605.07568","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-08T10:40:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"critical evidence explicitly shown in the video. Temporal Benchmarks. To cover diverse forms of temporal reasoning, we evaluate on AoTP P B, TVBench [14], and VITATECS (Direction) [27]. These benchmarks require models to reason about physical irreversibility, temporal ordering, motion direction or temporal localization. General Benchmarks. We use MVBench [24] and Video-MME [15] as temporal-neutral controls. These benchmarks cover general video understanding tasks, domains, and video durations, but with low sensitivity to temporal information. We therefore use them as general-purpose control benchmarks to assess whether the additional AoT supervision will disrupt general video understanding. 
6.3 Main Results"},{"citing_arxiv_id":"2605.06094","ref_index":22,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Visioncoach: Reinforcing grounded video reasoning via visual-perception prompting. arXiv preprint arXiv:2603.14659, 2026. [21] Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026. [22] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195-22206, 2024. 
[23] Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao,"},{"citing_arxiv_id":"2604.21921","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Context Unrolling in Omni Models","primary_cat":"cs.CV","submitted_at":"2026-04-23T17:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05015","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07062","ref_index":74,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Seed1.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-05-11T17:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"For long video 
understanding, it also attains strong results with a 128K token context (up to 640 frames). We recognize the importance of extended temporal understanding and plan future work focused on expanding this context window capacity to further enhance long-form video comprehension. Regarding streaming video understanding, we evaluate on OVBench [51], OVOBench [74], StreamBench [153], and the proactive sub-task of StreamingBench [76]. Seed1.5-VL achieves SOTA performance across all these benchmarks, indicating strong potential for real-time applications such as interactive video dialogue systems. In video reasoning (Video-MMMU [49], MMVU [175]), Seed1.5-VL scores 81.4 and 70.1, respectively, currently trailing top models such as Gemini 2."},{"citing_arxiv_id":"2504.06958","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-04-09T15:09:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21776","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Video-R1: Reinforcing Video Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2025-03-27T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing 
GPT-4o.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"but also enables the model to transfer reasoning skills learned from static images to dynamic video contexts. Combined with T-GRPO, this approach equips Video-R1 with stronger, more generalizable video reasoning capabilities. Our experiments show that Video-R1 achieves consistent and significant improvements across a suite of challenging video reasoning benchmarks, including VSI-Bench [38], VideoMMMU [13], MMVU [48], MVBench [20], TempCompass [27], and VideoMME [9]. Notably, Video-R1-7B attains 37.1% accuracy on VSI-Bench, a challenging video spatial reasoning benchmark, outperforming even proprietary models like GPT-4o [15]. These results suggest that with carefully designed algorithms and data pipelines, RL can indeed unlock complex temporal reasoning capabilities in MLLMs, similar"},{"citing_arxiv_id":"2502.04326","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2025-02-06T18:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13106","ref_index":143,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 
uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. [142] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195-22206, 2024. [143] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024. [144] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng."}],"limit":50,"offset":0}