{"total":12,"items":[{"citing_arxiv_id":"2605.30673","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-29T00:06:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TeachObs is a new human-validated benchmark dataset and evaluation protocol for multimodal AI on classroom teaching observation, showing no model dominates across tracks and that models over-rate procedurally clear lessons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.18265","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","submitted_at":"2025-08-25T17:58:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"70 - - - Qwen2.5-VL-72B [5] 73.3 / 79.1 70.4 2.02 74.6 60.7 70.9 InternVL3-78B [187] 72.7 / 75.7 78.7 1.81 79.5 65.7 72.1 InternVL3.5-241B-A28B 72.9 / 76.0 76.5 1.74 78.2 67.1 71.4 Table 9: Comparison of video understanding performance. We evaluate InternVL3.5's video understand- ing capabilities across 5 benchmarks. For Video-MME [ 35], MMBench-Video [ 33], MLVU [ 186], and LongVideoBench [149], we test with four different settings: 16, 32, 48, and 64 frames, and report the maximum results. 
For MVBench [63], we conduct testing using 16 frames. When calculating Overall, the score of MMBench-Video is normalized from 0-3 to 0-100. 3.10 Video Understanding InternVL3.5 demonstrates remarkable video understanding capabilities across a comprehensive set of bench-"},{"citing_arxiv_id":"2504.10479","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:59:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"MMMB, Multilingual MMBench, and MTVQA underscores the promise of our approach in advancing global multimodal applications. 3.10 Video Understanding Video understanding is essential for evaluating how well MLLMs capture temporal and multimodal cues in complex video content. In this work, we assess the InternVL3 series on six established benchmarks: Video-MME [38], MVBench [65], MMBench-Video [35], MLVU [154], LongVideoBench [129], and CG-Bench [2], as detailed in Table 8. Overall, the InternVL3 models demonstrate clear performance improvements and a strong scalability trend over their predecessors. As the model capacity increases, the performance gains become more pronounced. 
For instance, InternVL3-2B records higher Video-MME scores (58.9/61."},{"citing_arxiv_id":"2502.13923","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen2.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-02-19T18:00:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.04326","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2025-02-06T18:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13826","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos","primary_cat":"cs.CV","submitted_at":"2025-01-23T16:51:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge 
acquisition.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"standing tasks, including action understanding [14, 22, 28, 38, 39, 44], temporal reasoning [3, 18, 20, 31, 34, 37, 42], and video captioning [4, 35, 40, 41, 53]. Several benchmarks enhance scene interpretation by incorporating external knowledge, including KnowIT-VQA [10] and WorldQA [50]. Recent benchmarks like Video-MME [9], MMBench-Video [7], and MLVU [52] have expanded the scope to assess multi-tasking and multi-domain video understanding. While these benchmarks recognize videos as visual scenes for interpretation, Video-MMMU uniquely recognizes video as an educational medium, emphasizing knowledge-driven question-answering on videos. 2.2. Knowledge-driven Benchmarks As AI systems progress toward Expert AGI [24], knowledge-"},{"citing_arxiv_id":"2501.12386","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling","primary_cat":"cs.CV","submitted_at":"2025-01-21T18:59:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities like object","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.04001","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and 
Videos","primary_cat":"cs.CV","submitted_at":"2025-01-07T18:58:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"is more common to report perception and cognition separately, we report the sum as in their original paper if the individual scores are missing. Method Image Segmentation Video Segmentation Image Chat Video Chat GCG RefCOCO [42] RefCOCO+ [42] RefCOCOg [105]MeViS [18] Ref-DA VIS17 [44] Ref-YTVOS [79] ReVOS [100]MME [25] MMBench [66] SEED-Bench [48]Video-MME [26] MMBench-Video [23]GCG [76] LLAVA-1.5-13B [62]- - - - - - - 1531(+) 68.8 70.1 - - -Video-LLaVA-7B [58]- - - - - - - - 60.9 - 39.9 1.03 -LLaMA-VID-7B [56]- - - - - - - 1521(+) 65.1 59.9 - 1.08 -mPLUG-Owl3-8B [104]- - - - - - - - 77.6 - 53.5 1.35 -InternVL2-8B [14]- - - - - - - - 81.776.2 54.01.28 -PixelLM-7B [78]73.0 66.3 69.3 - - - - 309/135 17.4 - - - -LaSagnA [86]76."},{"citing_arxiv_id":"2412.17574","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks","primary_cat":"cs.CV","submitted_at":"2024-12-23T13:45:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal 
alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We report results for both \"with subtitle\" and \"without subtitle\" settings. MVBench [131]: MVBench is a video understanding benchmark designed to comprehensively evaluate the temporal awareness of MLLMs in the open world. It covers 20 challenging video tasks, ranging from perception to cognition, which cannot be effectively solved using a single frame. We test this benchmark using 16 frames. MMBench-Video [65]: MMBench-Video is a quantitative benchmark for evaluating MLLMs' video understanding and temporal reasoning skills, covering diverse domains, multi-shot long videos, and features like hallucination, commonsense reasoning, and temporal reasoning. 
For this benchmark, we test with four different settings: 16, 32, 48, and 64 frames, and report the maximum scores."},{"citing_arxiv_id":"2407.03320","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"benchmark for human activity understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2, 8 [38] Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. MMBench-Video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515, 2024. 2, 9 [39] Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding, 2024. 2 [40] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. 
Mme: A comprehensive evaluation benchmark"},{"citing_arxiv_id":"2406.07476","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2024-06-11T17:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}