{"total":16,"items":[{"citing_arxiv_id":"2605.18018","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12034","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","primary_cat":"cs.MM","submitted_at":"2026-05-12T12:16:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. [41] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024. URL https://arxiv. org/abs/2311.16502. [42] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. 
Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2024. URL https://arxiv.org/abs/2409.02813. [43] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel"},{"citing_arxiv_id":"2604.17749","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos","primary_cat":"cs.CV","submitted_at":"2026-04-20T03:07:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Mobius: Text to seamless looping video generation via latent shift. In SIGGRAPH, 2025. 3 [4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. CoRR, abs/2311.15127, 2023. 2 [5] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. CoRR, abs/2406.04325, 2024. 
3 [6] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu."},{"citing_arxiv_id":"2501.13106","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Gemma [86], InfoVQA [87], PlotQA [88] 1.00M OCR MultiUI [89], in-house data 0.83M Grounding RefCoco [90], VCR [91], in-house data 0.50M Multi-Image Demon-Full [92], Contrastive_Caption [93] 0.41M Text-only Magpie [94], Magpie-Pro [94], Synthia [95], Infinity-Instruct-subjective [82], Numina- Math [96] 2.21M Video & Text Data General LLaVA-Video-178K [25], ShareGPT4o-Video [28], FineVideo [97], CinePile [98], ShareGemini-k400 [99], ShareGemini-WebVID [99], VCG-Human [22], VCG-Plus [22], VideoLLaMA2 in-house data, Temporal Grounding in-house data 2.92M In this stage, we perform instruction tuning with instruction-following data to refine the model's ability to interpret and follow natural language instructions. 
This data mixture is designed to cover a wide range of"},{"citing_arxiv_id":"2501.05067","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-09T08:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.17574","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks","primary_cat":"cs.CV","submitted_at":"2024-12-23T13:45:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05271","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 
3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"ALLaV A [25], SVIT [309], Cambrain-GPT4o [234], TextOCR-GPT4V [102], MMDU [159],Conversation Synthetic Real-World Conversations PMC-VQA [303], VQA-RAD [120], ImageCLEF [72], SLAKE [145], Medical-Diff-VQA [94],Medical PMC-CaseReport [260], GMAI-VL (subset) [134] GUI Screen2Words [240], WebSight [122] Type: Video Datasets Captioning Mementos [254], ShareGPT4Video [30], VideoGPT+ [174], ShareGPT4o-Video [35] General QA VideoChat2-IT [131], EgoTaskQA [99], NTU RGB+D [152], CLEVRER [276], STAR [259], LSMDC [201] Table 4:Summary of the pre-training data mixture of InternVL 2.5.Notably, we exclusively use conversaiton- format instruction data, and at this stage, only the MLP or both MLP and ViT parameters are trainable, allowing"},{"citing_arxiv_id":"2412.03603","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","primary_cat":"cs.CV","submitted_at":"2024-12-03T23:52:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HunyuanVideo presents a 13B-parameter open-source video generative model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models including Runway Gen-3 and Luma 1.6.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.17247","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy 
Reduction","primary_cat":"cs.CV","submitted_at":"2024-10-22T17:59:53+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.02713","ref_index":160,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","primary_cat":"cs.CV","submitted_at":"2024-10-03T17:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.16500","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CogVLM2: Visual Language Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2024-08-29T12:59:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.10188","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongVILA: Scaling Long-Context Visual Language Models for Long 
Videos","primary_cat":"cs.CV","submitted_at":"2024-08-19T17:48:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.04840","ref_index":203,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2024-08-09T03:25:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.03326","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-OneVision: Easy Visual Task Transfer","primary_cat":"cs.CV","submitted_at":"2024-08-06T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models? arXiv preprint arXiv:2403.20330, 2024. 
10 [20] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 5 [21] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 38, 40 [22] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong,"},{"citing_arxiv_id":"2407.03320","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","primary_cat":"cs.CV","submitted_at":"2024-07-03T17:59:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"ages specially designed Chain-of-Thought (CoT) [153] and Direct Preference Optimization (DPO) [124] techniques to significantly enhance the quality of its written content. We evaluated the versatility of InternLM-XComposer-2.5 (IXC-2.5) across a range of twenty-eight benchmarks, including five video benchmarks [38, 42, 71, 88, 181], nine structural high-resolution benchmarks [20, 89, 106-108, 117, 133, 139, 140], twelve general VQA benchmarks [18, 40, 44, 61, 66, 87, 100, 155, 164, 166], one multi-true multi-image benchmark [92], and one webpage crafting benchmark [131]. 
Compared to previous open- source LVLMs, IXC-2.5 achieved state-of-the-art results in 16 out of 28 benchmarks based on InternLM2-7B [143] backend. As shown in Figure 1, the performance of IXC-"},{"citing_arxiv_id":"2406.04264","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLVU: Benchmarking Multi-task Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2024-06-06T17:09:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"VideoChat [25] 2023-05 16 frm 26.4 12.8 2.15 18.3 17.0 22.0 4.90 15.7 11.7 17.7 3.53 Video-LLaMA-2 [59] 2024-08 16 frm 52.7 12.8 2.23 13.3 17.0 12.0 4.87 15.7 8.3 18.8 3.55 VideoChat2-HD [26] 2024-06 16 frm 74.7 43.6 2.83 35.0 34.0 30.0 5.14 21.4 23.3 37.4 3.99 Video-LLaV A [28] 2023-11 8 frm 70.3 38.5 20.9 2.30 26.4 26.0 5.06 20.0 21.7 29.3 3.68 ShareGPT4Video [7] 2024-05 16 frm 73.6 25.6 2.53 31.7 45.3 38.0 4.72 17.1 8.3 34.2 3.63 VideoLLaMA2 [9] 2024-06 16 frm 80.2 53.8 2.80 36.7 54.7 54.0 5.09 42.9 16.7 48.4 3.95 Long Video MLLMs MovieChat [41] 2023-07 2048 frm 18.7 10.3 2.30 23.3 15.1 16.0 3.24 17.1 15.0 16.5 2.77 Movie-LLM [42] 2024-03 1 fps 27.5 25.6 2.10 10.0 11.3 16.0 4.93 20.0 21.7 18.9 3.52 LLaMA-VID [27] 2023-11 1 fps 20."}],"limit":50,"offset":0}