{"total":13,"items":[{"citing_arxiv_id":"2606.00640","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Attribute-Based Measure of Video Complexity","primary_cat":"cs.CV","submitted_at":"2026-05-30T09:30:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00444","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Video Understanding via Compact Latent Multi-Agent Collaboration","primary_cat":"cs.CV","submitted_at":"2026-05-01T06:24:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Formally, each agent Am encodes its local observa- tion X (m) and query q into a fixed-capacity communication token sequence: c(m) =A m(X (m), q)∈R K×d ,(4) where K is the number of communication tokens per agent anddis the embedding dimension. In the star-topology, the coordinator A0 aggregates tokens from all agents to produce the final prediction: ˆy=A0(q,c (1), ...,c (M) ).(5) Communication is constrained by the coordinator's finite context capacity, modeled as a communication budget mea- sured in tokens: M×K+ token(q)≤B com,(6) wheretoken(·)denotes number of tokens. 3.3. Optimization of MACF Since all local agents AM m=1 share the same input and output formats and model architectures, we tie their parameters and learn a single set of shared weights during training."},{"citing_arxiv_id":"2604.11283","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey","primary_cat":"cs.CV","submitted_at":"2026-04-13T10:42:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"VaQuitA [11], Vamos [12], COSMO [13], IVA [14], MMICT [15], LXMERT [16], EVE [17], ChatBridge [18], LLaMA-Adapter [19], BT-Adapter [20], Video-LLaMA 3 [21], RED-VILLM [22], InternVideo2 [23], Otter [24], VLog [25], TimeBlindness [26], Time-R1 [27] Spatio-Temporal Modeling MA-LMM [28], MovieLLM [29] , MovieChat [30], LongVLM [31], VideoStreaming [32], VideoLLM [33], VideoLLM-online [34], Vriptor [35], LLoVi [36], TimeChat [37], Momentor [38], LITA [39], SeViLA [40], VTG-LLM [41], VTimeLLM [42], HawkEye [43], Chat-UniVi [44], VideoGPT+ [45], ST-LLM [46], Slot-VLM [47], LSTP [48], OmniViD [49],Vid2Seq [50], DrVideo [51], ViLAMP [52], AKS [53], MCiT [54] Training Paradigms Video-LLaMA 2 [55], LLaMA-Adapter [19],"},{"citing_arxiv_id":"2605.02912","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-07T20:15:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.10016","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training-Free Multimodal Large Language Model Orchestration","primary_cat":"cs.CL","submitted_at":"2025-08-06T16:17:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.14702","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark","primary_cat":"cs.AI","submitted_at":"2024-10-06T20:35:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.07895","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models","primary_cat":"cs.CV","submitted_at":"2024-07-10T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"narios remain largely less explored. This oversight is sig- nificant given that many real-world applications demand multi-image capabilities, such as comprehensive multi- image analyses. Traditionally, researchers have approached these challenges by training separate task-specific mod- els for each application scenario, e.g., multi-image [1, 19, 27], video [7, 29, 67], and 3D [14, 15, 58]. This is both labor-intensive and time-consuming, resulting in frag- mented methodologies that are inefficient and often unscal- able. Considering the diverse range of computer vision set- tings and data formats, there is a pressing need to develop a general framework for LMMs that can operate effectively across these varied contexts."},{"citing_arxiv_id":"2404.16994","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","primary_cat":"cs.CV","submitted_at":"2024-04-25T19:29:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.16821","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites","primary_cat":"cs.CV","submitted_at":"2024-04-25T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[11] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024. 2, 3, 4, 6 [12] Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511-1520, 2022. 5 [13] Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023. 3 [14] Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Juny- ing Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jian- quan Li, Xiang Wan, and Benyou Wang."},{"citing_arxiv_id":"2403.14624","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","primary_cat":"cs.CV","submitted_at":"2024-03-21T17:59:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.13289","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SALMONN: Towards Generic Hearing Abilities for Large Language Models","primary_cat":"cs.SD","submitted_at":"2023-10-20T05:41:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.17257","ref_index":203,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Deep Learning Techniques for Action Anticipation","primary_cat":"cs.CV","submitted_at":"2023-09-29T14:07:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature survey reviewing deep learning approaches to action anticipation in everyday scenarios, with method classifications, dataset and metric summaries, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In this context, transferring the knowledge harnessed by large language models (LLMs) like ChatGPT [177] to action anticipation is a fascinating avenue to explore, as LLMs are trained on extensive text corpora which encapsulate such data, and demonstrate remarkable zero-shot capability in effectively tackling multiple NLP or vision-centric tasks [119], [203]. Personalization: Current approaches are subject-agnostic, treating all individuals as having uniform preferences and behaviors. However, it is evident that each individual possesses unique tendencies, preferences, and patterns of behavior. Adapting anticipation models to individuals is another promising research direction. In particular, if the personal context, e."},{"citing_arxiv_id":"2307.06942","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2023-07-13T17:58:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}