{"total":14,"items":[{"citing_arxiv_id":"2605.28023","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning","primary_cat":"cs.CV","submitted_at":"2026-05-27T06:27:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VCap pairs reference captions as witnesses with visual signals as adjudicators to deliver hypergeometric-precision rewards for RL in visual captioning, enabling an 8B model to outperform SOTA on benchmarks and improve weak-to-strong generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14569","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-14T08:39:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on 
benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Huang, and Xiaomeng Li. NEURONS: Emulating the human visual cortex improves fidelity and interpretability in fmri-to-video reconstruction. arXiv preprint arXiv:2503.11167, 2025. 2, 5 [99] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 3 [100] Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024. 5 [101] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution."},{"citing_arxiv_id":"2604.21718","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Building a Precise Video Language with Human-AI Oversight","primary_cat":"cs.CV","submitted_at":"2026-04-22T09:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"tion built with professional video creators. See Figure 2 for an overview. Issue: lack of specification. Without clear instruction, annotators may not know what to describe or how much detail to include. To examine this, we manually evaluate eight widely used video-text datasets: MSR-VTT [72], ActivityNet [26], ShareGPT4Video [12], UltraVideo [73], VDC [7], Dream1K [61], PerceptionLM (PE-Video) [15], and TUNA-Bench [25]. 
We find that most datasets do not provide a detailed policy for annotators (the only exception being [25], whose guideline is not public). We observe three major issues caused by the lack of specification and provide detailed error examples in Appendix A: • (1) Imprecise terminology. Without clear guidelines,"},{"citing_arxiv_id":"2604.02891","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Progressive Video Condensation with MLLM Agent for Long-form Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-03T09:00:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"key frames, discarding fine-grained visual cues. In contrast, our ProVCA (c) progressively condenses videos by using an MLLM to narrow down query-relevant content from coarse segments to fine frames, then feeds the selected key frames into the MLLM to generate the final answer. cues essential for answering detailed queries. Another line of work employs LLM-based agents [6]-[9] (Fig. 1 (b)). instead pretrains a video based MLLM on large-scale multimodal datasets. Although effective, this paradigm requires extensive multimodal data and substantial computation, and often relies on many input frames to cover complex content, leading to high memory cost, redundancy, and low reasoning efficiency. 
Recent progress in MLLMs for downstream tasks [10]-[12]"},{"citing_arxiv_id":"2603.03944","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCP: Spatial Causal Prediction in Video","primary_cat":"cs.CV","submitted_at":"2026-03-04T11:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.00181","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-01-30T04:45:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13511","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adapting MLLMs for Nuanced Video Retrieval","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.20715","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding","primary_cat":"cs.CV","submitted_at":"2025-05-27T04:50:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MUSEG applies timestamp-aware multi-segment grounding with a phased-reward RL recipe to boost temporal grounding and time-sensitive video QA performance in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07062","ref_index":140,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-05-11T17:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[138] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in neural information processing systems, 32, 2019. [139] Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models, 2024. URL https://arxiv.org/abs/2407.00634. [140] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095-95169, 2024. 
[141] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang,"},{"citing_arxiv_id":"2504.06958","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-04-09T15:09:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.00131","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Open-Sora Plan: Open-Source Large Video Generation Model","primary_cat":"cs.CV","submitted_at":"2024-11-28T14:07:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and weights.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.02713","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","primary_cat":"cs.CV","submitted_at":"2024-10-03T17:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong 
results on video understanding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.07476","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2024-06-11T17:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}