{"total":13,"items":[{"citing_arxiv_id":"2606.02569","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaCodec: A Predictive Visual Code for Video MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:56:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21988","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:38:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21625","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:36:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly 
videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13080","ref_index":155,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to See What You Need: Gaze Attention for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:54:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08560","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZAYA1-VL-8B Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-08T23:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611, 2026. [63] Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. Perceptionlm: Open-access data and models for detailed visual understanding. arXiv preprint arXiv:2504.13180, 2025. 
[64] Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal training."},{"citing_arxiv_id":"2604.24317","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Don't Pause! Every prediction matters in a streaming video","primary_cat":"cs.CV","submitted_at":"2026-04-27T11:07:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"+ KVflush 1fps✓3B 6.2/7.5 3.5/4.8 37.8/5.3 2.6/3.00.0/0.0 0.0/0.0 8.4/3.4 + StreamingVLM 1fps✓3B 1.0/1.7 4.9/6.6 31.0/4.0 1.0/2.2 0.0/0.0 0.0/0.0 6.3/2.4 AsynKV1fps✗7B 40.2/22.636.8/19.530.8/18.01.2/1.0 0.0/0.0 0.0/0.0 18.2/10.2 to approximate the current perception ceiling. We also evaluate leading open-source MLLMs: VideoLLaMA3 [75], InternVL-3.5 [58], PerceptionLM [14], Qwen2.5-VL [8], and Qwen3-VL [7]. 5.2 Evaluation Metrics We evaluate all streaming models using both T-score and T-F1 at K=5 (T-F1@5); see Sec. B.2 for a discussion on K. Offline MLLMs are evaluated using T-score@1 only. 
Since our simulated first-hit protocol does not run them over the full video, FPs are artificially suppressed, rendering T-F1 uninformative."},{"citing_arxiv_id":"2604.21718","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Building a Precise Video Language with Human-AI Oversight","primary_cat":"cs.CV","submitted_at":"2026-04-22T09:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"idea to video captioning by implementing an oversight framework where models first generate high-recall pre-captions, and trained human experts focus on critiquing rather than writing from scratch, guiding models to produce improved post-captions. Our user study in Appendix A shows that this approach improves annotation accuracy, writing quality, and efficiency over prior work [15, 81] that rely on manual caption editing, presumably because shifting limited human attention from text generation to verification allows for more effective use of cognitive resources. 
(3) Post-training strategies (Figure 4). Popular post-training methods such as DPO [51] and GRPO [22] rely on preference-based supervision, comparing candidate outputs ranked by humans or reward models."},{"citing_arxiv_id":"2604.12896","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:45:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08762","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InstrAct: Towards Action-Centric Understanding in Instructional Videos","primary_cat":"cs.CV","submitted_at":"2026-04-09T20:51:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.10611","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and 
Grounding","primary_cat":"cs.CV","submitted_at":"2026-01-15T17:27:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"object or event changes over time) by prompting them with a set of predefined questions. To add any missing low-level details, we use Molmo to generate frame-level captions and an LLM to merge the clip and frame captions into a single long caption. This produces the densest video caption dataset to date, averaging 924 words per video, compared to 75 words in Video Localized Narratives [141], 89 and 100 in RCap and RDCap [22], 280 in ShareGPT4-Video [19], and 547 in LLaVA-Video-178K [184]. Molmo2-AskModelAnything (human).We collect 140k human-authored video QA pairs. Using video captions, we cluster videos into 31 categories and sample them evenly to promote data diversity. Annotators then write specific, fine-grained questions (e.g. 
about text, actions, or temporal relations), while we discourage"},{"citing_arxiv_id":"2511.16719","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SAM 3: Segment Anything with Concepts","primary_cat":"cs.CV","submitted_at":"2025-11-20T18:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09985","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","primary_cat":"cs.AI","submitted_at":"2025-06-11T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"sequence (s_k)_{k∈[16]}, where s_k is a real-valued 7D vector defined relative to the base of the robot. The first three dimensions of s_k encode the cartesian position of the end-effector, the next three dimensions encode its orientation in the form of extrinsic Euler angles, and the last dimension encodes the gripper state. We construct a sequence of actions (a_k)_{k∈[15]} by computing the change in end-effector state between adjacent frames. Specifically, each action a_k is a real-valued 7-dimensional vector representing the change in end-effector state between frames k and k + 1. 
We apply random-resize-crop augmentations to the sampled video clips with the aspect-ratio sampled in the range (0.75, 1.35). Loss function."},{"citing_arxiv_id":"2504.13181","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Perception Encoder: The best visual embeddings are not at the output of the network","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"1, right) to extract these strong, general features. First, in §4, we investigate the most effective technique to align features to the end of the network by adapting to a large language model. This language alignment enables us to construct PElangG, which individually outperforms all other popular vision encoders for MLLM tasks. Moreover, when paired with our Perception Language Model (PLM) [21], the combination rivals the latest state-of-the-art MLLMs, like InternVL3 [168]. Second, in §5, we identify a dichotomy in the layers optimal for spatial tasks. By visualizing the features and pinpointing the explicit reason for this dichotomy, we develop a straightforward spatial alignment approach: distilling from the model's own frozen features to achieve most of the alignment, complemented by a novel use"}],"limit":50,"offset":0}