{"total":16,"items":[{"citing_arxiv_id":"2605.31529","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-29T16:43:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23045","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The TIME Machine: On The Power of Motion for Efficient Perception","primary_cat":"cs.CV","submitted_at":"2026-05-21T21:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22819","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cambrian-P: Pose-Grounded Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[17] 
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511.13719, 2025. [18] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018. [19] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. [20] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024."},{"citing_arxiv_id":"2606.00054","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data","primary_cat":"cs.RO","submitted_at":"2026-05-18T06:19:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":179,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models 
and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"InternData-M1 [159], InternData-A1 [160], QUARD-Auto [161] Human Data SSv2 [162], EPIC-KITCHENS [163], HowTo100M [164], Kinetics-700 [165], EGTEA Gaze+ [166] Ego4D [167], HOI4D [168], EgoVid-5M [169], COM Kitchens [170], Egocentric-10k [171], DreamDojo [35] Assembly101 [172], H2O [173], EgoPAT3D [174], Ego-Exo4D [175], ARCTIC [176], HoloAssist [177] HOT3D [178], TACO [179], Kaiwu [180], OAKINK2 [181], Nymeria [182], EgoMimic [183] PH2D [184], Humanoid Everyday [185], IndEgo [186], PLAICraft [187], HD-EPIC [188], UniHand [189] Ego-Centric Human Manipulation Dataset [190], Aria Everyday Activities [191], EgoDex [192] Evaluation World Model Visual Fidelity PSNR, SSIM [193], LPIPS [194], DreamSim [195], DINO [196], FVD [197]"},{"citing_arxiv_id":"2604.20157","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HumanScore: Benchmarking Human Motions in Generated Videos","primary_cat":"cs.CV","submitted_at":"2026-04-22T03:51:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"defined, and common motion categories. It is impossible to enumerate all human motions for benchmarking, since real-world movement is continuous and fluid rather than discrete. Instead, we aim to build a comprehensive and diverse reference pool that covers a wide range of challenging and representative actions. 
To this end, we adopt Kinetics-700 [10] as our initial pool, as it is widely used in motion research and spans a rich variety of human activities, including demanding sports and complex motions. However, its 700 categories contain notable semantic redundancies, such as 'golf driving' and 'golf chipping'. To obtain a concise and non-redundant set, we conducted a rigorous sifting process as follows."},{"citing_arxiv_id":"2604.04974","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","primary_cat":"cs.RO","submitted_at":"2026-04-04T15:37:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13684","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Recurrent Video Masked Autoencoders","primary_cat":"cs.CV","submitted_at":"2025-12-15T18:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04590","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents","primary_cat":"cs.CV","submitted_at":"2025-07-07T00:51:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLM2Vec-V2 is a multimodal embedding model trained 
on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09985","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","primary_cat":"cs.AI","submitted_at":"2025-06-11T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Springer, 2024b. Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. Advances in Neural Information Processing Systems, 28, 2015. Daniel M Wolpert and Zoubin Ghahramani. Computational principles of movement neuroscience. Nature neuroscience, 3(11):1212-1217, 2000. Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023a. Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. 
Daydreamer: World models for"},{"citing_arxiv_id":"2410.06158","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2024-10-08T16:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"with the VQGAN decoder. We highlight that GR-2 is pre-trained on a significantly larger volume of video data compared to previous works that utilize video pre-training. The pre-training data includes commonly used public datasets of human activities, e.g., Howto100M [8], Ego4D [9], Something-Something V2 [10], EPIC-KITCHENS [11], and Kinetics-700 [12]. To tailor the pre-training data for robot manipulation tasks, we carefully establish a data processing pipeline that includes hand filtering [13] and re-captioning [14]. In addition, we include publicly available robot datasets, e.g., RT-1 [15] and Bridge [16]. 
In total, the number of video clips used for pre-training is 38 million, equivalent to approximately 50 billion tokens."},{"citing_arxiv_id":"2312.14238","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"By flexibly combining the vision encoder and the language middleware, InternVL can support various vision-language tasks, including contrastive tasks, generative tasks, and multi-modal dialogue. characteristics stage 1 stage 2dataset language original cleaned remain cleaned remain LAION-en [120] English 2.3B 1.94B 84.3% 91M 4.0% LAION-COCO [121] 663M 550M 83.0% 550M 83.0% COYO [14] 747M 535M 71.6% 200M 26.8% CC12M [20] 12.4M 11.1M 89.5% 11.1M 89.5% CC3M [124] 3.0M 2.6M 86.7% 2.6M 86.7% SBU [112] 1.0M 1.0M 100% 1.0M 100% Wukong [55] Chinese 100M 69.4M 69.4% 69.4M 69.4% LAION-multi [120] Multi 2.2B 1.87B 85.0% 100M 4.5% Total Multi 6.03B 4.98B 82.6% 1.03B 17.0% Table 2. Details of the training data for InternVL in stage 1 and stage 2. 
Among them, LAION-en [120], LAION-multi [120],"},{"citing_arxiv_id":"2309.17257","ref_index":198,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Deep Learning Techniques for Action Anticipation","primary_cat":"cs.CV","submitted_at":"2023-09-29T14:07:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature survey reviewing deep learning approaches to action anticipation in everyday scenarios, with method classifications, dataset and metric summaries, and future directions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"a comprehensive analysis of the potential gains of multi-modal methods and facilitating a fair comparison across different approaches. When examining the validation set, RAFTformer and MeMViT show the best overall performance with a single RGB modality. Specifically, RAFTformer, equipped with the MViT-B backbone and initialized with the Kinetics-700 [198] (K700) dataset, exhibits superior performance. Several modalities are typically employed for egocentric vision following [98], including RGB, object presence (O), and optical flow (M). Some methods extend this set of modalities to include additional features like interacting hand-object bounding boxes (BB) [5] and audio (Au) [105]. 
Multi-modal incorporation proves to be highly effective for"},{"citing_arxiv_id":"2303.15389","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","primary_cat":"cs.CV","submitted_at":"2023-03-27T17:02:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"size and ∼1/5 image-text pairs, achieves a 1.2-point averaged improvement over OpenCLIP-H/14. For video classification, we sample only a single center frame in each video, making it an image classification task. Following the conventional settings, we report the top-1 accuracy for UCF-101 [47] and the mean of top-1 and top-5 accuracy for Kinetics-400 [9], Kinetics-600 [7] and Kinetics-700 [8]. In Table 3 we show that EVA-CLIP is also quite effective in zero-shot video recognition benchmarks. Table 4 reports the zero-shot image and text retrieval results on Flickr30K [53] and COCO [34]. EVA-CLIP outperforms all the competitors at the base and large model size. 
While the zero-shot retrieval performance of EVA-02-CLIP-E/14 is slightly lower than OpenCLIP-G/14, the"},{"citing_arxiv_id":"2212.03191","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVideo: General Video Foundation Models via Generative and Discriminative Learning","primary_cat":"cs.CV","submitted_at":"2022-12-06T18:09:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01917","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoCa: Contrastive Captioners are Image-Text Foundation Models","primary_cat":"cs.CV","submitted_at":"2022-05-04T07:01:14+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}