VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Pith reviewed 2026-05-08 18:31 UTC · model grok-4.3
The pith
Fine-tuning a 4B model on a new dataset of domain-specific video actions surpasses all open-weight 8B models on the VideoNet benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoNet introduces a benchmark of 1,000 distinct actions from 37 domains on which models like Gemini 3.1 Pro score 69.9% on multiple-choice questions while Qwen3-VL-8B scores 45.0%. Even in a relaxed binary setting, where random chance is 50%, Qwen reaches only 59.2%, and few-shot examples yield uneven gains across models. A collected training set of nearly 500k video QA pairs enables fine-tuning Molmo2-4B to surpass all open-weight 8B models.
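One way to read the 59.2% binary number: it is far above chance statistically yet low in absolute terms. A quick sketch of the chance baseline, using a hypothetical n = 1000 questions (the review does not state the benchmark's actual size, so the p-value is illustrative only):

```python
import math

def binom_sf(k: int, n: int, p: float = 0.5) -> float:
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 1000                # hypothetical question count, not from the paper
k = round(0.592 * n)    # 592 correct answers at Qwen's 59.2% binary accuracy
p_value = binom_sf(k, n)
print(f"P(>= {k}/{n} correct by guessing) = {p_value:.1e}")
```

At that hypothetical scale the result is clearly not noise around the 50% floor; the point of the binary relaxation is rather that even a statistically solid nine-point margin over guessing is a weak absolute showing.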
What carries the argument
The VideoNet benchmark for domain-specific action recognition paired with a large-scale training dataset of video question-answer pairs that supports evaluation and fine-tuning of vision-language models.
If this is right
- Domain-specific actions require dedicated benchmarks beyond general video datasets to accurately measure VLM capabilities.
- Few-shot prompting alone does not fully close the performance gap between VLMs and humans on nuanced action recognition.
- Targeted fine-tuning on domain-specific video data can make smaller models competitive with or better than larger general-purpose models.
- Action recognition remains a relevant task for advancing video understanding in multimodal systems.
Where Pith is reading between the lines
- Similar large-scale domain-specific datasets could be developed for other challenging video tasks like anomaly detection or long-term event tracking.
- The benchmark might help identify particular weaknesses in how VLMs process temporal information across video frames.
- Integration of this training data into existing VLM pipelines could lead to improved real-world applications in areas requiring specialized action knowledge.
Load-bearing premise
The selected videos and questions represent challenging domain-specific actions in a balanced way without biases in collection or annotation that would exaggerate model deficiencies.
What would settle it
A controlled experiment where models trained without the new dataset achieve comparable or higher accuracy on VideoNet than the fine-tuned 4B model would indicate the data does not provide unique benefits.
Original abstract
Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoNet, a benchmark for domain-specific action recognition comprising 1,000 actions across 37 domains. It reports that current VLMs struggle on this benchmark, with closed models like Gemini 3.1 Pro achieving 69.9% accuracy in multiple-choice settings while open models like Qwen3-VL-8B reach only 45.0%; similar gaps persist in binary (Qwen at 59.2%) and few-shot (k=1,2,3) evaluations. The work also releases a large training set of nearly 500k video-question-answer pairs and demonstrates that fine-tuning Molmo2-4B on this data enables it to surpass all open-weight 8B models on VideoNet.
Significance. If the benchmark videos and training pairs are verifiably disjoint and free of selection or annotation artifacts, the work would provide a valuable large-scale resource (nearly 500k pairs) to advance VLM capabilities on challenging domain-specific actions, which current models handle poorly. The empirical performance gaps and human few-shot comparison (+13.6%) highlight limitations in in-context learning for action recognition. The scale of the training data stands out as a concrete contribution for future fine-tuning research.
Major comments (2)
- Abstract: The central claim that fine-tuning Molmo2-4B on the 500k pairs 'surpass[es] all open-weight 8B models' requires that the training data be disjoint from the VideoNet benchmark videos and questions. No details are supplied on video sourcing, deduplication, question generation, or explicit exclusion checks, leaving open the possibility of contamination that could artifactually inflate the reported gains.
- Abstract: The reported performance numbers (e.g., 69.9% vs. 45.0% in multiple-choice, 59.2% in binary) are presented without accompanying information on data collection process, quality controls, statistical testing, or potential confounds in action selection and annotation. This absence makes the gaps difficult to interpret as genuine capability differences rather than artifacts of benchmark construction.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript introducing VideoNet. We appreciate the focus on data integrity and interpretability of results. We address each major comment below and commit to revisions that strengthen the presentation of our contributions.
Point-by-point responses
Referee: Abstract: The central claim that fine-tuning Molmo2-4B on the 500k pairs 'surpass[es] all open-weight 8B models' requires that the training data be disjoint from the VideoNet benchmark videos and questions. No details are supplied on video sourcing, deduplication, question generation, or explicit exclusion checks, leaving open the possibility of contamination that could artifactually inflate the reported gains.
Authors: We agree that verifying disjointness between the training data and benchmark is essential to substantiate the fine-tuning results. The manuscript describes the training set as collected independently from the benchmark videos through distinct sourcing pipelines. To make this rigorous and transparent, we will add a new subsection detailing the video sourcing methodology, automated and manual deduplication procedures, question generation protocols, and explicit exclusion verification steps (including overlap metrics) that confirm no benchmark videos or questions were included in the 500k training pairs. These additions will directly support the central claim. Revision: yes.
Referee: Abstract: The reported performance numbers (e.g., 69.9% vs. 45.0% in multiple-choice, 59.2% in binary) are presented without accompanying information on data collection process, quality controls, statistical testing, or potential confounds in action selection and annotation. This absence makes the gaps difficult to interpret as genuine capability differences rather than artifacts of benchmark construction.
Authors: The full manuscript includes a benchmark construction section covering action selection across 37 domains, annotation protocols, and basic quality controls. However, we acknowledge that the abstract and certain result presentations lack sufficient detail on these aspects. We will revise to expand the data collection description, add explicit quality control metrics (e.g., inter-annotator agreement), include statistical significance tests for the reported performance gaps, and discuss potential confounds in action selection and annotation. This will allow readers to better assess whether the gaps reflect model limitations. Revision: yes.
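The two additions promised here are standard measurements: Cohen's kappa for inter-annotator agreement and a two-proportion z-test for the accuracy gap. A minimal sketch, with a hypothetical n = 1000 questions per model since the review does not give the benchmark's question count:

```python
import math

def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    return (po - pe) / (1 - pe)

def two_prop_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for the gap between two accuracies, pooled variance."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical n = 1000 questions per model; the review omits the true count.
z = two_prop_z(0.699, 1000, 0.450, 1000)
print(f"Gemini vs. Qwen multiple-choice gap: z = {z:.1f}")  # far beyond 1.96
```

At any plausible benchmark size a 24.9-point gap is significant; the more informative revision would be per-domain confidence intervals, where sample sizes shrink.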
Circularity Check
No circularity; purely empirical benchmark and dataset with direct measurements.
Full rationale
The paper introduces the VideoNet benchmark (1,000 actions, 37 domains) and a 500k-pair video QA training set, then reports empirical fine-tuning results in which Molmo2-4B surpasses open-weight 8B models. No derivations, equations, fitted parameters, or first-principles claims are involved; all results are direct performance measurements on a held-out evaluation. There are no load-bearing self-citations, self-definitional constructs, or renamings of known results. Potential contamination is a validity issue but does not create circularity in any claimed derivation chain. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Actions can be reliably partitioned into 37 distinct domains and 1,000 unique categories that capture domain-specific challenges.