pith. machine review for the scientific record.

arxiv: 2605.02834 · v2 · submitted 2026-05-04 · 💻 cs.CV · cs.LG

Recognition: 1 theorem link

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 18:31 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords action recognition · video dataset · vision language models · benchmark · domain specific actions · fine tuning · multimodal evaluation · VideoNet

The pith

Fine-tuning a 4B model on a new dataset of domain-specific video actions lets it surpass all open-weight 8B models on the VideoNet benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoNet, a benchmark of 1,000 domain-specific actions spanning 37 domains, to restore action recognition as an evaluation target for vision-language models. Evaluations show that current VLMs struggle in both the multiple-choice and the relaxed binary setting, and that in-context examples help them far less than they help humans. The authors also release a training collection of nearly 500,000 video question-answer pairs focused on these actions, and fine-tuning on it allows a smaller model to exceed the performance of larger open-weight counterparts. This work shifts attention back to domain-specific video understanding by supplying both test and training resources.

Core claim

VideoNet introduces a benchmark of 1,000 distinct actions from 37 domains on which closed models such as Gemini 3.1 Pro score 69.9% on multiple-choice questions while open models such as Qwen3-VL-8B score 45.0%. Even in the relaxed binary setting, where chance is 50%, Qwen reaches only 59.2%, and few-shot examples provide uneven gains. A newly collected training set of nearly 500k video QA pairs enables fine-tuning Molmo2-4B to surpass all open-weight 8B models.
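These numbers are plain accuracies under different chance baselines (1/number-of-options for multiple choice, 50% for the binary relaxation). The paper's scoring harness is not reproduced here; below is a minimal sketch of how such an evaluation loop could be scored, assuming a hypothetical JSONL record layout and a caller-supplied `predict` function (none of these names come from the paper).

```python
import json
from typing import Callable, Dict, List


def score_benchmark(jsonl_path: str,
                    predict: Callable[[str, str, List[str]], str]) -> Dict[str, float]:
    """Accuracy per evaluation setting (e.g. 'multiple_choice', 'binary').

    Each JSONL record is assumed (hypothetically) to look like:
      {"video": "clip_0001.mp4", "setting": "multiple_choice",
       "question": "...", "options": ["...", "..."], "answer": "..."}
    """
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)
            setting = ex["setting"]
            pred = predict(ex["video"], ex["question"], ex["options"])
            total[setting] = total.get(setting, 0) + 1
            correct[setting] = correct.get(setting, 0) + int(pred == ex["answer"])
    return {s: correct[s] / total[s] for s in total}
```

With this shape, chance accuracy is simply 1/len(options) per question, which is why a 59.2% binary score sits only modestly above the 50% floor the abstract cites.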

What carries the argument

The VideoNet benchmark for domain-specific action recognition paired with a large-scale training dataset of video question-answer pairs that supports evaluation and fine-tuning of vision-language models.

If this is right

  • Domain-specific actions require dedicated benchmarks beyond general video datasets to accurately measure VLM capabilities.
  • Few-shot prompting alone does not fully close the performance gap between VLMs and humans on nuanced action recognition.
  • Targeted fine-tuning on domain-specific video data can make smaller models competitive with or better than larger general-purpose models.
  • Action recognition remains a relevant task for advancing video understanding in multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar large-scale domain-specific datasets could be developed for other challenging video tasks like anomaly detection or long-term event tracking.
  • The benchmark might help identify particular weaknesses in how VLMs process temporal information across video frames.
  • Integration of this training data into existing VLM pipelines could lead to improved real-world applications in areas requiring specialized action knowledge.

Load-bearing premise

The selected videos and questions represent challenging domain-specific actions in a balanced way without biases in collection or annotation that would exaggerate model deficiencies.

What would settle it

A controlled experiment where models trained without the new dataset achieve comparable or higher accuracy on VideoNet than the fine-tuned 4B model would indicate the data does not provide unique benefits.
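One way to operationalize "comparable or higher accuracy" in such a controlled comparison is a paired bootstrap over per-question correctness for the two models on the same VideoNet items. This is an illustrative procedure, not one reported in the paper; the 0/1 correctness vectors are assumed inputs.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(correct_a: List[int], correct_b: List[int],
                          n_resamples: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """95% bootstrap interval for mean(accuracy_A - accuracy_B) on shared items.

    correct_a / correct_b are per-question 0/1 correctness vectors, aligned so
    that index i refers to the same benchmark question for both models.
    """
    assert len(correct_a) == len(correct_b) and correct_a
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]
```

If the interval for (baseline trained without the new data minus fine-tuned 4B) includes or exceeds zero, the data's unique benefit would indeed be in doubt.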

read the original abstract

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
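For the few-shot relaxation described in the abstract (k ∈ {1, 2, 3} in-context examples of the action), the prompt format is not specified there. The sketch below shows one way such a k-shot request might be assembled, using a generic chat-message schema with hypothetical field names rather than the paper's or any particular model API's format.

```python
from typing import Dict, List


def build_few_shot_messages(examples: List[Dict], query: Dict, k: int) -> List[Dict]:
    """Assemble a k-shot video QA prompt as a generic chat-message list.

    `examples` and `query` are assumed to carry 'video', 'question', and (for
    the examples) 'answer' keys; the message schema is illustrative only.
    """
    messages: List[Dict] = [{
        "role": "system",
        "content": "Answer with the letter of the correct option.",
    }]
    for ex in examples[:k]:  # k in {1, 2, 3} in the abstract's relaxation
        messages.append({"role": "user",
                         "content": [{"type": "video", "path": ex["video"]},
                                     {"type": "text", "text": ex["question"]}]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user",
                     "content": [{"type": "video", "path": query["video"]},
                                 {"type": "text", "text": query["question"]}]})
    return messages
```

The in-context examples are placed as prior conversation turns so that the final query turn matches the zero-shot format.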

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces VideoNet, a benchmark for domain-specific action recognition comprising 1,000 actions across 37 domains. It reports that current VLMs struggle on this benchmark, with closed models like Gemini 3.1 Pro achieving 69.9% accuracy in multiple-choice settings while open models like Qwen3-VL-8B reach only 45.0%; similar gaps persist in binary (Qwen at 59.2%) and few-shot (k=1,2,3) evaluations. The work also releases a large training set of nearly 500k video-question-answer pairs and demonstrates that fine-tuning Molmo2-4B on this data enables it to surpass all open-weight 8B models on VideoNet.

Significance. If the benchmark videos and training pairs are verifiably disjoint and free of selection or annotation artifacts, the work would provide a valuable large-scale resource (nearly 500k pairs) to advance VLM capabilities on challenging domain-specific actions, which current models handle poorly. The empirical performance gaps and human few-shot comparison (+13.6%) highlight limitations in in-context learning for action recognition. The scale of the training data stands out as a concrete contribution for future fine-tuning research.

major comments (2)
  1. [Abstract] The central claim that fine-tuning Molmo2-4B on the 500k pairs 'surpass[es] all open-weight 8B models' requires that the training data be disjoint from the VideoNet benchmark videos and questions. No details are supplied on video sourcing, deduplication, question generation, or explicit exclusion checks, leaving open the possibility of contamination that could artifactually inflate the reported gains.
  2. [Abstract] The reported performance numbers (e.g., 69.9% vs. 45.0% in multiple-choice, 59.2% in binary) are presented without accompanying information on the data collection process, quality controls, statistical testing, or potential confounds in action selection and annotation. This absence makes the gaps difficult to interpret as genuine capability differences rather than artifacts of benchmark construction.
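The first major comment asks for explicit exclusion checks between the ~500k training pairs and the benchmark. A minimal sketch of the easy lower bound of such an audit, exact-ID and exact content-hash overlap, follows; the identifiers and file layout are hypothetical, and a real audit would add near-duplicate detection for re-encoded or trimmed clips.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Set


def file_sha256(path: Path, chunk: int = 1 << 20) -> str:
    """Exact content hash of a video file (catches re-uploads under new IDs)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def overlap_report(train_ids: Set[str], bench_ids: Set[str],
                   train_files: Iterable[Path],
                   bench_files: Iterable[Path]) -> Dict[str, int]:
    """Count exact-ID and exact-content collisions between train and benchmark."""
    bench_hashes = {file_sha256(p) for p in bench_files}
    return {
        "shared_ids": len(train_ids & bench_ids),
        "shared_hashes": sum(file_sha256(p) in bench_hashes for p in train_files),
    }
```

Zero collisions under this check would not rule out contamination, but any nonzero count would directly support the referee's concern.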

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript introducing VideoNet. We appreciate the focus on data integrity and interpretability of results. We address each major comment below and commit to revisions that strengthen the presentation of our contributions.

read point-by-point responses
  1. Referee: [Abstract] The central claim that fine-tuning Molmo2-4B on the 500k pairs 'surpass[es] all open-weight 8B models' requires that the training data be disjoint from the VideoNet benchmark videos and questions. No details are supplied on video sourcing, deduplication, question generation, or explicit exclusion checks, leaving open the possibility of contamination that could artifactually inflate the reported gains.

    Authors: We agree that verifying disjointness between the training data and benchmark is essential to substantiate the fine-tuning results. The manuscript describes the training set as collected independently from the benchmark videos through distinct sourcing pipelines. To make this rigorous and transparent, we will add a new subsection detailing the video sourcing methodology, automated and manual deduplication procedures, question generation protocols, and explicit exclusion verification steps (including overlap metrics) that confirm no benchmark videos or questions were included in the 500k training pairs. These additions will directly support the central claim. revision: yes

  2. Referee: [Abstract] The reported performance numbers (e.g., 69.9% vs. 45.0% in multiple-choice, 59.2% in binary) are presented without accompanying information on the data collection process, quality controls, statistical testing, or potential confounds in action selection and annotation. This absence makes the gaps difficult to interpret as genuine capability differences rather than artifacts of benchmark construction.

    Authors: The full manuscript includes a benchmark construction section covering action selection across 37 domains, annotation protocols, and basic quality controls. However, we acknowledge that the abstract and certain result presentations lack sufficient detail on these aspects. We will revise to expand the data collection description, add explicit quality control metrics (e.g., inter-annotator agreement), include statistical significance tests for the reported performance gaps, and discuss potential confounds in action selection and annotation. This will allow readers to better assess whether the gaps reflect model limitations. revision: yes
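Response 2 commits to reporting inter-annotator agreement as a quality-control metric. Cohen's kappa is the standard statistic for two annotators assigning categorical labels to the same items; the sketch below is a generic implementation, not code from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate
    and p_e is the agreement expected by chance from each annotator's marginals.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
```

A companion significance test for the headline gaps (e.g., a two-proportion test on 69.9% vs. 45.0%) would also require the per-setting question counts, which the abstract does not give.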

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark and dataset with direct measurements.

full rationale

The paper introduces the VideoNet benchmark (1,000 actions, 37 domains) and a 500k video QA training set, then reports empirical fine-tuning results in which Molmo2-4B surpasses open 8B models. No derivations, equations, fitted parameters, or first-principles claims exist; all results are direct performance measurements on a held-out evaluation. There are no load-bearing self-citations, self-definitional constructs, or renamings of known results. Potential contamination is a validity issue but does not create circularity in any claimed derivation chain. The work is self-contained and measured against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on domain assumptions about action categorization and data representativeness rather than mathematical axioms or new entities.

axioms (1)
  • domain assumption: Actions can be reliably partitioned into 37 distinct domains and 1,000 unique categories that capture domain-specific challenges.
    This underpins the benchmark design and the claim that current VLMs struggle on these actions.

pith-pipeline@v0.9.0 · 5665 in / 1157 out tokens · 69650 ms · 2026-05-08T18:31:54.620894+00:00 · methodology

discussion (0)

