pith. sign in

arxiv: 2407.08101 · v5 · submitted 2024-07-11 · 💻 cs.CV

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Pith reviewed 2026-05-23 22:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords situated interactionvision-language modelsasynchronous interactionfitness coachingreal-time feedbackstreaming baselinebenchmark dataset
0
0 comments X

The pith

A simple streaming baseline can deliver timely feedback in live fitness coaching while state-of-the-art vision-language models cannot handle asynchronous responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models handle prompted turn-based exchanges well but lack mechanisms for watching an unfolding situation and speaking up at the right moment without a user prompt. The authors introduce the QEVD benchmark and dataset built around fitness coaching, a domain that requires continuous monitoring of complex actions and immediate corrective feedback. Experiments on this benchmark expose clear shortcomings in existing models for timing and relevance of responses. A basic end-to-end streaming model is shown to process live video and generate appropriate feedback asynchronously.

Core claim

Fitness coaching serves as a controlled testbed for situated interaction; existing state-of-the-art vision-language models exhibit limitations for asynchronous responses, whereas a simple end-to-end streaming baseline can recognize human actions and supply appropriate feedback at the appropriate time.

What carries the argument

The QEVD benchmark and dataset, which requires models to recognize complex human actions, identify mistakes, and generate real-time feedback during live fitness sessions.

If this is right

  • Models for real-world interaction must process continuous video streams rather than discrete prompted turns.
  • The timing of feedback is as important as its content in situated tasks.
  • Fitness coaching provides a measurable proxy for evaluating proactive AI behavior in other physical domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same streaming approach could be tested on related tasks such as physical therapy or assembly guidance.
  • Combining the baseline with stronger action-recognition backbones might close remaining performance gaps.
  • Future systems may need explicit mechanisms for deciding both content and delivery timing rather than relying on external prompts.

Load-bearing premise

Fitness coaching is a suitable controlled yet challenging domain that intrinsically requires monitoring live user activity and immediate feedback.

What would settle it

If leading vision-language models achieve high performance on the QEVD benchmark without streaming changes, or if the proposed baseline fails to match the timing and appropriateness of human coach responses.

Figures

Figures reproduced from arXiv: 2407.08101 by Antoine Mercier, Apratim Bhattacharyya, Cornelius Bohm, Florian Dietrichkeit, Guillaume Berger, Ingo Bax, Mark Todorovich, Mingu Lee, Pulkit Madan, Reza Pourreza, Roland Memisevic, Sunny Panchal, Xuanlin Li.

Figure 1
Figure 1. Figure 1: Long-range interactive videos from our Q [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example annotations available on the short video clips from the Q [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of the STREAM-VLM model. The visual stream is processed by a 3D CNN and the language backbone is a LLaMA-2-7B model; special action tokens (<next> and <feedback> ) are highlighted in orange. introduced to enable feedback delivery without explicit prompting. We begin by describing the vision backbone in detail, followed by the special action tokens, and finally, the training scheme. Vision back… view at source ↗
Figure 4
Figure 4. Figure 4: Predicted feedbacks on the FIT-COACH benchmark. The “turn-based” LLaMA-VID and LLaVA-NeXT models are unable to provide corrective feedback and instead generate overly generic and repetitive feedback. The STREAM-VLM model has learned to provide relevant feedback at the appropriate time. predicted feedbacks match the ground-truth both at the semantic level while containing references to specific important te… view at source ↗
Figure 5
Figure 5. Figure 5: Additional example annotations available on the short video clips from the Q [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Statistics of the short-clips in the QEVD-FIT-300K dataset [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Exercise distribution in (long-form) videos from the Q [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Simple QEVD-FIT-300K data collection web interface. D Training Details of the STREAM-VLM Model Here, we provide additional details of training the STREAM-VLM model described in Sec. 4 in the main paper. D.1 Vision Backbone After pre-training the 3D CNN model on ImageNet [18], we fine-tune the 3D CNN model on the fine-grained labels available on the short-clips dataset (full list in Appendix C.1). The 3D CN… view at source ↗
read the original abstract

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real-time, are an open challenge. In this work, we present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching -- a task which intrinsically requires monitoring live user activity and providing immediate feedback. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real-time. Our experiments reveal the limitations of existing state-of-the-art vision-language models for such asynchronous situated interactions. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the QEVD benchmark and associated dataset for evaluating vision-language models on asynchronous, situated interactions in the domain of live fitness coaching. It claims that existing SOTA VLMs are limited to turn-based interactions, demonstrates their shortcomings via experiments on this benchmark (recognizing actions, identifying mistakes, and providing timely feedback), and proposes a simple end-to-end streaming baseline capable of responding asynchronously with appropriate feedback at the right time.

Significance. If the benchmark construction, dataset quality, and experimental results hold, this work offers a controlled yet realistic testbed for an important open challenge in situated AI. The domain choice is explicitly motivated as requiring live monitoring and immediate feedback. Credit is due for creating a new empirical benchmark and dataset focused on proactive, real-time interaction rather than turn-based prompting, along with a minimal baseline that directly targets the identified gap.

major comments (2)
  1. [Abstract] Abstract: the claim that 'experiments reveal the limitations of existing state-of-the-art vision-language models' and that the baseline 'can respond asynchronously... at the appropriate time' is presented without any quantitative results, metrics, dataset statistics, or experimental details. This makes it impossible to evaluate whether the central empirical claims are supported.
  2. [Introduction / Benchmark description (inferred from abstract)] The manuscript positions fitness coaching as an effective testbed, but without the full methods section describing how action recognition, mistake identification, and timing of feedback are operationalized and measured (e.g., annotation protocol, temporal granularity, success criteria), it is unclear whether the benchmark actually isolates the asynchronous aspect or reduces to standard action recognition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for acknowledging the potential value of the QEVD benchmark as a testbed for asynchronous situated interaction. We address the major comments point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'experiments reveal the limitations of existing state-of-the-art vision-language models' and that the baseline 'can respond asynchronously... at the appropriate time' is presented without any quantitative results, metrics, dataset statistics, or experimental details. This makes it impossible to evaluate whether the central empirical claims are supported.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. The full manuscript reports detailed experimental results comparing SOTA VLMs against the streaming baseline on the QEVD tasks. In revision we will update the abstract to include key dataset statistics (e.g., number of videos and action instances) and summary performance metrics that illustrate the identified limitations and the baseline's ability to produce timely asynchronous feedback. revision: yes

  2. Referee: [Introduction / Benchmark description (inferred from abstract)] The manuscript positions fitness coaching as an effective testbed, but without the full methods section describing how action recognition, mistake identification, and timing of feedback are operationalized and measured (e.g., annotation protocol, temporal granularity, success criteria), it is unclear whether the benchmark actually isolates the asynchronous aspect or reduces to standard action recognition.

    Authors: The full manuscript contains a methods section that specifies the annotation protocol, temporal granularity of the streams, and success criteria for evaluating when feedback should be issued. The benchmark isolates the asynchronous requirement by evaluating models on continuous video input where they must both recognize actions/mistakes and decide the appropriate moment to respond, rather than reacting only to explicit user turns. We will add an explicit cross-reference in the introduction to the methods section to make this distinction clearer. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes an empirical benchmark (QEVD) and dataset for asynchronous situated interaction in fitness coaching, demonstrates limitations of existing VLMs through experiments, and introduces a simple streaming baseline. No derivations, equations, fitted parameters presented as predictions, or self-citation chains are described in the abstract or setup. The central claims rest on empirical evaluation and domain motivation rather than any self-referential reduction or imported uniqueness theorem. The work is self-contained as a benchmark proposal without load-bearing steps that collapse to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on free parameters, axioms, or invented entities; standard CV/ML assumptions likely apply but cannot be audited.

pith-pipeline@v0.9.0 · 5745 in / 986 out tokens · 20257 ms · 2026-05-23T22:49:41.559423+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

  1. [1]

    https://github.com/quic/sense

    Sense Core. https://github.com/quic/sense. [Online; accessed Oct-2024]

  2. [2]

    https://www.statista.com/topics/12564/home-fitness/ #topicOverview

    Statista home fitness. https://www.statista.com/topics/12564/home-fitness/ #topicOverview. [Online; accessed Oct-2024]

  3. [3]

    Llama 3 model card

    AI@Meta. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/ MODEL_CARD.md. [Online; accessed Oct-2024]

  4. [4]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

  5. [5]

    Situated dialogue learning through procedural environment generation

    Prithviraj Ammanabrolu, Renee Jia, and Mark O Riedl. Situated dialogue learning through procedural environment generation. In ACL, 2022

  6. [6]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015

  7. [7]

    Butt, Keith Curtis, Yooyoung Lee, Jonathan G

    George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan G. Fiscus, Afzal Godil, David Joy, Andrew Delgado, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, Georges Quénot, João Magalhães, David Semedo, and Saverio G. Blasi. TRECVID 2018: Bench- marking video activity detection, video captioning and matching, video storytelling linking and video sear...

  8. [8]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021

  9. [9]

    METEOR: an automatic metric for MT evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005

  10. [10]

    Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023

    Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao-Lin Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023

  11. [11]

    Look, remember and reason: Visual reasoning with grounded rationales

    Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Visual reasoning with grounded rationales. In ICLR, 2024

  12. [12]

    Platform for situated intelligence

    Dan Bohus, Sean Andrist, Ashley Feniello, Nick Saw, Mihai Jalobeanu, Pat Sweeney, Anne Loomis Thompson, and Eric Horvitz. Platform for situated intelligence. Technical report, Microsoft, 2021

  13. [13]

    Rodney A. Brooks. Intelligence without reason. In IJCAI, 1991

  14. [14]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

  15. [15]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023

  16. [16]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evange- los Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. In IJCV, 2022

  17. [17]

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017. 11

  18. [18]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  19. [19]

    Every picture tells a story: Generating sentences from images

    Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010

  20. [20]

    Aifit: Automatic 3d human-interpretable feedback models for fitness training

    Mihai Fieraru, Mihai Zanfir, Silviu Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. Aifit: Automatic 3d human-interpretable feedback models for fitness training. In CVPR, 2021

  21. [21]

    something some- thing

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something some- thing” video database for learning and evaluating visual common sense. In ICCV, 2017

  22. [22]

    Ego4d: Around the world in 3, 000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Ro- hit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, et al. Ego4d: Around the...

  23. [23]

    Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. CoRR, abs/2311.18259, 2023

  24. [24]

    AGQA: A bench- mark for compositional spatio-temporal reasoning

    Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. AGQA: A bench- mark for compositional spatio-temporal reasoning. In CVPR, 2021

  25. [25]

    Visual programming: Compositional visual reason- ing without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reason- ing without training. CoRR, abs/2211.11559, 2022

  26. [26]

    Activi- tynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activi- tynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015

  27. [27]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

  28. [28]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  29. [29]

    Lawrence Zitnick, and Ross B

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elemen- tary visual reasoning. In CVPR, 2017

  30. [30]

    FIXMYPOSE: pose correc- tional captioning and retrieval

    Hyounghun Kim, Abhay Zala, Graham Burri, and Mohit Bansal. FIXMYPOSE: pose correc- tional captioning and retrieval. In AAAI, 2021

  31. [31]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  32. [32]

    Ross, João Carreira, Alexander V ostrikov, and Andrew Zisserman

    Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander V ostrikov, and Andrew Zisserman. The ava-kinetics localized human actions video dataset. CoRR, abs/2005.00214, 2020

  33. [33]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. CoRR, abs/2311.17043, 2023. 12

  34. [34]

    Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo

    Yuncheng Li, Yale Song, Liangliang Cao, Joel R. Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated GIF description. In CVPR, 2016

  35. [35]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. CoRR, abs/2311.10122, 2023

  36. [36]

    Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

    Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 2004

  37. [37]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  38. [38]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https: //llava-vl.github.io/blog/2024-01-30-llava-next/

  39. [39]

    G-eval: NLG evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, EMNLP, 2023

  40. [40]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  41. [41]

    Chameleon: Plug-and-play compositional reasoning with large lan- guage models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large lan- guage models. CoRR, abs/2304.09842, 2023

  42. [42]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Abdul Rasheed, Salman H. Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. CoRR, abs/2306.05424, 2023

  43. [43]

    Something-else: Compositional action recognition with spatial-temporal interaction networks

    Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In CVPR, 2020

  44. [44]

    Is end-to-end learning enough for fitness activity recogni- tion? CoRR, abs/2305.08191, 2023

    Antoine Mercier, Guillaume Berger, Sunny Panchal, Florian Dietrichkeit, Cornelius Böhm, Ingo Bax, and Roland Memisevic. Is end-to-end learning enough for fitness activity recogni- tion? CoRR, abs/2305.08191, 2023

  45. [45]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019

  46. [46]

    Brown, Quanfu Fan, and Dan Gutfreund

    Mathew Monfort, Carl V ondrick, Aude Oliva, Alex Andonian, Bolei Zhou, Kandan Ramakr- ishnan, Sarah Adel Bargal, Tom Yan, Lisa M. Brown, Quanfu Fan, and Dan Gutfreund. Mo- ments in time dataset: One million videos for event understanding. In IEEE PAMI, 2020

  47. [47]

    Hello gpt-4o

    OpenAI. “Hello gpt-4o.”. https://openai.com/index/hello-gpt-4oâĄă. [Online; ac- cessed Oct-2024]

  48. [48]

    Sci- enceqa: A novel resource for question answering on scholarly articles

    Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Sci- enceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022

  49. [49]

    Assembly101: A large-scale multi-view video dataset for under- standing procedural activities

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for under- standing procedural activities. In CVPR, 2022

  50. [50]

    NTU RGB+D: A large scale dataset for 3d human activity analysis

    Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, 2016

  51. [51]

    Finegym: A hierarchical video dataset for fine-grained action understanding

    Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020. 13

  52. [52]

    Finegym: A hierarchical video dataset for fine-grained action understanding

    Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020

  53. [53]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hug- ginggpt: Solving AI tasks with chatgpt and its friends in huggingface. CoRR, abs/2303.17580, 2023

  54. [54]

    Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta

    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016

  55. [55]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012

  56. [56]

    ViperGPT: Visual Inference via Python Execution for Reasoning

    Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execu- tion for reasoning. CoRR, abs/2303.08128, 2023

  57. [57]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023

  58. [58]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023

  59. [59]

    Chatvideo: A tracklet-centric multimodal and versatile video understanding system

    Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. CoRR, abs/2304.14407, 2023

  60. [60]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019

  61. [61]

    Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bo- hus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world. In ICCV, 2023

  62. [62]

    STAR: A benchmark for situated reasoning in real-world videos

    Bo Wu, Shoubin Yu, Zhenfang Chen, Josh Tenenbaum, and Chuang Gan. STAR: A benchmark for situated reasoning in real-world videos. In NeurIPS, 2021

  63. [63]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, 2022

  64. [64]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Marcin Choromanski, Adrian Wong, Ste- fan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR, 2023

  65. [65]

    Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In EMNLP - System Demonstrations , 2023

  66. [66]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero- init attention. CoRR, abs/2303.16199, 2023

  67. [67]

    Weinberger, and Yoav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with BERT. In ICLR, 2020

  68. [68]

    Llava-next: A strong zero-shot video understanding model, April

    Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April

  69. [69]

    URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/ . 14

  70. [70]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multi- modal chain-of-thought reasoning in language models. CoRR, abs/2302.00923, 2023

  71. [71]

    Judging llm-as-a-judge with mt- bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt- bench and chatbot arena. In NeurIPS, 2023

  72. [72]

    What exercise is the user doing? Describe how they are doing it

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 15 Appendix A Overview Here we provide: (1) Additional qualitative examples from Q EVD -FIT-300K and Q EVD -FIT– COACH; (2) Additional data collection and annotation details, including instructions provided to the crowd-worker...

  73. [73]

    <exercise_name> (<variant>) – where variant may describe the side of the body being used or describing the position of the body

  74. [74]

    left” and “right

    <exercise_name> For example: Templated name: spider man pushup Descriptive response: The user is doing spider man pushups. Templated name: lunges (left leg out in front) Descriptive response: The user is doing lunges with their left leg in front. 17 (a) Distribution of fine-grained class labels. (b) Distribution of clip lengths. Figure 6: Statistics of th...