What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Antoine Mercier; Apratim Bhattacharyya; Cornelius Bohm; Florian Dietrichkeit; Guillaume Berger; Ingo Bax; Mark Todorovich; Mingu Lee; Pulkit Madan; Reza Pourreza

arxiv: 2407.08101 · v5 · submitted 2024-07-11 · 💻 cs.CV

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Sunny Panchal , Apratim Bhattacharyya , Guillaume Berger , Antoine Mercier , Cornelius Bohm , Florian Dietrichkeit , Reza Pourreza , Xuanlin Li

show 5 more authors

Pulkit Madan Mingu Lee Mark Todorovich Ingo Bax Roland Memisevic

This is my paper

Pith reviewed 2026-05-23 22:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords situated interactionvision-language modelsasynchronous interactionfitness coachingreal-time feedbackstreaming baselinebenchmark dataset

0 comments

The pith

A simple streaming baseline can deliver timely feedback in live fitness coaching while state-of-the-art vision-language models cannot handle asynchronous responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models handle prompted turn-based exchanges well but lack mechanisms for watching an unfolding situation and speaking up at the right moment without a user prompt. The authors introduce the QEVD benchmark and dataset built around fitness coaching, a domain that requires continuous monitoring of complex actions and immediate corrective feedback. Experiments on this benchmark expose clear shortcomings in existing models for timing and relevance of responses. A basic end-to-end streaming model is shown to process live video and generate appropriate feedback asynchronously.

Core claim

Fitness coaching serves as a controlled testbed for situated interaction; existing state-of-the-art vision-language models exhibit limitations for asynchronous responses, whereas a simple end-to-end streaming baseline can recognize human actions and supply appropriate feedback at the appropriate time.

What carries the argument

The QEVD benchmark and dataset, which requires models to recognize complex human actions, identify mistakes, and generate real-time feedback during live fitness sessions.

If this is right

Models for real-world interaction must process continuous video streams rather than discrete prompted turns.
The timing of feedback is as important as its content in situated tasks.
Fitness coaching provides a measurable proxy for evaluating proactive AI behavior in other physical domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same streaming approach could be tested on related tasks such as physical therapy or assembly guidance.
Combining the baseline with stronger action-recognition backbones might close remaining performance gaps.
Future systems may need explicit mechanisms for deciding both content and delivery timing rather than relying on external prompts.

Load-bearing premise

Fitness coaching is a suitable controlled yet challenging domain that intrinsically requires monitoring live user activity and immediate feedback.

What would settle it

If leading vision-language models achieve high performance on the QEVD benchmark without streaming changes, or if the proposed baseline fails to match the timing and appropriateness of human coach responses.

Figures

Figures reproduced from arXiv: 2407.08101 by Antoine Mercier, Apratim Bhattacharyya, Cornelius Bohm, Florian Dietrichkeit, Guillaume Berger, Ingo Bax, Mark Todorovich, Mingu Lee, Pulkit Madan, Reza Pourreza, Roland Memisevic, Sunny Panchal, Xuanlin Li.

**Figure 2.** Figure 2: Example annotations available on the short video clips from the Q [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture of the STREAM-VLM model. The visual stream is processed by a 3D CNN and the language backbone is a LLaMA-2-7B model; special action tokens (<next> and <feedback> ) are highlighted in orange. introduced to enable feedback delivery without explicit prompting. We begin by describing the vision backbone in detail, followed by the special action tokens, and finally, the training scheme. Vision back… view at source ↗

**Figure 4.** Figure 4: Predicted feedbacks on the FIT-COACH benchmark. The “turn-based” LLaMA-VID and LLaVA-NeXT models are unable to provide corrective feedback and instead generate overly generic and repetitive feedback. The STREAM-VLM model has learned to provide relevant feedback at the appropriate time. predicted feedbacks match the ground-truth both at the semantic level while containing references to specific important te… view at source ↗

**Figure 5.** Figure 5: Additional example annotations available on the short video clips from the Q [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Statistics of the short-clips in the QEVD-FIT-300K dataset [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Exercise distribution in (long-form) videos from the Q [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Simple QEVD-FIT-300K data collection web interface. D Training Details of the STREAM-VLM Model Here, we provide additional details of training the STREAM-VLM model described in Sec. 4 in the main paper. D.1 Vision Backbone After pre-training the 3D CNN model on ImageNet [18], we fine-tune the 3D CNN model on the fine-grained labels available on the short-clips dataset (full list in Appendix C.1). The 3D CN… view at source ↗

read the original abstract

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real-time, are an open challenge. In this work, we present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching -- a task which intrinsically requires monitoring live user activity and providing immediate feedback. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real-time. Our experiments reveal the limitations of existing state-of-the-art vision-language models for such asynchronous situated interactions. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces the QEVD benchmark for asynchronous feedback in live fitness coaching and shows current VLMs struggle while a streaming baseline handles timing better.

read the letter

The core contribution is a new benchmark and dataset called QEVD that tests vision-language models on recognizing actions, spotting mistakes, and giving timely feedback during ongoing fitness sessions. The authors also describe a simple streaming baseline that responds asynchronously instead of waiting for user prompts. This directly targets the gap between turn-based VLM use and real-time situated interaction. The fitness domain is a reasonable choice because it forces continuous monitoring and precise timing without being completely uncontrolled. The abstract makes a clear case that off-the-shelf models fall short here and that the baseline improves on that. The work is empirical and proposes a concrete testbed rather than claiming broad theoretical advances. That keeps the scope manageable. The main limitation visible from the abstract is the lack of quantitative numbers, implementation details, or direct comparisons, which makes it hard to judge how large the gap actually is or how robust the baseline remains under different conditions. The domain is also fairly narrow, so readers interested in general robotics or assistance systems will need to do their own extrapolation. Overall the framing is coherent and the problem is well-motivated. This is worth sending to peer review because it supplies a usable benchmark and a minimal working approach that others can build on or critique with full results.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the QEVD benchmark and associated dataset for evaluating vision-language models on asynchronous, situated interactions in the domain of live fitness coaching. It claims that existing SOTA VLMs are limited to turn-based interactions, demonstrates their shortcomings via experiments on this benchmark (recognizing actions, identifying mistakes, and providing timely feedback), and proposes a simple end-to-end streaming baseline capable of responding asynchronously with appropriate feedback at the right time.

Significance. If the benchmark construction, dataset quality, and experimental results hold, this work offers a controlled yet realistic testbed for an important open challenge in situated AI. The domain choice is explicitly motivated as requiring live monitoring and immediate feedback. Credit is due for creating a new empirical benchmark and dataset focused on proactive, real-time interaction rather than turn-based prompting, along with a minimal baseline that directly targets the identified gap.

major comments (2)

[Abstract] Abstract: the claim that 'experiments reveal the limitations of existing state-of-the-art vision-language models' and that the baseline 'can respond asynchronously... at the appropriate time' is presented without any quantitative results, metrics, dataset statistics, or experimental details. This makes it impossible to evaluate whether the central empirical claims are supported.
[Introduction / Benchmark description (inferred from abstract)] The manuscript positions fitness coaching as an effective testbed, but without the full methods section describing how action recognition, mistake identification, and timing of feedback are operationalized and measured (e.g., annotation protocol, temporal granularity, success criteria), it is unclear whether the benchmark actually isolates the asynchronous aspect or reduces to standard action recognition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for acknowledging the potential value of the QEVD benchmark as a testbed for asynchronous situated interaction. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'experiments reveal the limitations of existing state-of-the-art vision-language models' and that the baseline 'can respond asynchronously... at the appropriate time' is presented without any quantitative results, metrics, dataset statistics, or experimental details. This makes it impossible to evaluate whether the central empirical claims are supported.

Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. The full manuscript reports detailed experimental results comparing SOTA VLMs against the streaming baseline on the QEVD tasks. In revision we will update the abstract to include key dataset statistics (e.g., number of videos and action instances) and summary performance metrics that illustrate the identified limitations and the baseline's ability to produce timely asynchronous feedback. revision: yes
Referee: [Introduction / Benchmark description (inferred from abstract)] The manuscript positions fitness coaching as an effective testbed, but without the full methods section describing how action recognition, mistake identification, and timing of feedback are operationalized and measured (e.g., annotation protocol, temporal granularity, success criteria), it is unclear whether the benchmark actually isolates the asynchronous aspect or reduces to standard action recognition.

Authors: The full manuscript contains a methods section that specifies the annotation protocol, temporal granularity of the streams, and success criteria for evaluating when feedback should be issued. The benchmark isolates the asynchronous requirement by evaluating models on continuous video input where they must both recognize actions/mistakes and decide the appropriate moment to respond, rather than reacting only to explicit user turns. We will add an explicit cross-reference in the introduction to the methods section to make this distinction clearer. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes an empirical benchmark (QEVD) and dataset for asynchronous situated interaction in fitness coaching, demonstrates limitations of existing VLMs through experiments, and introduces a simple streaming baseline. No derivations, equations, fitted parameters presented as predictions, or self-citation chains are described in the abstract or setup. The central claims rest on empirical evaluation and domain motivation rather than any self-referential reduction or imported uniqueness theorem. The work is self-contained as a benchmark proposal without load-bearing steps that collapse to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on free parameters, axioms, or invented entities; standard CV/ML assumptions likely apply but cannot be audited.

pith-pipeline@v0.9.0 · 5745 in / 986 out tokens · 20257 ms · 2026-05-23T22:49:41.559423+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

[1]

https://github.com/quic/sense

Sense Core. https://github.com/quic/sense. [Online; accessed Oct-2024]

work page 2024
[2]

https://www.statista.com/topics/12564/home-fitness/ #topicOverview

Statista home fitness. https://www.statista.com/topics/12564/home-fitness/ #topicOverview. [Online; accessed Oct-2024]

work page 2024
[3]

Llama 3 model card

AI@Meta. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/ MODEL_CARD.md. [Online; accessed Oct-2024]

work page 2024
[4]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

work page 2022
[5]

Situated dialogue learning through procedural environment generation

Prithviraj Ammanabrolu, Renee Jia, and Mark O Riedl. Situated dialogue learning through procedural environment generation. In ACL, 2022

work page 2022
[6]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015

work page 2015
[7]

Butt, Keith Curtis, Yooyoung Lee, Jonathan G

George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan G. Fiscus, Afzal Godil, David Joy, Andrew Delgado, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, Georges Quénot, João Magalhães, David Semedo, and Saverio G. Blasi. TRECVID 2018: Bench- marking video activity detection, video captioning and matching, video storytelling linking and video sear...

work page 2018
[8]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021

work page 2021
[9]

METEOR: an automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005

work page 2005
[10]

Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023

Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao-Lin Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023

work page 2023
[11]

Look, remember and reason: Visual reasoning with grounded rationales

Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Visual reasoning with grounded rationales. In ICLR, 2024

work page 2024
[12]

Platform for situated intelligence

Dan Bohus, Sean Andrist, Ashley Feniello, Nick Saw, Mihai Jalobeanu, Pat Sweeney, Anne Loomis Thompson, and Eric Horvitz. Platform for situated intelligence. Technical report, Microsoft, 2021

work page 2021
[13]

Rodney A. Brooks. Intelligence without reason. In IJCAI, 1991

work page 1991
[14]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

work page 2017
[15]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023

work page 2023
[16]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evange- los Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. In IJCV, 2022

work page 2022
[17]

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017. 11

work page 2017
[18]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

work page 2009
[19]

Every picture tells a story: Generating sentences from images

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010

work page 2010
[20]

Aifit: Automatic 3d human-interpretable feedback models for fitness training

Mihai Fieraru, Mihai Zanfir, Silviu Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. Aifit: Automatic 3d human-interpretable feedback models for fitness training. In CVPR, 2021

work page 2021
[21]

something some- thing

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something some- thing” video database for learning and evaluating visual common sense. In ICCV, 2017

work page 2017
[22]

Ego4d: Around the world in 3, 000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Ro- hit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, et al. Ego4d: Around the...

work page 2022
[23]

Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. CoRR, abs/2311.18259, 2023

work page arXiv 2023
[24]

AGQA: A bench- mark for compositional spatio-temporal reasoning

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. AGQA: A bench- mark for compositional spatio-temporal reasoning. In CVPR, 2021

work page 2021
[25]

Visual programming: Compositional visual reason- ing without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reason- ing without training. CoRR, abs/2211.11559, 2022

work page arXiv 2022
[26]

Activi- tynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activi- tynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015

work page 2015
[27]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

work page 2022
[28]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Lawrence Zitnick, and Ross B

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elemen- tary visual reasoning. In CVPR, 2017

work page 2017
[30]

FIXMYPOSE: pose correc- tional captioning and retrieval

Hyounghun Kim, Abhay Zala, Graham Burri, and Mohit Bansal. FIXMYPOSE: pose correc- tional captioning and retrieval. In AAAI, 2021

work page 2021
[31]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

Ross, João Carreira, Alexander V ostrikov, and Andrew Zisserman

Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander V ostrikov, and Andrew Zisserman. The ava-kinetics localized human actions video dataset. CoRR, abs/2005.00214, 2020

work page arXiv 2005
[33]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. CoRR, abs/2311.17043, 2023. 12

work page arXiv 2023
[34]

Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo

Yuncheng Li, Yale Song, Liangliang Cao, Joel R. Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated GIF description. In CVPR, 2016

work page 2016
[35]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. CoRR, abs/2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 2004

work page 2004
[37]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

work page 2023
[38]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https: //llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024
[39]

G-eval: NLG evaluation using gpt-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, EMNLP, 2023

work page 2023
[40]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Chameleon: Plug-and-play compositional reasoning with large lan- guage models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large lan- guage models. CoRR, abs/2304.09842, 2023

work page arXiv 2023
[42]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Abdul Rasheed, Salman H. Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. CoRR, abs/2306.05424, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Something-else: Compositional action recognition with spatial-temporal interaction networks

Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In CVPR, 2020

work page 2020
[44]

Is end-to-end learning enough for fitness activity recogni- tion? CoRR, abs/2305.08191, 2023

Antoine Mercier, Guillaume Berger, Sunny Panchal, Florian Dietrichkeit, Cornelius Böhm, Ingo Bax, and Roland Memisevic. Is end-to-end learning enough for fitness activity recogni- tion? CoRR, abs/2305.08191, 2023

work page arXiv 2023
[45]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019

work page 2019
[46]

Brown, Quanfu Fan, and Dan Gutfreund

Mathew Monfort, Carl V ondrick, Aude Oliva, Alex Andonian, Bolei Zhou, Kandan Ramakr- ishnan, Sarah Adel Bargal, Tom Yan, Lisa M. Brown, Quanfu Fan, and Dan Gutfreund. Mo- ments in time dataset: One million videos for event understanding. In IEEE PAMI, 2020

work page 2020
[47]

Hello gpt-4o

OpenAI. “Hello gpt-4o.”. https://openai.com/index/hello-gpt-4oâĄă. [Online; ac- cessed Oct-2024]

work page 2024
[48]

Sci- enceqa: A novel resource for question answering on scholarly articles

Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Sci- enceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022

work page 2022
[49]

Assembly101: A large-scale multi-view video dataset for under- standing procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for under- standing procedural activities. In CVPR, 2022

work page 2022
[50]

NTU RGB+D: A large scale dataset for 3d human activity analysis

Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, 2016

work page 2016
[51]

Finegym: A hierarchical video dataset for fine-grained action understanding

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020. 13

work page 2020
[52]

Finegym: A hierarchical video dataset for fine-grained action understanding

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020

work page 2020
[53]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hug- ginggpt: Solving AI tasks with chatgpt and its friends in huggingface. CoRR, abs/2303.17580, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016

work page 2016
[55]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[56]

ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execu- tion for reasoning. CoRR, abs/2303.08128, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Chatvideo: A tracklet-centric multimodal and versatile video understanding system

Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. CoRR, abs/2304.14407, 2023

work page arXiv 2023
[60]

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019

work page 2019
[61]

Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world

Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bo- hus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world. In ICCV, 2023

work page 2023
[62]

STAR: A benchmark for situated reasoning in real-world videos

Bo Wu, Shoubin Yu, Zhenfang Chen, Josh Tenenbaum, and Chuang Gan. STAR: A benchmark for situated reasoning in real-world videos. In NeurIPS, 2021

work page 2021
[63]

Advancing high-resolution video-language representation with large-scale video transcriptions

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, 2022

work page 2022
[64]

Socratic models: Composing zero-shot multimodal reasoning with language

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Marcin Choromanski, Adrian Wong, Ste- fan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR, 2023

work page 2023
[65]

Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In EMNLP - System Demonstrations , 2023

work page 2023
[66]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero- init attention. CoRR, abs/2303.16199, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[67]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with BERT. In ICLR, 2020

work page 2020
[68]

Llava-next: A strong zero-shot video understanding model, April

Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April

work page
[69]

URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/ . 14

work page 2024
[70]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multi- modal chain-of-thought reasoning in language models. CoRR, abs/2302.00923, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[71]

Judging llm-as-a-judge with mt- bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt- bench and chatbot arena. In NeurIPS, 2023

work page 2023
[72]

What exercise is the user doing? Describe how they are doing it

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 15 Appendix A Overview Here we provide: (1) Additional qualitative examples from Q EVD -FIT-300K and Q EVD -FIT– COACH; (2) Additional data collection and annotation details, including instructions provided to the crowd-worker...

work page 2018
[73]

<exercise_name> (<variant>) – where variant may describe the side of the body being used or describing the position of the body

work page
[74]

left” and “right

<exercise_name> For example: Templated name: spider man pushup Descriptive response: The user is doing spider man pushups. Templated name: lunges (left leg out in front) Descriptive response: The user is doing lunges with their left leg in front. 17 (a) Distribution of fine-grained class labels. (b) Distribution of clip lengths. Figure 6: Statistics of th...

work page

[1] [1]

https://github.com/quic/sense

Sense Core. https://github.com/quic/sense. [Online; accessed Oct-2024]

work page 2024

[2] [2]

https://www.statista.com/topics/12564/home-fitness/ #topicOverview

Statista home fitness. https://www.statista.com/topics/12564/home-fitness/ #topicOverview. [Online; accessed Oct-2024]

work page 2024

[3] [3]

Llama 3 model card

AI@Meta. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/ MODEL_CARD.md. [Online; accessed Oct-2024]

work page 2024

[4] [4]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

work page 2022

[5] [5]

Situated dialogue learning through procedural environment generation

Prithviraj Ammanabrolu, Renee Jia, and Mark O Riedl. Situated dialogue learning through procedural environment generation. In ACL, 2022

work page 2022

[6] [6]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015

work page 2015

[7] [7]

Butt, Keith Curtis, Yooyoung Lee, Jonathan G

George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan G. Fiscus, Afzal Godil, David Joy, Andrew Delgado, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, Georges Quénot, João Magalhães, David Semedo, and Saverio G. Blasi. TRECVID 2018: Bench- marking video activity detection, video captioning and matching, video storytelling linking and video sear...

work page 2018

[8] [8]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021

work page 2021

[9] [9]

METEOR: an automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005

work page 2005

[10] [10]

Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023

Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao-Lin Zheng, and Joyce Chai. Can foundation models watch, talk and guide you step by step to make a cake? In EMNLP Findings, 2023

work page 2023

[11] [11]

Look, remember and reason: Visual reasoning with grounded rationales

Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, and Roland Memisevic. Look, remember and reason: Visual reasoning with grounded rationales. In ICLR, 2024

work page 2024

[12] [12]

Platform for situated intelligence

Dan Bohus, Sean Andrist, Ashley Feniello, Nick Saw, Mihai Jalobeanu, Pat Sweeney, Anne Loomis Thompson, and Eric Horvitz. Platform for situated intelligence. Technical report, Microsoft, 2021

work page 2021

[13] [13]

Rodney A. Brooks. Intelligence without reason. In IJCAI, 1991

work page 1991

[14] [14]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

work page 2017

[15] [15]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023

work page 2023

[16] [16]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evange- los Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. In IJCV, 2022

work page 2022

[17] [17]

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017. 11

work page 2017

[18] [18]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

work page 2009

[19] [19]

Every picture tells a story: Generating sentences from images

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010

work page 2010

[20] [20]

Aifit: Automatic 3d human-interpretable feedback models for fitness training

Mihai Fieraru, Mihai Zanfir, Silviu Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. Aifit: Automatic 3d human-interpretable feedback models for fitness training. In CVPR, 2021

work page 2021

[21] [21]

something some- thing

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something some- thing” video database for learning and evaluating visual common sense. In ICCV, 2017

work page 2017

[22] [22]

Ego4d: Around the world in 3, 000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Ro- hit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, et al. Ego4d: Around the...

work page 2022

[23] [23]

Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Tri- antafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. CoRR, abs/2311.18259, 2023

work page arXiv 2023

[24] [24]

AGQA: A bench- mark for compositional spatio-temporal reasoning

Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. AGQA: A bench- mark for compositional spatio-temporal reasoning. In CVPR, 2021

work page 2021

[25] [25]

Visual programming: Compositional visual reason- ing without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reason- ing without training. CoRR, abs/2211.11559, 2022

work page arXiv 2022

[26] [26]

Activi- tynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activi- tynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015

work page 2015

[27] [27]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

work page 2022

[28] [28]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Lawrence Zitnick, and Ross B

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elemen- tary visual reasoning. In CVPR, 2017

work page 2017

[30] [30]

FIXMYPOSE: pose correc- tional captioning and retrieval

Hyounghun Kim, Abhay Zala, Graham Burri, and Mohit Bansal. FIXMYPOSE: pose correc- tional captioning and retrieval. In AAAI, 2021

work page 2021

[31] [31]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[32] [32]

Ross, João Carreira, Alexander V ostrikov, and Andrew Zisserman

Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander V ostrikov, and Andrew Zisserman. The ava-kinetics localized human actions video dataset. CoRR, abs/2005.00214, 2020

work page arXiv 2005

[33] [33]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. CoRR, abs/2311.17043, 2023. 12

work page arXiv 2023

[34] [34]

Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo

Yuncheng Li, Yale Song, Liangliang Cao, Joel R. Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A new dataset and benchmark on animated GIF description. In CVPR, 2016

work page 2016

[35] [35]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. CoRR, abs/2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics

Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, 2004

work page 2004

[37] [37]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

work page 2023

[38] [38]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https: //llava-vl.github.io/blog/2024-01-30-llava-next/

work page 2024

[39] [39]

G-eval: NLG evaluation using gpt-4 with better human alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, EMNLP, 2023

work page 2023

[40] [40]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Chameleon: Plug-and-play compositional reasoning with large lan- guage models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large lan- guage models. CoRR, abs/2304.09842, 2023

work page arXiv 2023

[42] [42]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Abdul Rasheed, Salman H. Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. CoRR, abs/2306.05424, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Something-else: Compositional action recognition with spatial-temporal interaction networks

Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. Something-else: Compositional action recognition with spatial-temporal interaction networks. In CVPR, 2020

work page 2020

[44] [44]

Is end-to-end learning enough for fitness activity recogni- tion? CoRR, abs/2305.08191, 2023

Antoine Mercier, Guillaume Berger, Sunny Panchal, Florian Dietrichkeit, Cornelius Böhm, Ingo Bax, and Roland Memisevic. Is end-to-end learning enough for fitness activity recogni- tion? CoRR, abs/2305.08191, 2023

work page arXiv 2023

[45] [45]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019

work page 2019

[46] [46]

Brown, Quanfu Fan, and Dan Gutfreund

Mathew Monfort, Carl V ondrick, Aude Oliva, Alex Andonian, Bolei Zhou, Kandan Ramakr- ishnan, Sarah Adel Bargal, Tom Yan, Lisa M. Brown, Quanfu Fan, and Dan Gutfreund. Mo- ments in time dataset: One million videos for event understanding. In IEEE PAMI, 2020

work page 2020

[47] [47]

Hello gpt-4o

OpenAI. “Hello gpt-4o.”. https://openai.com/index/hello-gpt-4oâĄă. [Online; ac- cessed Oct-2024]

work page 2024

[48] [48]

Sci- enceqa: A novel resource for question answering on scholarly articles

Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Sci- enceqa: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, 23(3):289–301, 2022

work page 2022

[49] [49]

Assembly101: A large-scale multi-view video dataset for under- standing procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for under- standing procedural activities. In CVPR, 2022

work page 2022

[50] [50]

NTU RGB+D: A large scale dataset for 3d human activity analysis

Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, 2016

work page 2016

[51] [51]

Finegym: A hierarchical video dataset for fine-grained action understanding

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020. 13

work page 2020

[52] [52]

Finegym: A hierarchical video dataset for fine-grained action understanding

Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020

work page 2020

[53] [53]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hug- ginggpt: Solving AI tasks with chatgpt and its friends in huggingface. CoRR, abs/2303.17580, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [54]

Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016

work page 2016

[55] [55]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[56] [56]

ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execu- tion for reasoning. CoRR, abs/2303.08128, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [57]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[58] [58]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Chatvideo: A tracklet-centric multimodal and versatile video understanding system

Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. CoRR, abs/2304.14407, 2023

work page arXiv 2023

[60] [60]

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019

work page 2019

[61] [61]

Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world

Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bo- hus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. Holoassist: an egocentric human interaction dataset for interactive AI assistants in the real world. In ICCV, 2023

work page 2023

[62] [62]

STAR: A benchmark for situated reasoning in real-world videos

Bo Wu, Shoubin Yu, Zhenfang Chen, Josh Tenenbaum, and Chuang Gan. STAR: A benchmark for situated reasoning in real-world videos. In NeurIPS, 2021

work page 2021

[63] [63]

Advancing high-resolution video-language representation with large-scale video transcriptions

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, 2022

work page 2022

[64] [64]

Socratic models: Composing zero-shot multimodal reasoning with language

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Marcin Choromanski, Adrian Wong, Ste- fan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR, 2023

work page 2023

[65] [65]

Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In EMNLP - System Demonstrations , 2023

work page 2023

[66] [66]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero- init attention. CoRR, abs/2303.16199, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[67] [67]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with BERT. In ICLR, 2020

work page 2020

[68] [68]

Llava-next: A strong zero-shot video understanding model, April

Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April

work page

[69] [69]

URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/ . 14

work page 2024

[70] [70]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multi- modal chain-of-thought reasoning in language models. CoRR, abs/2302.00923, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[71] [71]

Judging llm-as-a-judge with mt- bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt- bench and chatbot arena. In NeurIPS, 2023

work page 2023

[72] [72]

What exercise is the user doing? Describe how they are doing it

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 15 Appendix A Overview Here we provide: (1) Additional qualitative examples from Q EVD -FIT-300K and Q EVD -FIT– COACH; (2) Additional data collection and annotation details, including instructions provided to the crowd-worker...

work page 2018

[73] [73]

<exercise_name> (<variant>) – where variant may describe the side of the body being used or describing the position of the body

work page

[74] [74]

left” and “right

<exercise_name> For example: Templated name: spider man pushup Descriptive response: The user is doing spider man pushups. Templated name: lunges (left leg out in front) Descriptive response: The user is doing lunges with their left leg in front. 17 (a) Distribution of fine-grained class labels. (b) Distribution of clip lengths. Figure 6: Statistics of th...

work page