pith. machine review for the scientific record.

arxiv: 2604.08294 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords vision language models · action quality assessment · empirical evaluation · model biases · fine-grained assessment · multimodal models · sports analytics · physical therapy applications

The pith

Vision language models perform only marginally above random chance on action quality assessment and retain core limitations after bias fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates leading vision language models on judging the quality of actions in domains such as fitness exercise, figure skating and diving. Models like Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 score barely above random even when given skeleton data, grounding instructions or in-context examples. Two recurring biases emerge: a default assumption of correct execution and sensitivity to how the question is worded. Reformulating tasks to be contrastive reduces these biases only slightly, which leads the authors to conclude that the models face a deeper problem with fine-grained movement assessment. The work supplies a clear baseline and maps out specific failure modes that must be addressed before VLMs can be used reliably in coaching or therapy.
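The contrastive reformulation mentioned above swaps an absolute judgment for a forced-choice comparison between two clips. A minimal sketch of the two framings, with hypothetical wording rather than the paper's exact appendix templates:

```python
# Sketch of the two task framings; prompt text is illustrative, not the
# paper's exact templates (those appear in its appendix).

def absolute_prompt(action: str) -> str:
    # Absolute framing: judge one clip on its own. The paper reports a
    # bias toward answering "True" (correct execution) in this setting.
    return (
        f"You are analyzing a video of someone performing: {action}. "
        "Is the execution correct? Respond with ONLY 'True' or 'False'."
    )

def contrastive_prompt(action: str) -> str:
    # Contrastive framing: compare two clips of the same action, which
    # removes the "assume correct" default but, per the paper, helps little.
    return (
        f"You are analyzing two videos, A and B, of someone performing: {action}. "
        "Which execution is better? Respond with ONLY 'A' or 'B'."
    )

print(contrastive_prompt("squat"))
```

Under the pairwise framing, chance is 50% regardless of the dataset's base rate of correct executions, which is what makes near-chance contrastive accuracy diagnostic.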

Core claim

State-of-the-art VLMs achieve only marginal performance above random chance across activity domains, tasks, visual representations and prompting strategies. Isolated gains appear from skeleton incorporation, reasoning structures or in-context learning, yet no strategy works consistently. Prediction distributions expose a bias toward labeling executions as correct regardless of evidence and a sensitivity to superficial linguistic framing. Reformulating the tasks contrastively to counter these biases produces minimal improvement, which indicates that the models' shortcomings extend beyond surface biases to a fundamental difficulty with fine-grained movement quality assessment.

What carries the argument

Large-scale empirical evaluation of multiple VLMs on AQA datasets, combined with systematic testing of representations and prompts plus analysis of output distributions to identify biases.
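A minimal sketch of what such an output-distribution analysis could look like for binary correct/incorrect tasks; the function names and data layout are illustrative assumptions, not the paper's code:

```python
from typing import Dict, List

def correctness_bias(predictions: List[bool], labels: List[bool]) -> Dict[str, float]:
    # Compare how often the model answers "correct" with how often the
    # ground truth actually is correct; a large gap between the two rates
    # signals the over-prediction bias the paper reports.
    n = len(labels)
    return {
        "predicted_correct_rate": sum(predictions) / n,
        "true_correct_rate": sum(labels) / n,
        "accuracy": sum(p == y for p, y in zip(predictions, labels)) / n,
    }

def framing_sensitivity(preds_a: List[bool], preds_b: List[bool]) -> float:
    # Fraction of items whose answer flips between two paraphrases of the
    # same question (with negations aligned); nonzero values indicate
    # sensitivity to superficial linguistic framing.
    return sum(x != y for x, y in zip(preds_a, preds_b)) / len(preds_a)
```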

If this is right

  • VLMs cannot yet be deployed for reliable action quality scoring in physical therapy, sports coaching or judging.
  • Bias mitigation through contrastive reformulation or added modalities is insufficient on its own.
  • Future VLM-based AQA systems must target deeper architectural or training changes to handle fine-grained kinematic differences.
  • The reported baseline enables direct measurement of progress in subsequent research.
  • Identified failure modes such as over-prediction of correctness should be explicitly tested in new model releases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same models may face comparable limits on other tasks that require precise discrimination of temporal or kinematic variations, such as medical movement analysis.
  • Training data may lack enough examples of subtle quality differences, suggesting data curation as one possible direction.
  • Integration of explicit biomechanical constraints could help overcome the observed ceiling on performance.
  • Real-time or multi-view AQA scenarios would likely expose the limitation even more clearly.

Load-bearing premise

The chosen datasets, activity domains and prompting strategies are representative enough to support a general claim of fundamental limitation rather than a limitation of the tested setups.

What would settle it

Demonstration of a VLM that achieves high accuracy on the same or similar AQA tasks by reliably distinguishing subtle execution differences without defaulting to positive predictions would falsify the claim of fundamental difficulty.

Figures

Figures reproduced from arXiv: 2604.08294 by Miguel Monte e Freitas, Pedro Henrique Martins, Ricardo Rei, Rui Henriques.

Figure 1: Prompting effects on model predictions. For classification tasks (top), each subplot shows the percentage of predictions matching the correct execution form. For regression tasks (bottom), each subplot shows the mean predicted score normalised to 0–1. In both cases, results are shown across all models and three prompting strategies: base prompt (blue), positive guidelines (green), and negative guidelines (…).
Figure 2: Accuracy (%) on contrastive tasks for MTL-AQA and FineFS as a function of the score gap between samples. Larger gaps generally correspond to easier comparisons, as the quality difference between samples becomes more pronounced.
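A gap-binned accuracy curve like Figure 2's can be recomputed from scored contrastive pairs. A minimal sketch, assuming a hypothetical (score_a, score_b, model_chose_a) layout for the pairs:

```python
import numpy as np

def accuracy_by_score_gap(pairs, bin_edges):
    # pairs: (score_a, score_b, model_chose_a) triples from a contrastive
    # task; ties (score_a == score_b) are excluded since neither choice
    # is wrong. Returns mean accuracy per absolute-score-gap bin.
    pairs = [(a, b, c) for a, b, c in pairs if a != b]
    gaps = np.array([abs(a - b) for a, b, _ in pairs])
    correct = np.array([chose_a == (a > b) for a, b, chose_a in pairs])
    bins = np.digitize(gaps, bin_edges)
    return {int(i): float(correct[bins == i].mean()) for i in np.unique(bins)}
```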
read the original abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents a comprehensive empirical evaluation of state-of-the-art VLMs (Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) on Action Quality Assessment (AQA) across domains including fitness, figure skating, and diving. Baseline performance is reported as only marginally above random chance; strategies such as skeleton input, grounding instructions, reasoning structures, in-context learning, and contrastive reformulation produce only isolated or minimal gains. The authors identify two systematic biases (over-prediction of correct execution and sensitivity to linguistic framing) and conclude that the results indicate a fundamental difficulty with fine-grained movement quality assessment, establishing a baseline for future VLM-based AQA work.

Significance. If the empirical findings hold under broader testing, the work is significant for documenting current VLM limitations on fine-grained visual judgment tasks with direct relevance to physical therapy, sports coaching, and judging applications. It supplies a reproducible baseline, identifies concrete failure modes (bias patterns and resistance to standard mitigations), and offers an actionable outline that can guide targeted improvements in visual reasoning for quality assessment.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Discussion/Conclusion): The claim that results point to a 'fundamental difficulty' with fine-grained movement quality assessment is load-bearing for the paper's central contribution. This inference rests on the tested domains, task framings, and prompting repertoire being representative; the manuscript provides no experiments with additional domains, multi-turn visual reasoning protocols, or lightweight adaptation/fine-tuning regimes that could produce substantially higher accuracy, so the generalization from 'minimal improvement in tested configurations' to 'fundamental limitation' is not yet supported by the evidence.
  2. [Results section] Results section (tables/figures reporting performance): The manuscript reports near-chance performance and minimal gains from mitigations but does not include exact sample sizes per domain, statistical significance tests (e.g., p-values or confidence intervals on accuracy differences), or full per-model/per-strategy result tables. Without these, it is difficult to evaluate the robustness of the 'only marginally above random chance' and 'minimal improvement' claims that underpin the fundamental-difficulty conclusion.
minor comments (3)
  1. [Abstract and §3] Clarify model version nomenclature (e.g., 'Gemini 3.1 Pro' appears to be a non-standard designation; confirm whether this refers to Gemini 1.5 Pro, Gemini 2.0, or another variant).
  2. [§3 (Experimental Setup)] Add explicit dataset statistics (number of videos/clips per activity domain, train/test splits, and annotation protocols) to the experimental setup section for reproducibility.
  3. [Figures and Tables] Ensure all figures include error bars or variance measures and that table captions fully describe the metrics and conditions shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help strengthen the rigor and clarity of our empirical evaluation. We respond to each major comment below and describe the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Discussion/Conclusion): The claim that results point to a 'fundamental difficulty' with fine-grained movement quality assessment is load-bearing for the paper's central contribution. This inference rests on the tested domains, task framings, and prompting repertoire being representative; the manuscript provides no experiments with additional domains, multi-turn visual reasoning protocols, or lightweight adaptation/fine-tuning regimes that could produce substantially higher accuracy, so the generalization from 'minimal improvement in tested configurations' to 'fundamental limitation' is not yet supported by the evidence.

    Authors: We agree that the phrasing 'fundamental difficulty' represents a strong generalization. Our evaluation covers representative AQA domains (fitness, figure skating, diving) and a broad range of standard prompting and input strategies, yet we acknowledge that the absence of multi-turn protocols or adaptation experiments limits the strength of the claim. In revision, we will moderate the language in the abstract and Section 5 to state that the results 'highlight substantial challenges and limitations in current VLM performance on fine-grained AQA' rather than asserting a fundamental difficulty. We will also add explicit discussion of the evaluation's scope and the value of future adaptation-based work. revision: partial

  2. Referee: [Results section] Results section (tables/figures reporting performance): The manuscript reports near-chance performance and minimal gains from mitigations but does not include exact sample sizes per domain, statistical significance tests (e.g., p-values or confidence intervals on accuracy differences), or full per-model/per-strategy result tables. Without these, it is difficult to evaluate the robustness of the 'only marginally above random chance' and 'minimal improvement' claims that underpin the fundamental-difficulty conclusion.

    Authors: We thank the referee for identifying this presentational gap. In the revised manuscript we will report exact sample sizes per domain and task. We will add statistical tests, including binomial tests against chance level and confidence intervals on accuracy differences, along with pairwise comparisons between prompting strategies. Complete per-model and per-strategy result tables will be placed in an appendix to support transparent evaluation of all claims. revision: yes
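As a sketch of the statistics the rebuttal commits to, an exact binomial test against chance with a confidence interval on accuracy could look as follows; the counts are hypothetical:

```python
from scipy.stats import binomtest

def test_against_chance(n_correct: int, n_total: int, chance: float = 0.5):
    # Exact binomial test of observed accuracy against the task's chance
    # level, plus a 95% Clopper-Pearson interval on the true accuracy.
    result = binomtest(n_correct, n_total, p=chance, alternative="greater")
    ci = result.proportion_ci(confidence_level=0.95)
    return {
        "accuracy": n_correct / n_total,
        "p_value": result.pvalue,
        "ci_95": (ci.low, ci.high),
    }

# e.g. 540 correct answers on 1000 binary items against 50% chance:
print(test_against_chance(540, 1000))
```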

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking with no derivations or self-referential reductions

full rationale

The paper is an empirical evaluation study that benchmarks VLMs on AQA tasks using existing datasets, reports observed accuracies and biases, and tests prompting variants. No equations, fitted parameters, or first-principles derivations appear; claims rest on direct experimental outcomes rather than any reduction to inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings. The generalization to 'fundamental difficulty' is an interpretive step whose strength depends on domain coverage, but this is a question of external validity, not circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, new axioms beyond standard AI evaluation practices, or invented entities; it relies on existing VLM architectures and benchmark tasks.

axioms (1)
  • domain assumption: Action quality can be meaningfully quantified and compared using video inputs and language-based scoring prompts on the selected activity domains.
    This premise enables the entire benchmarking framework.

pith-pipeline@v0.9.0 · 5514 in / 1131 out tokens · 64621 ms · 2026-05-10T18:29:11.933973+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    System card: Claude Sonnet 4.6

    Anthropic. System card: Claude Sonnet 4.6. Technical report, Anthropic, 2026.

  2. [2]

    Action quality assessment with temporal parsing transformer

    Yang Bai, Desen Zhou, Songyang Zhang, Jian Wang, Errui Ding, Yu Guan, Yang Long, and Jingdong Wang. Action quality assessment with temporal parsing transformer. In European Conference on Computer Vision, pages 422–438,

  3. [3]

    Human Motion Analysis Using 3D Skeleton Representation in the Context of Real-World Applications: From Home-Based Rehabilitation to Sensing in the Wild

    Renato Manuel Lemos Baptista. Human Motion Analysis Using 3D Skeleton Representation in the Context of Real-World Applications: From Home-Based Rehabilitation to Sensing in the Wild. PhD thesis, University of Luxembourg,

  4. [4]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz...

  5. [5]

    The KIMORE dataset: Kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation

    Marianna Capecci, Maria Gabriella Ceravolo, Francesco Ferracuti, Sabrina Iarlori, Andrea Monteriu, Luca Romeo, and Federica Verdini. The KIMORE dataset: Kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 27(7):1436–1448,

  6. [6]

    Quo vadis, action recognition? A new model and the Kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

  7. [7]

    SportsCap: Monocular 3D human motion capture and fine-grained understanding in challenging sports videos

    Xin Chen, Anqi Pang, Wei Yang, Yuexin Ma, Lan Xu, and Jingyi Yu. SportsCap: Monocular 3D human motion capture and fine-grained understanding in challenging sports videos. International Journal of Computer Vision, 129:2846–2864,

  8. [8]

    Learning and fusing multiple hidden substages for action quality assessment

    Li-Jia Dong, Hong-Bo Zhang, Qinghongya Shi, Qing Lei, Ji-Xiang Du, and Shangce Gao. Learning and fusing multiple hidden substages for action quality assessment. Knowledge-Based Systems, 229:107388, 2021.

  9. [9]

    End-to-end action quality assessment with action parsing transformer

    Hang Fang, Wengang Zhou, and Houqiang Li. End-to-end action quality assessment with action parsing transformer. In IEEE International Conference on Visual Communications and Image Processing, pages 1–5, 2023.

  10. [10]

    Fine-grained spatio-temporal parsing network for action quality assessment

    Kumie Gedamu, Yanli Ji, Yang Yang, Jie Shao, and Heng Tao Shen. Fine-grained spatio-temporal parsing network for action quality assessment. IEEE Transactions on Image Processing, 32:6386–6400, 2023.

  11. [11]

    Vertex AI

    Google Cloud. Vertex AI. https://cloud.google.com/vertex-ai, 2023.

  12. [12]

    Gemini 3.1 Pro model card

    Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026.

  13. [13]

    Action quality assessment using transformers

    Abhay Iyer, Mohammad Alali, Hemanth Bodala, and Sunit Vaidya. Action quality assessment using transformers. arXiv preprint arXiv:2207.12318, 2022.

  14. [14]

    Localization-assisted uncertainty score disentanglement network for action quality assessment

    Yanli Ji, Lingfeng Ye, Huili Huang, Lijing Mao, Yang Zhou, and Lingling Gao. Localization-assisted uncertainty score disentanglement network for action quality assessment. In Proceedings of the 31st ACM International Conference on Multimedia, pages 1–10. ACM, 2023.

  15. [15]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

  16. [16]

    EgoExo-Fitness: Towards egocentric and exocentric full-body action understanding

    Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, and Wei-Shi Zheng. EgoExo-Fitness: Towards egocentric and exocentric full-body action understanding. In European Conference on Computer Vision (ECCV), pages 363–382. Springer, 2024.

  17. [17]

    TechCoach: Towards technical-point-aware descriptive action coaching

    Yuan-Ming Li, An-Lan Wang, Kun-Yu Lin, Yu-Ming Tang, Ling-An Zeng, Jian-Fang Hu, and Wei-Shi Zheng. TechCoach: Towards technical-point-aware descriptive action coaching, 2025.

  18. [18]

    Action quality assessment with ignoring scene context

    Takasuke Nagai, Shoichiro Takeda, Masaaki Matsumura, Shinya Shimizu, and Susumu Yamamoto. Action quality assessment with ignoring scene context. In IEEE International Conference on Image Processing, pages 1189–1193, 2021.

  19. [19]

    Eagle-Eye: Extreme-pose action grader using detail bird’s-eye view

    Mahdiar Nekoui, Fidel Omar Tito Cruz, and Li Cheng. Eagle-Eye: Extreme-pose action grader using detail bird’s-eye view. In IEEE/CVF Winter Conference on Applications of Computer Vision, pages 394–402, 2021.

  20. [20]

    Assessing the quality of soccer shots from single-camera video with vision-language models and motion features

    Filip Noworolnik and Joanna Jaworek-Korjakowska. Assessing the quality of soccer shots from single-camera video with vision-language models and motion features. In IEEE/CVF International Conference on Computer Vision Workshops, pages 2733–2740, 2025.

  21. [21]

    Action assessment by joint relation graphs

    Jia-Hui Pan, Jibin Gao, and Wei-Shi Zheng. Action assessment by joint relation graphs. In IEEE/CVF International Conference on Computer Vision, pages 6331–6340, 2019.

  22. [22]

    Learning to score Olympic events

    Paritosh Parmar and Brendan Tran Morris. Learning to score Olympic events. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 20–28, 2017.

  23. [23]

    What and how well you performed? A multitask learning approach to action quality assessment

    Paritosh Parmar and Brendan Tran Morris. What and how well you performed? A multitask learning approach to action quality assessment. In IEEE Conference on Computer Vision and Pattern Recognition, pages 304–313, 2019.

  24. [24]

    Domain knowledge-informed self-supervised representations for workout form assessment

    Paritosh Parmar, Amol Gharat, and Helge Rhodin. Domain knowledge-informed self-supervised representations for workout form assessment. In European Conference on Computer Vision (ECCV). Springer, 2022.

  25. [25]

    Pose-guided matching based on deep learning for assessing quality of action on rehabilitation training

    Yuhang Qiu, Jiping Wang, Zhe Jin, Honghui Chen, Mingliang Zhang, and Liquan Guo. Pose-guided matching based on deep learning for assessing quality of action on rehabilitation training. Biomedical Signal Processing and Control, 72:103323, 2022.

  26. [26]

    Qwen3-VL technical report

    Qwen Team. Qwen3-VL technical report, 2025.

  27. [27]

    The Prompt Report: A systematic survey of prompting techniques

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. The prompt report: A systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608, 2024.

  28. [28]

    Can VLMs actually see and read? A survey on modality collapse in vision-language models

    Mong Yuan Sim, Wei Emma Zhang, Xiang Dai, and Biaoyan Fang. Can VLMs actually see and read? A survey on modality collapse in vision-language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24452–24470, Vienna, Austria, 2025. Association for Computational Linguistics.

  29. [29]

    A data set of human body movements for physical rehabilitation exercises

    Aleksandar Vakanski, Hyung-pil Jun, David Paul, and Russell Baker. A data set of human body movements for physical rehabilitation exercises. Data, 3(1):2, 2018.

  30. [30]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

  31. [31]

    TSA-Net: Tube self-attention network for action quality assessment

    Shunli Wang, Dingkang Yang, Peng Zhai, Chixiao Chen, and Lihua Zhang. TSA-Net: Tube self-attention network for action quality assessment. In ACM International Conference on Multimedia, pages 4902–4910, 2021.

  32. [32]

    InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025.

  33. [33]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems,

  34. [34]

    HieroAction: Hierarchically guided VLM for fine-grained action analysis

    Junhao Wu et al. HieroAction: Hierarchically guided VLM for fine-grained action analysis, 2025.

  35. [35]

    S3D: Stacking segmental P3D for action quality assessment

    Xiang Xiang, Ye Tian, Austin Reiter, Gregory D. Hager, and Trac D. Tran. S3D: Stacking segmental P3D for action quality assessment. In IEEE International Conference on Image Processing, pages 928–932, 2018.

  36. [36]

    LLM-FMS: A fine-grained dataset for functional movement screen action quality assessment

    Qingjun Xing, Xing Xing, Peng Guo, Zhen Tang, and Yue Shen. LLM-FMS: A fine-grained dataset for functional movement screen action quality assessment. PLOS ONE, 20(3):e0313707, 2025.

  37. [37]

    Likert scoring with grade decoupling for long-term action assessment

    Angchi Xu, Ling-An Zeng, and Wei-Shi Zheng. Likert scoring with grade decoupling for long-term action assessment. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3232–3241, 2022.

  38. [38]

    SAM 3D Body: Robust full-body human mesh recovery

    Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollár, and Kris Kitani. SAM 3D Body: Robust full-body human mesh recovery. arXiv preprint arXiv:2602.15989,

  39. [39]

    A decade of action quality assessment: Largest systematic survey of trends, challenges, and future directions

    Hao Yin, Paritosh Parmar, Daoliang Xu, Yang Zhang, Tianyou Zheng, and Weiwei Fu. A decade of action quality assessment: Largest systematic survey of trends, challenges, and future directions. International Journal of Computer Vision, 134:73, 2026.

  40. [40]

    Hybrid dynamic-static context-aware attention network for action assessment in long videos

    Ling-An Zeng, Fa-Ting Hong, Wei-Shi Zheng, Qi-Zhi Yu, Wei Zeng, Yao-Wei Wang, and Jian-Huang Lai. Hybrid dynamic-static context-aware attention network for action assessment in long videos. In ACM International Conference on Multimedia, pages 2526–2534, 2020.

  41. [41]

    Narrative action evaluation with prompt-guided multimodal interaction

    Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, and Yansong Tang. Narrative action evaluation with prompt-guided multimodal interaction. In IEEE Conference on Computer Vision and Pattern Recognition, pages 18430–18439, 2024.

  42.–79. [42]–[95]

    Entries [42] through [95] are not bibliographic references: they are fragments of the paper's appendix prompt templates that the reference extractor misparsed. The recoverable content describes the template families reproduced in the paper's appendix: image- and video-based analysis prompts whose answers must be a valid JSON object keyed by rule ID or error name; visual grounding prompts (e.g. Fitness-AQA, EgoExo-Fitness) that verify a keypoint statement with a bare "True" or "False"; Grade of Execution prompts (FineFS) that output a single numeric GOE from -5 (very poor) to 5 (exceptional); dive execution-score prompts (MTL-AQA) that output a single numeric value from 0 (very poor) to 10 (perfect); structured reasoning prompts that require the tags <look>, <decompose>, <analyse>, <assess>, and <output> in exact order, with <output> holding only the final answer; and positive/negative guideline prompts (e.g. LLM-FMS) that substitute best-form or worst-form guidelines into a shared structure.
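The templates above constrain responses tightly: fixed tag order, bare numeric scores within a stated range, or an exact "True"/"False". A minimal sketch of a validator for such outputs; the `kind` switch and function name are illustrative assumptions, not the paper's code:

```python
import json
import re

OUTPUT_RE = re.compile(r"<output>(.*?)</output>", re.DOTALL)

def parse_structured_output(text: str, kind: str):
    # Pull the final <output> tag mandated by the structured reasoning
    # prompts and validate it against the expected response format.
    # kind: 'json' (rule/error answers), 'bool' ("True"/"False"),
    # 'goe' (FineFS, -5..5), or 'dive' (MTL-AQA, 0..10).
    match = OUTPUT_RE.search(text)
    if match is None:
        raise ValueError("response is missing the required <output> tag")
    payload = match.group(1).strip()
    if kind == "json":
        return json.loads(payload)
    if kind == "bool":
        if payload not in ("True", "False"):
            raise ValueError(f"expected 'True' or 'False', got {payload!r}")
        return payload == "True"
    value = float(payload)
    low, high = (-5.0, 5.0) if kind == "goe" else (0.0, 10.0)
    if not low <= value <= high:
        raise ValueError(f"score {value} outside [{low}, {high}]")
    return value
```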