pith. machine review for the scientific record.

arxiv: 2604.10466 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords motion editing · skill learning · expert videos · masked modeling · unpaired training · video analysis · motor skills · computer vision

The pith

ExpertEdit edits novice motions toward higher skill by learning an expert motion prior from unpaired expert videos alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to generate personalized visual feedback for motor skill learning by automatically refining a person's own performance to appear more expert. The model is trained exclusively on expert demonstration videos with a masked language modeling objective that learns to replace masked motion segments with expert-level versions. At test time, the system identifies and masks skill-critical portions of a novice video, then projects them into the learned expert manifold. This produces localized refinements without any paired novice-expert examples or manual edit instructions. The approach is shown to surpass supervised motion-editing baselines on realism and expert-quality metrics across eight techniques in three sports.

Core claim

ExpertEdit trains an expert motion prior on unpaired expert videos by masking motion spans and training the model to infill them with expert-level refinements. At inference, novice motion is masked at skill-critical moments and the masked segments are projected into this prior, yielding localized skill improvements without paired supervision or explicit guidance.

What carries the argument

The masked language modeling objective that infills masked motion spans with expert-level refinements to form the learned expert motion prior.
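To make the mechanism concrete, the sketch below shows one way such an objective could be implemented, assuming a discrete pose-token codebook, a bi-directional transformer encoder, and a single contiguous masked span per clip standing in for the kinematically chosen key action phase. Only the name MotionInfiller comes from the paper's Figure 2; the vocabulary size, architecture, and masking policy are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

VOCAB = 512          # assumed size of a discrete pose-token codebook
MASK_ID = VOCAB      # extra id reserved for the mask token

class MotionInfiller(nn.Module):
    """Bi-directional transformer that predicts pose tokens at masked positions
    (name from the paper's Figure 2; architecture details are assumed)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=6, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(VOCAB + 1, d_model)   # +1 for the mask token
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                        # tokens: (B, T) int64
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok(tokens) + self.pos(pos)[None]
        return self.head(self.encoder(h))             # (B, T, VOCAB) logits

def mask_one_span(tokens, span_frac=0.3):
    """Mask one contiguous span per clip; in the paper the span is the key
    action phase chosen by kinematic criteria, here a random stand-in."""
    masked = tokens.clone()
    B, T = tokens.shape
    span = max(1, int(span_frac * T))
    starts = torch.randint(0, T - span + 1, (B,))
    is_target = torch.zeros_like(tokens, dtype=torch.bool)
    for b in range(B):
        masked[b, starts[b]:starts[b] + span] = MASK_ID
        is_target[b, starts[b]:starts[b] + span] = True
    return masked, is_target

model = MotionInfiller()
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
xent = nn.CrossEntropyLoss()

def training_step(expert_tokens):          # (B, T) tokens from expert clips only
    masked, is_target = mask_one_span(expert_tokens)
    logits = model(masked)
    loss = xent(logits[is_target], expert_tokens[is_target])
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```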

If this is right

  • The method outperforms supervised motion editing baselines on multiple metrics of realism and expert quality across eight diverse techniques and three sports.
  • Skill improvements occur locally at masked critical moments rather than globally across the entire sequence.
  • Training requires only unpaired expert videos, removing the need to collect or align novice-expert pairs.
  • Inference needs no manual edit guidance or paired examples, enabling fully automatic application to new videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same masking-and-projection strategy could be tested on other sequential data such as dance sequences or rehabilitation exercises where expert demonstrations exist but paired data do not.
  • Real-time versions of the pipeline might support immediate visual feedback loops in training apps if the masking step can be made fast enough.
  • Performance across different body proportions or camera viewpoints would be a natural next test to determine how far the expert manifold generalizes beyond the training distributions.

Load-bearing premise

Masking novice motion at skill-critical moments and projecting it into the learned expert manifold produces localized meaningful skill improvements without paired supervision or explicit edit guidance.
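Operationally, the premise reduces to a single inference step. The sketch below assumes the trained infiller from the earlier sketch, a tokenized novice clip, single-pass argmax decoding, and a hypothetical find_key_phase helper standing in for the paper's task-specific kinematic criteria; none of these implementation choices are confirmed by the text.

```python
import torch

@torch.no_grad()
def edit_novice(model, novice_tokens, key_span, mask_id=512):
    """Mask the skill-critical span of a novice token sequence and infill it
    from the learned expert prior; all other frames are returned unchanged."""
    start, end = key_span                          # frame range of the key action phase
    masked = novice_tokens.clone()
    masked[:, start:end] = mask_id                 # same mask id used during training
    logits = model(masked)                         # bidirectional context sees the rest of the clip
    infilled = logits.argmax(dim=-1)               # one plausible projection onto the expert manifold
    edited = novice_tokens.clone()
    edited[:, start:end] = infilled[:, start:end]  # the edit stays local to the masked span
    return edited

# hypothetical usage:
# span = find_key_phase(novice_motion)             # kinematic criterion, e.g. wind-up to release
# edited_tokens = edit_novice(model, novice_tokens, span)
```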

What would settle it

Quantitative evaluation on the eight techniques from Ego-Exo4D and Karate Kyokushin showing no gains in motion realism or expert quality metrics over state-of-the-art supervised motion editing methods would falsify the central claim.
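Read as a procedure, that criterion is a per-technique, per-metric comparison against the strongest supervised baseline. The sketch below spells it out; the data layout and any scores plugged into it are hypothetical, not the paper's results.

```python
def central_claim_falsified(ours, baselines, higher_is_better):
    """ours / each baseline: {technique: {metric: score}} over the eight techniques.
    Returns True only if ExpertEdit shows no gain on any (technique, metric) pair."""
    for tech, metrics in ours.items():
        for metric, score in metrics.items():
            best = (max if higher_is_better[metric] else min)(
                b[tech][metric] for b in baselines
            )
            improved = score > best if higher_is_better[metric] else score < best
            if improved:
                return False   # any improvement over the best baseline keeps the claim alive
    return True                # no gains anywhere would falsify the central claim
```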

Figures

Figures reproduced from arXiv: 2604.10466 by Arjun Somayazulu and Kristen Grauman.

Figure 1. Skill-driven motion editing. Given a 3D motion sequence extracted from a novice activity video, ExpertEdit produces personalized skill edits by refining poses within regions where skill differences are pronounced. They are tweaked to exhibit expert-like precision and form, while preserving the original execution’s motion path and body orientation, as well as its source poses at all non skill-critical momen…

Figure 2. ExpertEdit approach. We tokenize expert pose motion sequences and mask the key action phase as determined by task-specific kinematic criteria. We train a bi-directional transformer, MotionInfiller, to predict the expert pose tokens at the masked positions. At inference, we mask skill-critical action phases in a novice motion (see Sec. 4) and infill these regions with expert-like motion.

Figure 3. ExpertEdit sequence visualization: We show novice source pose (blue) and edited pose (orange) at several frames for all techniques. ExpertEdit makes subtle pose refinements that improve form at skill-critical action moments, including raising the knee on the shooting hand-side higher during layups (Mikan, reverse), extending legs further on kicks (spin back, roundhouse), moving the shooting hand under the …

Figure 1 (supplementary). ExpertEdit performance as a function of training data.
Original abstract

Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one's own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person's motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data -- rare and expensive to curate for skill-driven tasks -- and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: https://vision.cs.utexas.edu/projects/expert_edit/ .

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ExpertEdit, a framework for skill-driven motion editing trained solely on unpaired expert videos. It learns an expert motion prior via masked language modeling that infills masked spans with expert refinements. At inference, novice motions are masked at skill-critical moments and projected into the learned expert manifold to produce localized improvements. The method is evaluated on eight techniques across three sports from Ego-Exo4D and Karate Kyokushin, claiming to outperform state-of-the-art supervised motion editing baselines on metrics of motion realism and expert quality without requiring paired data or explicit edit guidance.

Significance. If the central claims hold after clarification, the work would be significant for computer vision applications in personalized feedback for sports and rehabilitation. Learning skill-aware priors from unpaired experts via masked modeling avoids the data collection burden of paired supervision and could enable scalable motion editing; the reported outperformance on diverse techniques provides a concrete benchmark for future unpaired methods.

major comments (2)
  1. [§4] §4 (Inference procedure): The mechanism for automatically identifying 'skill-critical moments' in novice input is unspecified. The abstract claims masking occurs 'at skill-critical moments' and produces improvements 'without ... manual edit guidance,' but if detection relies on heuristics, a separate model, or any form of selection, this constitutes explicit guidance. This is load-bearing for the no-supervision claim and for fair comparison to supervised baselines that receive explicit paired signals.
  2. [§5] §5 (Experiments and results): The outperformance claims on realism and expert quality metrics across eight techniques lack reported statistical significance tests, details on baseline re-implementations or hyperparameter matching, and confirmation that metric choices were not post-hoc. Without these, the superiority over supervised methods cannot be verified as robust, directly affecting the central empirical claim.
minor comments (2)
  1. [Abstract] Abstract: The reference to 'psychological studies' on observing near-perfect versions of one's performance should include specific citations for traceability.
  2. [Method] Method section: Notation for motion representations (e.g., how spans are masked and projected) could be introduced with a clear equation or diagram earlier to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the inference procedure and experimental reporting. We address each major comment below and outline revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§4] §4 (Inference procedure): The mechanism for automatically identifying 'skill-critical moments' in novice input is unspecified. The abstract claims masking occurs 'at skill-critical moments' and produces improvements 'without ... manual edit guidance,' but if detection relies on heuristics, a separate model, or any form of selection, this constitutes explicit guidance. This is load-bearing for the no-supervision claim and for fair comparison to supervised baselines that receive explicit paired signals.

    Authors: We appreciate the referee's emphasis on this point, as it directly relates to our central claim of operating without manual edit guidance. Section 4 of the manuscript specifies that masking at inference is performed automatically on the novice input motion alone, using only information derived from the input sequence itself and without any user-provided masks, paired supervision, or external signals. This is distinct from the explicit edit guidance supplied to the supervised baselines. We acknowledge that the current description could be more explicit about the precise automatic identification process. In the revised version, we will expand §4 with additional algorithmic details and pseudocode to demonstrate that the procedure requires no manual intervention or paired data, thereby reinforcing rather than weakening the no-supervision claim. revision: partial

  2. Referee: [§5] §5 (Experiments and results): The outperformance claims on realism and expert quality metrics across eight techniques lack reported statistical significance tests, details on baseline re-implementations or hyperparameter matching, and confirmation that metric choices were not post-hoc. Without these, the superiority over supervised methods cannot be verified as robust, directly affecting the central empirical claim.

    Authors: We agree that these elements are essential for verifying the robustness of the empirical results. In the revised manuscript, we will add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) for all metric comparisons across the eight techniques. We will also include an expanded methods or appendix section providing full details on baseline re-implementations, including hyperparameter choices and how they were aligned with the original publications. Finally, we will add an explicit statement confirming that the evaluation metrics were selected a priori based on prior motion editing literature and not chosen post-hoc. These changes will be incorporated without altering the reported numerical results. revision: yes
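The promised tests correspond to standard paired procedures over per-clip scores. The sketch below assumes aligned per-clip metric values for ExpertEdit and a baseline on identical source clips; the variable names, data layout, and significance threshold are illustrative, not drawn from the paper.

```python
from scipy.stats import ttest_rel, wilcoxon

def paired_significance(ours_per_clip, baseline_per_clip, alpha=0.05):
    """Paired tests over per-clip scores of one metric for one technique;
    the two lists must be aligned so index i refers to the same source clip."""
    t = ttest_rel(ours_per_clip, baseline_per_clip)    # paired t-test
    w = wilcoxon(ours_per_clip, baseline_per_clip)     # Wilcoxon signed-rank test
    return {
        "paired_t_p": t.pvalue,
        "wilcoxon_p": w.pvalue,
        "significant": w.pvalue < alpha,
    }

# hypothetical usage for one (technique, metric) comparison:
# report = paired_significance(realism_ours["roundhouse"], realism_baseline["roundhouse"])
```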

Circularity Check

0 steps flagged

No circularity; derivation uses standard masked modeling on unpaired data with independent external evaluation

full rationale

The paper's core pipeline—masked language modeling on unpaired expert videos to learn a motion prior, followed by inference-time masking of novice inputs and projection into the prior—is described without any equations or steps that reduce by construction to fitted parameters, self-citations, or renamed inputs. Training objective and inference procedure are distinct, and performance claims rest on comparisons to supervised baselines on external datasets (Ego-Exo4D, Karate Kyokushin), which are not forced by the training process itself. No load-bearing self-citation chains or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that expert motions occupy a learnable manifold separable from novice motions via masking and infilling, plus the premise that skill-critical moments can be identified and edited without explicit supervision.

axioms (2)
  • domain assumption Expert motions form a manifold that masked language modeling on unpaired videos can capture as a prior for infilling refinements.
    This is the central modeling choice enabling training without paired data.
  • domain assumption Skill-critical moments in novice motion can be masked and projected into the expert manifold to yield localized improvements.
    This is the key inference assumption stated in the abstract.

pith-pipeline@v0.9.0 · 5493 in / 1398 out tokens · 52775 ms · 2026-05-10T15:08:48.666827+00:00 · methodology

discussion (0)

