pith. machine review for the scientific record.

arxiv: 2604.10466 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords motion editing · skill learning · expert videos · masked modeling · unpaired training · video analysis · motor skills · computer vision

The pith

ExpertEdit edits novice motions toward higher skill by learning an expert motion prior from unpaired expert videos alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to generate personalized visual feedback for motor skill learning by automatically refining a person's own performance to appear more expert. The model is trained exclusively on expert demonstration videos with a masked language modeling objective that learns to replace masked motion segments with expert-level versions. At test time, the system identifies and masks skill-critical portions of a novice video, then projects them into the learned expert manifold. This produces localized refinements without any paired novice-expert examples or manual edit instructions. The approach is shown to surpass supervised motion-editing baselines on realism and expert-quality metrics across eight techniques in three sports.

Core claim

ExpertEdit trains an expert motion prior on unpaired expert videos by masking motion spans and training the model to infill them with expert-level refinements. At inference, novice motion is masked at skill-critical moments and the masked segments are projected into this prior, yielding localized skill improvements without paired supervision or explicit guidance.

What carries the argument

The masked language modeling objective that infills masked motion spans with expert-level refinements to form the learned expert motion prior.
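To make the mechanism concrete, the sketch below shows one way such an objective could be implemented, assuming a discrete pose-token codebook, a bi-directional transformer encoder, and a single contiguous masked span per clip standing in for the kinematically chosen key action phase. Only the name MotionInfiller comes from the paper's Figure 2; the vocabulary size, architecture, and masking policy are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

VOCAB = 512          # assumed size of a discrete pose-token codebook
MASK_ID = VOCAB      # extra id reserved for the mask token

class MotionInfiller(nn.Module):
    """Bi-directional transformer that predicts pose tokens at masked positions
    (name from the paper's Figure 2; architecture details are assumed)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=6, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(VOCAB + 1, d_model)   # +1 for the mask token
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                        # tokens: (B, T) int64
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok(tokens) + self.pos(pos)[None]
        return self.head(self.encoder(h))             # (B, T, VOCAB) logits

def mask_one_span(tokens, span_frac=0.3):
    """Mask one contiguous span per clip; in the paper the span is the key
    action phase chosen by kinematic criteria, here a random stand-in."""
    masked = tokens.clone()
    B, T = tokens.shape
    span = max(1, int(span_frac * T))
    starts = torch.randint(0, T - span + 1, (B,))
    is_target = torch.zeros_like(tokens, dtype=torch.bool)
    for b in range(B):
        masked[b, starts[b]:starts[b] + span] = MASK_ID
        is_target[b, starts[b]:starts[b] + span] = True
    return masked, is_target

model = MotionInfiller()
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
xent = nn.CrossEntropyLoss()

def training_step(expert_tokens):          # (B, T) tokens from expert clips only
    masked, is_target = mask_one_span(expert_tokens)
    logits = model(masked)
    loss = xent(logits[is_target], expert_tokens[is_target])
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```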

If this is right

  • The method outperforms supervised motion editing baselines on multiple metrics of realism and expert quality across eight diverse techniques and three sports.
  • Skill improvements occur locally at masked critical moments rather than globally across the entire sequence.
  • Training requires only unpaired expert videos, removing the need to collect or align novice-expert pairs.
  • Inference needs no manual edit guidance or paired examples, enabling fully automatic application to new videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same masking-and-projection strategy could be tested on other sequential data such as dance sequences or rehabilitation exercises where expert demonstrations exist but paired data do not.
  • Real-time versions of the pipeline might support immediate visual feedback loops in training apps if the masking step can be made fast enough.
  • Performance across different body proportions or camera viewpoints would be a natural next test to determine how far the expert manifold generalizes beyond the training distributions.

Load-bearing premise

Masking novice motion at skill-critical moments and projecting it into the learned expert manifold produces localized meaningful skill improvements without paired supervision or explicit edit guidance.
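Operationally, the premise reduces to a single inference step. The sketch below assumes the trained infiller from the earlier sketch, a tokenized novice clip, single-pass argmax decoding, and a hypothetical find_key_phase helper standing in for the paper's task-specific kinematic criteria; none of these implementation choices are confirmed by the text.

```python
import torch

@torch.no_grad()
def edit_novice(model, novice_tokens, key_span, mask_id=512):
    """Mask the skill-critical span of a novice token sequence and infill it
    from the learned expert prior; all other frames are returned unchanged."""
    start, end = key_span                          # frame range of the key action phase
    masked = novice_tokens.clone()
    masked[:, start:end] = mask_id                 # same mask id used during training
    logits = model(masked)                         # bidirectional context sees the rest of the clip
    infilled = logits.argmax(dim=-1)               # one plausible projection onto the expert manifold
    edited = novice_tokens.clone()
    edited[:, start:end] = infilled[:, start:end]  # the edit stays local to the masked span
    return edited

# hypothetical usage:
# span = find_key_phase(novice_motion)             # kinematic criterion, e.g. wind-up to release
# edited_tokens = edit_novice(model, novice_tokens, span)
```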

What would settle it

Quantitative evaluation on the eight techniques from Ego-Exo4D and Karate Kyokushin showing no gains in motion realism or expert quality metrics over state-of-the-art supervised motion editing methods would falsify the central claim.
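Read as a procedure, that criterion is a per-technique, per-metric comparison against the strongest supervised baseline. The sketch below spells it out; the data layout and any scores plugged into it are hypothetical, not the paper's results.

```python
def central_claim_falsified(ours, baselines, higher_is_better):
    """ours / each baseline: {technique: {metric: score}} over the eight techniques.
    Returns True only if ExpertEdit shows no gain on any (technique, metric) pair."""
    for tech, metrics in ours.items():
        for metric, score in metrics.items():
            best = (max if higher_is_better[metric] else min)(
                b[tech][metric] for b in baselines
            )
            improved = score > best if higher_is_better[metric] else score < best
            if improved:
                return False   # any improvement over the best baseline keeps the claim alive
    return True                # no gains anywhere would falsify the central claim
```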

Figures

Figures reproduced from arXiv: 2604.10466 by Arjun Somayazulu and Kristen Grauman.

Figure 1. Skill-driven motion editing. Given a 3D motion sequence extracted from a novice activity video, ExpertEdit produces personalized skill edits by refining poses within regions where skill differences are pronounced. They are tweaked to exhibit expert-like precision and form, while preserving the original execution’s motion path and body orientation, as well as its source poses at all non skill-critical momen…

Figure 2. ExpertEdit approach. We tokenize expert pose motion sequences and mask the key action phase as determined by task-specific kinematic criteria. We train a bi-directional transformer, MotionInfiller, to predict the expert pose tokens at the masked positions. At inference, we mask skill-critical action phases in a novice motion (see Sec. 4) and infill these regions with expert-like motion.

Figure 3. ExpertEdit sequence visualization: We show novice source pose (blue) and edited pose (orange) at several frames for all techniques. ExpertEdit makes subtle pose refinements that improve form at skill-critical action moments, including raising the knee on the shooting hand-side higher during layups (Mikan, reverse), extending legs further on kicks (spin back, roundhouse), moving the shooting hand under the …

Figure 1 (supplementary). ExpertEdit performance as a function of training data.
Original abstract

Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one's own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person's motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data -- rare and expensive to curate for skill-driven tasks -- and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: https://vision.cs.utexas.edu/projects/expert_edit/ .

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ExpertEdit, a framework for skill-driven motion editing trained solely on unpaired expert videos. It learns an expert motion prior via masked language modeling that infills masked spans with expert refinements. At inference, novice motions are masked at skill-critical moments and projected into the learned expert manifold to produce localized improvements. The method is evaluated on eight techniques across three sports from Ego-Exo4D and Karate Kyokushin, claiming to outperform state-of-the-art supervised motion editing baselines on metrics of motion realism and expert quality without requiring paired data or explicit edit guidance.

Significance. If the central claims hold after clarification, the work would be significant for computer vision applications in personalized feedback for sports and rehabilitation. Learning skill-aware priors from unpaired experts via masked modeling avoids the data collection burden of paired supervision and could enable scalable motion editing; the reported outperformance on diverse techniques provides a concrete benchmark for future unpaired methods.

major comments (2)
  1. [§4] §4 (Inference procedure): The mechanism for automatically identifying 'skill-critical moments' in novice input is unspecified. The abstract claims masking occurs 'at skill-critical moments' and produces improvements 'without ... manual edit guidance,' but if detection relies on heuristics, a separate model, or any form of selection, this constitutes explicit guidance. This is load-bearing for the no-supervision claim and for fair comparison to supervised baselines that receive explicit paired signals.
  2. [§5] §5 (Experiments and results): The outperformance claims on realism and expert quality metrics across eight techniques lack reported statistical significance tests, details on baseline re-implementations or hyperparameter matching, and confirmation that metric choices were not post-hoc. Without these, the superiority over supervised methods cannot be verified as robust, directly affecting the central empirical claim.
minor comments (2)
  1. [Abstract] Abstract: The reference to 'psychological studies' on observing near-perfect versions of one's performance should include specific citations for traceability.
  2. [Method] Method section: Notation for motion representations (e.g., how spans are masked and projected) could be introduced with a clear equation or diagram earlier to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the inference procedure and experimental reporting. We address each major comment below and outline revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§4] §4 (Inference procedure): The mechanism for automatically identifying 'skill-critical moments' in novice input is unspecified. The abstract claims masking occurs 'at skill-critical moments' and produces improvements 'without ... manual edit guidance,' but if detection relies on heuristics, a separate model, or any form of selection, this constitutes explicit guidance. This is load-bearing for the no-supervision claim and for fair comparison to supervised baselines that receive explicit paired signals.

    Authors: We appreciate the referee's emphasis on this point, as it directly relates to our central claim of operating without manual edit guidance. Section 4 of the manuscript specifies that masking at inference is performed automatically on the novice input motion alone, using only information derived from the input sequence itself and without any user-provided masks, paired supervision, or external signals. This is distinct from the explicit edit guidance supplied to the supervised baselines. We acknowledge that the current description could be more explicit about the precise automatic identification process. In the revised version, we will expand §4 with additional algorithmic details and pseudocode to demonstrate that the procedure requires no manual intervention or paired data, thereby reinforcing rather than weakening the no-supervision claim. revision: partial

  2. Referee: [§5] §5 (Experiments and results): The outperformance claims on realism and expert quality metrics across eight techniques lack reported statistical significance tests, details on baseline re-implementations or hyperparameter matching, and confirmation that metric choices were not post-hoc. Without these, the superiority over supervised methods cannot be verified as robust, directly affecting the central empirical claim.

    Authors: We agree that these elements are essential for verifying the robustness of the empirical results. In the revised manuscript, we will add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) for all metric comparisons across the eight techniques. We will also include an expanded methods or appendix section providing full details on baseline re-implementations, including hyperparameter choices and how they were aligned with the original publications. Finally, we will add an explicit statement confirming that the evaluation metrics were selected a priori based on prior motion editing literature and not chosen post-hoc. These changes will be incorporated without altering the reported numerical results. revision: yes
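The promised tests correspond to standard paired procedures over per-clip scores. The sketch below assumes aligned per-clip metric values for ExpertEdit and a baseline on identical source clips; the variable names, data layout, and significance threshold are illustrative, not drawn from the paper.

```python
from scipy.stats import ttest_rel, wilcoxon

def paired_significance(ours_per_clip, baseline_per_clip, alpha=0.05):
    """Paired tests over per-clip scores of one metric for one technique;
    the two lists must be aligned so index i refers to the same source clip."""
    t = ttest_rel(ours_per_clip, baseline_per_clip)    # paired t-test
    w = wilcoxon(ours_per_clip, baseline_per_clip)     # Wilcoxon signed-rank test
    return {
        "paired_t_p": t.pvalue,
        "wilcoxon_p": w.pvalue,
        "significant": w.pvalue < alpha,
    }

# hypothetical usage for one (technique, metric) comparison:
# report = paired_significance(realism_ours["roundhouse"], realism_baseline["roundhouse"])
```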

Circularity Check

0 steps flagged

No circularity; derivation uses standard masked modeling on unpaired data with independent external evaluation

full rationale

The paper's core pipeline—masked language modeling on unpaired expert videos to learn a motion prior, followed by inference-time masking of novice inputs and projection into the prior—is described without any equations or steps that reduce by construction to fitted parameters, self-citations, or renamed inputs. Training objective and inference procedure are distinct, and performance claims rest on comparisons to supervised baselines on external datasets (Ego-Exo4D, Karate Kyokushin), which are not forced by the training process itself. No load-bearing self-citation chains or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that expert motions occupy a learnable manifold separable from novice motions via masking and infilling, plus the premise that skill-critical moments can be identified and edited without explicit supervision.

axioms (2)
  • domain assumption Expert motions form a manifold that masked language modeling on unpaired videos can capture as a prior for infilling refinements.
    This is the central modeling choice enabling training without paired data.
  • domain assumption Skill-critical moments in novice motion can be masked and projected into the expert manifold to yield localized improvements.
    This is the key inference assumption stated in the abstract.

pith-pipeline@v0.9.0 · 5493 in / 1398 out tokens · 52775 ms · 2026-05-10T15:08:48.666827+00:00 · methodology

discussion (0)

