pith. machine review for the scientific record.

arxiv: 2605.01409 · v1 · submitted 2026-05-02 · 💻 cs.IR · cs.CV · cs.MM

Recognition: unknown

Interactive Multi-Turn Retrieval for Health Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 17:49 UTC · model grok-4.3

classification 💻 cs.IR · cs.CV · cs.MM

keywords health video retrieval · multi-turn dialogue · interactive retrieval · video search · medical education · dialogue-aware ranking · two-stage retrieval

The pith

Multi-turn dialogue refines vague health video queries into clinically precise retrievals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Searches for health training and education videos often begin with imprecise user needs that become specific only after follow-ups on posture, equipment, contraindications, or patient conditions. The paper builds MHVRC, a corpus of multi-turn health video queries generated from video descriptions and refinements, and introduces DATR, a two-stage system that first coarsely retrieves candidates, then re-ranks them using fused dialogue context. Experiments on the corpus show gains over single-turn baselines, and user studies find that the refined queries align better with the fine-grained procedural content of the videos.

Core claim

The authors construct MHVRC by pairing video-grounded descriptions from VideoChat-Flash with query refinements from DeepSeek, then demonstrate that DATR's coarse dual-encoder stage followed by multi-turn query fusion and cross-encoder re-ranking produces ranked lists that better match the evolving information needs in health video retrieval.
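As a concrete illustration, here is a minimal Python sketch of the kind of description-then-refinement loop the corpus construction implies. The functions `describe_video` and `refine_query` are hypothetical stand-ins for VideoChat-Flash and DeepSeek calls; the paper's actual prompts, turn counts, and any filtering steps are not reproduced here.

```python
# Minimal sketch of an MHVRC-style construction loop. All function bodies are
# hypothetical placeholders; only the control flow mirrors the described pipeline.
from dataclasses import dataclass, field


@dataclass
class DialogueExample:
    video_id: str
    turns: list[str] = field(default_factory=list)  # progressively refined queries


def describe_video(video_path: str) -> str:
    """Hypothetical: video-grounded procedural description (VideoChat-Flash)."""
    raise NotImplementedError


def refine_query(description: str, history: list[str]) -> str:
    """Hypothetical: next refinement turn conditioned on the description (DeepSeek)."""
    raise NotImplementedError


def build_example(video_id: str, video_path: str, n_turns: int = 3) -> DialogueExample:
    description = describe_video(video_path)
    example = DialogueExample(video_id=video_id)
    for _ in range(n_turns):
        # Each turn is meant to narrow the query toward fine-grained procedural
        # constraints (posture, equipment, contraindications, patient condition).
        example.turns.append(refine_query(description, example.turns))
    return example
```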

What carries the argument

DATR (Dialogue-Aware Two-Stage Retrieval), which performs fast coarse retrieval via a CLIP-style dual encoder on sparsely sampled frames, then re-ranks the top candidates by fusing the full multi-turn query history in a lightweight cross-encoder.
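To make the two-stage shape concrete, the sketch below assumes a hypothetical `embed_text` encoder, a precomputed matrix of video embeddings built offline from sparsely sampled frames, and a hypothetical `cross_encoder_score`; concatenating turns with a separator token is one simple fusion choice, not necessarily the paper's.

```python
# Minimal sketch of a dialogue-aware two-stage retriever in the spirit of DATR.
# Encoder internals are hypothetical; only the coarse-then-rerank flow is shown.
import numpy as np


def embed_text(text: str) -> np.ndarray:
    """Hypothetical CLIP-style text encoder returning a unit-norm vector."""
    raise NotImplementedError


def cross_encoder_score(query: str, video_caption: str) -> float:
    """Hypothetical lightweight cross-encoder scoring one (query, video) pair."""
    raise NotImplementedError


def fuse_dialogue(turns: list[str]) -> str:
    # One simple fusion choice: concatenate all turns with a separator so the
    # re-ranker sees the full refinement history, not only the latest query.
    return " [SEP] ".join(turns)


def retrieve(turns: list[str], video_index: np.ndarray, captions: list[str],
             k_coarse: int = 100, k_final: int = 10) -> list[int]:
    # Stage I: fast coarse retrieval. `video_index` holds one precomputed
    # embedding per video (built offline from sparsely sampled frames);
    # dot products against the latest query give coarse scores.
    q = embed_text(turns[-1])
    coarse = video_index @ q
    candidates = np.argsort(-coarse)[:k_coarse]
    # Stage II: re-rank the shortlist with the fused multi-turn query.
    fused = fuse_dialogue(turns)
    reranked = sorted(candidates.tolist(),
                      key=lambda i: cross_encoder_score(fused, captions[i]),
                      reverse=True)
    return reranked[:k_final]
```

The point of the split is cost: the dual encoder scores every video against only the latest query with cheap dot products, while the more expensive dialogue-aware cross-encoder touches only the short candidate list.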

If this is right

  • Health video systems can shift from one-shot queries to handling constraints such as hand placement, equipment availability, and patient-specific cautions that emerge only in follow-up turns.
  • Automated construction of multi-turn query corpora enables scalable evaluation of interactive retrieval without requiring exhaustive manual annotation.
  • Fine-grained procedural details in instructional videos become retrievable once dialogue context is fused rather than relying on an initial broad query alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same two-stage coarse-then-fusion pattern could be tested on procedural videos outside health, such as exercise routines or equipment maintenance, where initial queries are similarly underspecified.
  • Live deployment logs from actual users might expose refinement patterns, such as clarification on contraindications, that differ from the AI-generated sequences used to build MHVRC.
  • Pairing DATR with a conversational interface would allow testing whether users naturally produce the multi-turn refinements the framework is designed to exploit.

Load-bearing premise

The multi-turn queries created by VideoChat-Flash and DeepSeek sufficiently stand in for the way real users would progressively refine their searches for health videos.

What would settle it

A controlled study that collects live multi-turn interactions from health professionals or patients using the system and checks whether the videos retrieved match those returned under the AI-generated query sequences in MHVRC.
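If such a study were run, the comparison itself is straightforward to operationalize. Below is a minimal sketch of one possible agreement measure, assuming paired ranked lists of video IDs retrieved under live versus synthetic query sequences; the names are illustrative, not from the paper.

```python
# Top-k agreement between two retrieval runs over the same video collection.
def overlap_at_k(run_a: list[str], run_b: list[str], k: int = 10) -> float:
    """Jaccard overlap of the top-k video IDs from the two runs."""
    a, b = set(run_a[:k]), set(run_b[:k])
    return len(a & b) / len(a | b) if (a or b) else 1.0


def mean_agreement(paired_runs: list[tuple[list[str], list[str]]],
                   k: int = 10) -> float:
    """Average top-k agreement across query sequences paired live vs. synthetic."""
    return sum(overlap_at_k(a, b, k) for a, b in paired_runs) / len(paired_runs)
```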

Figures

Figures reproduced from arXiv: 2605.01409 by Baoming Zhang, Chengzheng Wu, Kaixing Yang, Ke Qiu, Ruiyu Mao, Xulong Tang.

Figure 1: Single-turn versus multi-turn health video retrieval. Single-turn search often matches broad topic words, while interactive … (view at source ↗)
Figure 2: Construction pipeline of MHVRC. VideoChat-Flash produces procedural descriptions from health videos, while DeepSeek … (view at source ↗)
Figure 3: Overview of DATR. Stage I retrieves candidates with … (view at source ↗)
Figure 4: Architecture of the two-stage retrieval process. The wide pipeline is placed as a double-column figure to preserve readability. (view at source ↗)
Figure 5: Qualitative retrieval examples from DATR. Retrieved videos match refined health queries involving exercise, rehabilitation, and … (view at source ↗)
Original abstract

The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries better capture fine-grained procedural semantics than single-turn annotations. The work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.
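One detail worth pinning down from the abstract: "sparse frame sampling" in Stage I can be as simple as uniformly spaced frame indices. A minimal sketch under that assumption follows; the paper's actual sampling policy and frame budget are not specified here.

```python
import numpy as np


def sparse_frame_indices(num_frames: int, k: int = 8) -> np.ndarray:
    """One simple sparse-sampling choice: k uniformly spaced frame indices."""
    return np.linspace(0, num_frames - 1, num=k).round().astype(int)
```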

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces interactive multi-turn semantic retrieval for health videos, constructs the MHVRC corpus by combining video-grounded descriptions from VideoChat-Flash with query refinements from DeepSeek, proposes the DATR Dialogue-Aware Two-Stage Retrieval framework (coarse CLIP-style retrieval followed by multi-turn query fusion and cross-encoder re-ranking), and reports consistent gains over text-video baselines on MHVRC together with user studies claiming that refined multi-turn queries better capture fine-grained procedural semantics.

Significance. If the synthetic MHVRC data and user studies can be shown to align with real clinical information needs, the work would establish a useful benchmark and an efficient technical recipe for handling evolving, constraint-rich queries in health video retrieval, an area with direct applications to clinical training, rehabilitation, and patient education. The two-stage design is a pragmatic contribution for scaling multi-turn interactions.

major comments (1)
  1. [Abstract and §3] MHVRC construction: The entire evaluation corpus is generated from VideoChat-Flash descriptions and DeepSeek refinements with no reported human validation, inter-annotator agreement against real health-professional or patient queries, or out-of-distribution testing on independently collected health queries. This makes the headline claims of gains over baselines and user-study preferences for multi-turn queries vulnerable to being artifacts of the LLM generation distribution rather than evidence that DATR solves the stated clinical problem.
minor comments (1)
  1. [Abstract] The claim of 'consistent gains' is stated without quantitative metrics, specific baseline names, or error bars; a one-sentence summary of the key numbers would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment on the MHVRC corpus construction point by point below.

Point-by-point responses
  1. Referee: [Abstract and §3] MHVRC construction: The entire evaluation corpus is generated from VideoChat-Flash descriptions and DeepSeek refinements with no reported human validation, inter-annotator agreement against real health-professional or patient queries, or out-of-distribution testing on independently collected health queries. This makes the headline claims of gains over baselines and user-study preferences for multi-turn queries vulnerable to being artifacts of the LLM generation distribution rather than evidence that DATR solves the stated clinical problem.

    Authors: We appreciate the referee's observation concerning the synthetic construction of the MHVRC corpus. As described in §3, the corpus was built by first using VideoChat-Flash to produce video-grounded descriptions and then employing DeepSeek to generate refined multi-turn queries. This methodology enables the creation of a substantial dataset for studying interactive retrieval without the prohibitive costs of manual annotation by domain experts. We concur that the lack of reported human validation, inter-annotator agreement metrics with health professionals or patients, and out-of-distribution evaluation on independently sourced queries represents a significant limitation. Consequently, the observed improvements and user study preferences may partly reflect characteristics of the LLM-generated distribution. The user studies do incorporate human judgments on query quality, offering partial human validation. In the revised manuscript, we will revise the abstract and §3 to more clearly delineate the synthetic nature of MHVRC, include additional details on the generation pipeline, and add a dedicated limitations paragraph discussing the need for future real-world clinical query validation. We maintain that the DATR framework and the benchmark provide a pragmatic foundation for further research in this area.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper constructs the MHVRC benchmark by applying external LLMs (VideoChat-Flash for video descriptions, DeepSeek for multi-turn refinements) and evaluates the DATR framework (CLIP-style coarse retrieval plus multi-turn fusion and cross-encoder re-ranking) empirically on that corpus, reporting gains over baselines plus user-study support for multi-turn queries. No equations, fitted parameters, or first-principles derivations are present that reduce to self-definition or tautology. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling occur. The central claims rest on standard retrieval components applied to a new task and synthetic corpus; the evaluation chain is self-contained and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim relies on the assumption that AI-generated data can stand in for human interactions, a domain assumption for which no independent validation is reported.

axioms (1)
  • [domain assumption] The synthetic data generation using VideoChat-Flash and DeepSeek produces realistic multi-turn queries for health videos.
    The corpus construction relies on this to create the benchmark.

pith-pipeline@v0.9.0 · 5518 in / 1142 out tokens · 62791 ms · 2026-05-09T17:49:11.447858+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    MedVidQA: A large-scale medical video question answering dataset

    Asma Ben Abacha, Wen-wai Yim, Yujuan Fan, Thomas Lin, and Dina Demner-Fushman. MedVidQA: A large-scale medical video question answering dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

  2. [2]

    ViViT: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021

  3. [3]

    Automatic exercise assessment in physical rehabilitation

    Sujin Bae, Jooyeon Kim, Jihye Park, and Sangyoun Lee. Automatic exercise assessment in physical rehabilitation. Sensors, 19(19):4113, 2019

  4. [4]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021

  5. [5]

    Is space-time attention all you need for video understanding?

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, 2021

  6. [6]

    MagicPose: Realistic human poses and facial expressions retargeting with identity-aware diffusion

    Di Chang, Yichun Shi, Quankai Gao, Jiawei Xu, and Hongbo Fu. MagicPose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. arXiv preprint arXiv:2311.12052, 2023

  7. [7]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  8. [8]

    What can natural language processing do for clinical decision support?

    Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760–772, 2009

  9. [9]

    A survey of conversational search

    Jianfeng Gao, Chenyan Xue, Anlei Dong, and Jiafeng Chen. A survey of conversational search. Foundations and Trends in Information Retrieval, 14(5):371–490, 2021

  10. [10]

    TM2D: Bimodality driven 3d dance generation via music-text integration

    Kehong Gong, Defu Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin Zuo, Michael Bi Mi, and Xinchao Wang. TM2D: Bimodality driven 3d dance generation via music-text integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  11. [11]

    Generating diverse and natural 3d human motions from text

    Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022

  12. [12]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5804–5813, 2017

  13. [13]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation

    Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  14. [14]

    BiTDiff: Fine-grained 3d conducting motion generation via BiMamba-transformer diffusion

    Tianzhi Jia, Kaixing Yang, Xiaole Yang, Xulong Tang, Ke Qiu, Shikui Wei, and Yao Zhao. BiTDiff: Fine-grained 3d conducting motion generation via BiMamba-transformer diffusion. arXiv preprint arXiv:2604.04395, 2026

  15. [15]

    MotionGPT: Human motion as a foreign language

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. MotionGPT: Human motion as a foreign language. In Advances in Neural Information Processing Systems, 2023

  16. [16]

    SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network

    Yueming Jin, Qi Dou, Hao Chen, Lequan Yu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. SV-RCNet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE Transactions on Medical Imaging, 37(5):1114–1126, 2018

  17. [17]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 706–715, 2017

  18. [18]

    Less is more: CLIPBERT for video-and-language learning via sparse sampling

    Jie Lei, Chenliang Lyu, Liangchen Chen, Yao Li, Xiaowu Lu, and Tamara L. Berg. Less is more: CLIPBERT for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7331–7341, 2021

  19. [19]

    TVQA: Localized, compositional video question answering

    Jie Lei, Licheng Yu, Tamara L. Berg, and Mohit Bansal. TVQA: Localized, compositional video question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1369–1379, 2018

  20. [20]

    TVR: A large-scale dataset for video-subtitle moment retrieval

    Jie Lei, Licheng Yu, Tamara L. Berg, and Mohit Bansal. TVR: A large-scale dataset for video-subtitle moment retrieval. In Proceedings of the European Conference on Computer Vision, pages 447–463, 2020

  21. [21]

    HERO: Hierarchical encoder for video+language omni-representation pre-training

    Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. HERO: Hierarchical encoder for video+language omni-representation pre-training. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2046–2065, 2020

  22. [22]

    AI Choreographer: Music conditioned 3d dance generation with AIST++

    Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI Choreographer: Music conditioned 3d dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13401–13412, 2021

  23. [23]

    Smart rehabilitation based on artificial intelligence and internet of things: A survey

    Yong Li, Jie Hu, and Yu Zhang. Smart rehabilitation based on artificial intelligence and internet of things: A survey. IEEE Access, 8:180246–180271, 2020

  24. [24]

    CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. In Advances in Neural Information Processing Systems, 2021

  25. [25]

    Passage re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019

  26. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763, 2021

  27. [27]

    Bailando: 3d dance generation by actor-critic GPT with choreographic memory

    Li Siyao, Weijiang Yu, Tianpei Gu, Chunze Lin, Quan Wang, Chen Qian, Chen Change Loy, and Ziwei Liu. Bailando: 3d dance generation by actor-critic GPT with choreographic memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  28. [28]

    Personalized dance synthesis based on physical and cognitive intensities

    Xulong Tang, Eun Yeo, Ruiyu Mao, Xiaohu Guo, and Rawan Alghofaili. Personalized dance synthesis based on physical and cognitive intensities. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, pages 261–271, 2026

  29. [29]

    Human motion diffusion model

    Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. In International Conference on Learning Representations, 2023

  30. [30]

    EndoNet: A deep architecture for recognition tasks on laparoscopic videos

    Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2017

  31. [31]

    Computer vision for musculoskeletal rehabilitation: A survey

    Jiang Wang, Xinyi Liu, Zhenyu Jiang, and Qi Zhang. Computer vision for musculoskeletal rehabilitation: A survey. IEEE Reviews in Biomedical Engineering, 2023

  32. [32]

    DanceCamera3D: 3d camera movement synthesis with music and dance

    Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, and Jiebo Luo. DanceCamera3D: 3d camera movement synthesis with music and dance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7892–7901, 2024

  33. [33]

    MSR-VTT: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016

  34. [34]

    MEGADance: Mixture-of-experts architecture for genre-aware 3d dance generation

    Kaixing Yang, Xulong Tang, Ziqiao Peng, Yuxuan Hu, Jun He, and Hongyan Liu. MEGADance: Mixture-of-experts architecture for genre-aware 3d dance generation. arXiv preprint arXiv:2505.17543, 2025

  35. [35]

    FlowerDance: MeanFlow for efficient and refined 3d dance generation

    Kaixing Yang, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jun He, and Hongyan Liu. FlowerDance: MeanFlow for efficient and refined 3d dance generation. arXiv preprint arXiv:2511.21029, 2025

  36. [36]

    BeatDance: A beat-based model-agnostic contrastive learning framework for music-dance retrieval

    Kaixing Yang, Xukun Zhou, Xulong Tang, Ran Diao, Hongyan Liu, Jun He, and Zhaoxin Fan. BeatDance: A beat-based model-agnostic contrastive learning framework for music-dance retrieval. In Proceedings of the 2024 International Conference on Multimedia Retrieval, pages 11–19, 2024

  37. [37]

    MACE-Dance: Motion-appearance cascaded experts for music-driven dance video generation

    Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, and Jun He. MACE-Dance: Motion-appearance cascaded experts for music-driven dance video generation. arXiv preprint arXiv:2512.18181, 2025

  38. [38]

    UniAVGen: Unified audio and video generation with asymmetric cross-modal interactions

    Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, and Limin Wang. UniAVGen: Unified audio and video generation with asymmetric cross-modal interactions. arXiv preprint arXiv:2511.03334, 2025

  39. [39]

    SemTalk: Holistic co-speech motion generation with frame-level semantic emphasis

    Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, and Zhigang Tu. SemTalk: Holistic co-speech motion generation with frame-level semantic emphasis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13761–13771, 2025

  40. [40]

    EchoMask: Speech-queried attention-based mask modeling for holistic co-speech motion generation

    Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Jianqiang Ren, Liefeng Bo, and Zhigang Tu. EchoMask: Speech-queried attention-based mask modeling for holistic co-speech motion generation. In Proceedings of the ACM International Conference on Multimedia, pages 10827–10836, 2025

  41. [41]

    Robust 2D skeleton action recognition via decoupling and distilling 3D latent features

    Xiangyue Zhang, Kunkun Pan, Di Wang, Xinchen Jiang, and Zhigang Tu. Robust 2D skeleton action recognition via decoupling and distilling 3D latent features. IEEE Transactions on Circuits and Systems for Video Technology, 2025

  42. [42]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason J. Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7590–7598, 2018

  43. [43]

    MotionBERT: A unified perspective on learning human motion representations

    Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. MotionBERT: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023