Interactive Multi-Turn Retrieval for Health Videos
Pith reviewed 2026-05-09 17:49 UTC · model grok-4.3
The pith
Multi-turn dialogue refines vague health video queries into clinically precise retrievals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct MHVRC by pairing video-grounded descriptions from VideoChat-Flash with query refinements from DeepSeek, then demonstrate that DATR's coarse dual-encoder stage followed by multi-turn query fusion and cross-encoder re-ranking produces ranked lists that better match the evolving information needs in health video retrieval.
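The corpus construction described above can be sketched as a simple loop: a video-grounded description seeds the first turn, and each later turn refines the previous query. This is a minimal sketch; `describe` and `refine` stand in for the VideoChat-Flash and DeepSeek calls, whose actual interfaces are not specified in this review.

```python
def build_mhvrc_entry(video_id, describe, refine, num_turns=3):
    """Sketch of one MHVRC entry: turn 1 is a video-grounded
    description, each subsequent turn refines the previous query.
    `describe` and `refine` are hypothetical stand-ins for the
    VideoChat-Flash and DeepSeek calls."""
    turns = [describe(video_id)]
    for _ in range(num_turns - 1):
        turns.append(refine(turns[-1]))
    return {"video_id": video_id, "turns": turns}
```

Any callables with these shapes can be plugged in, which also makes the pipeline easy to mock for testing.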
What carries the argument
DATR (Dialogue-Aware Two-Stage Retrieval), which performs fast coarse retrieval with a CLIP-style dual encoder over sparsely sampled frames, then re-ranks the top candidates by fusing the full multi-turn query history in a lightweight cross-encoder.
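The two-stage pattern can be sketched as follows. This is a minimal illustration, not the paper's implementation: embeddings are assumed precomputed and L2-normalized, turn fusion is stood in for by concatenation, and `cross_score` is a hypothetical cross-encoder scoring callable.

```python
import numpy as np

def coarse_retrieve(query_emb, video_embs, k):
    """Stage 1: dual-encoder scoring (dot product over precomputed,
    L2-normalized embeddings); returns indices of the top-k videos."""
    scores = video_embs @ query_emb
    return np.argsort(-scores)[:k]

def fuse_turns(turns):
    """Multi-turn query fusion; the paper's fusion module is not
    specified here, so simple concatenation stands in."""
    return " ".join(turns)

def rerank(fused_query, candidates, cross_score):
    """Stage 2: re-rank the coarse candidates with a cross-encoder
    score computed over the fused dialogue history."""
    return sorted(candidates, key=lambda v: -cross_score(fused_query, v))
```

The division of labor is the point: the cheap dot product prunes the corpus, so the expensive pairwise `cross_score` only runs on a handful of candidates.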
If this is right
- Health video systems can shift from one-shot queries to handling constraints such as hand placement, equipment availability, and patient-specific cautions that emerge only in follow-up turns.
- Automated construction of multi-turn query corpora enables scalable evaluation of interactive retrieval without requiring exhaustive manual annotation.
- Fine-grained procedural details in instructional videos become retrievable once dialogue context is fused rather than relying on an initial broad query alone.
Where Pith is reading between the lines
- The same two-stage coarse-then-fusion pattern could be tested on procedural videos outside health, such as exercise routines or equipment maintenance, where initial queries are similarly underspecified.
- Live deployment logs from actual users might expose refinement patterns, such as clarification on contraindications, that differ from the AI-generated sequences used to build MHVRC.
- Pairing DATR with a conversational interface would allow testing whether users naturally produce the multi-turn refinements the framework is designed to exploit.
Load-bearing premise
The multi-turn queries created by VideoChat-Flash and DeepSeek sufficiently stand in for the way real users would progressively refine their searches for health videos.
What would settle it
A controlled study that collects live multi-turn interactions from health professionals or patients using the system and checks whether the videos retrieved match those returned under the AI-generated query sequences in MHVRC.
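One concrete way to score such a study (a hypothetical metric choice, not one proposed in the paper) is to compare the top-k retrieved lists under live user queries against those returned for the AI-generated MHVRC sequences for the same information need, e.g. via Jaccard overlap:

```python
def jaccard_at_k(list_a, list_b, k):
    """Overlap between the top-k videos retrieved under live queries
    and under the synthetic query sequences for the same need.
    1.0 means identical top-k sets, 0.0 means disjoint."""
    a, b = set(list_a[:k]), set(list_b[:k])
    return len(a & b) / len(a | b)
```

Low overlap would suggest the synthetic refinements do not stand in for real user behavior; rank-sensitive measures would sharpen the comparison further.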
Original abstract
The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries better capture fine-grained procedural semantics than single-turn annotations. The work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.
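The abstract's "sparse frame sampling" step can be sketched as uniformly spaced frame indices, the common recipe for CLIP-style video encoders that score a whole video from a handful of frames. The exact sampling scheme used by DATR is not specified here; this is an assumed uniform variant.

```python
def sparse_sample(num_frames, num_samples):
    """Pick num_samples uniformly spaced frame indices from a video
    of num_frames frames, taking the center of each segment."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]
```

Center-of-segment sampling avoids clustering picks at the clip boundaries, which matters for instructional videos whose key steps are spread across the timeline.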
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces interactive multi-turn semantic retrieval for health videos, constructs the MHVRC corpus by combining video-grounded descriptions from VideoChat-Flash with query refinements from DeepSeek, proposes the DATR Dialogue-Aware Two-Stage Retrieval framework (coarse CLIP-style retrieval followed by multi-turn query fusion and cross-encoder re-ranking), and reports consistent gains over text-video baselines on MHVRC together with user studies claiming that refined multi-turn queries better capture fine-grained procedural semantics.
Significance. If the synthetic MHVRC data and user studies can be shown to align with real clinical information needs, the work would establish a useful benchmark and an efficient technical recipe for handling evolving, constraint-rich queries in health video retrieval, an area with direct applications to clinical training, rehabilitation, and patient education. The two-stage design is a pragmatic contribution for scaling multi-turn interactions.
Major comments (1)
- [Abstract and §3] Abstract and §3 (MHVRC construction): The entire evaluation corpus is generated from VideoChat-Flash descriptions and DeepSeek refinements with no reported human validation, inter-annotator agreement against real health-professional or patient queries, or out-of-distribution testing on independently collected health queries. This makes the headline claims of gains over baselines and user-study preferences for multi-turn queries vulnerable to being artifacts of the LLM generation distribution rather than evidence that DATR solves the stated clinical problem.
Minor comments (1)
- [Abstract] Abstract: The claim of 'consistent gains' is stated without any quantitative metrics, specific baseline names, or error bars; a one-sentence summary of the key numbers would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment on the MHVRC corpus construction point by point below.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (MHVRC construction): The entire evaluation corpus is generated from VideoChat-Flash descriptions and DeepSeek refinements with no reported human validation, inter-annotator agreement against real health-professional or patient queries, or out-of-distribution testing on independently collected health queries. This makes the headline claims of gains over baselines and user-study preferences for multi-turn queries vulnerable to being artifacts of the LLM generation distribution rather than evidence that DATR solves the stated clinical problem.
Authors: We appreciate the referee's observation concerning the synthetic construction of the MHVRC corpus. As described in §3, the corpus was built by first using VideoChat-Flash to produce video-grounded descriptions and then employing DeepSeek to generate refined multi-turn queries; this enables a substantial dataset for studying interactive retrieval without the prohibitive cost of manual annotation by domain experts. We concur that the absence of reported human validation, inter-annotator agreement with health professionals or patients, and out-of-distribution evaluation on independently sourced queries is a significant limitation, and that the observed improvements and user-study preferences may therefore partly reflect characteristics of the LLM-generated distribution. The user studies do incorporate human judgments of query quality, offering partial human validation. In the revised manuscript, we will clarify the synthetic nature of MHVRC in the abstract and §3, add details on the generation pipeline, and include a dedicated limitations paragraph on the need for validation against real clinical queries. We maintain that the DATR framework and the benchmark provide a pragmatic foundation for further research in this area.
Revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper constructs the MHVRC benchmark by applying external LLMs (VideoChat-Flash for video descriptions, DeepSeek for multi-turn refinements) and evaluates the DATR framework (CLIP-style coarse retrieval plus multi-turn fusion and cross-encoder re-ranking) empirically on that corpus, reporting gains over baselines plus user-study support for multi-turn queries. No equations, fitted parameters, or first-principles derivations are present that reduce to self-definition or tautology. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling occur. The central claims rest on standard retrieval components applied to a new task and synthetic corpus; the evaluation chain is self-contained and does not collapse to its inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the synthetic generation pipeline (VideoChat-Flash descriptions refined by DeepSeek) produces realistic multi-turn queries for health videos.