pith. machine review for the scientific record.

arxiv: 2605.01809 · v1 · submitted 2026-05-03 · 💻 cs.SD · cs.AI

Recognition: unknown

TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:07 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords music-dance co-generation · evaluation benchmark · rhythmic alignment · cross-modal synchronization · beat-level metrics · audio-visual generation · text-driven synthesis

The pith

TMD-Bench shows that current music-dance generators produce high-quality music and video but often lack consistent rhythmic coupling between them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TMD-Bench as an evaluation framework for text-driven music-dance co-generation that measures three aspects: the quality of the generated music and dance on their own, how well they follow text instructions, and how precisely the dance movements align with the musical beats and phrasing. Standard evaluation methods miss this last requirement because they treat audio and video separately or use only broad consistency checks. TMD-Bench addresses the gap with physical metrics that track beat-level synchronization, combined with human perceptual ratings, and it rests on a specially curated dataset of rhythm-matched music-dance pairs plus a tool that produces detailed music descriptions. A sympathetic reader would care because better rhythmic alignment is required for convincing virtual productions and interactive media where music must visibly drive choreography.
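To make the beat-level idea concrete, the sketch below shows one way a physical synchronization metric can be written, in the spirit of beat-alignment scores from prior music-dance work: for each musical beat, find the nearest dance motion onset and reward small offsets with a Gaussian kernel. The function name, the tolerance sigma, and the example inputs are editorial assumptions, not TMD-Bench's actual definitions.

    # Minimal sketch of a beat-level synchronization score: for each musical beat,
    # find the nearest dance motion onset and reward small time offsets.
    # Illustrative only: this is not the exact metric defined by TMD-Bench.
    import numpy as np

    def beat_alignment_score(beat_times, motion_onset_times, sigma=0.1):
        """beat_times, motion_onset_times: event times in seconds.
        sigma: tolerance in seconds (assumed value) controlling how sharply offsets are penalized."""
        beats = np.asarray(beat_times, dtype=float)
        onsets = np.asarray(motion_onset_times, dtype=float)
        if beats.size == 0 or onsets.size == 0:
            return 0.0
        # distance from every beat to its nearest motion onset
        nearest = np.min(np.abs(beats[:, None] - onsets[None, :]), axis=1)
        # Gaussian kernel: 1.0 when an onset sits exactly on the beat, toward 0 as it drifts
        return float(np.mean(np.exp(-(nearest ** 2) / (2 * sigma ** 2))))

    # Example: beats every 0.5 s, dance onsets slightly late
    score = beat_alignment_score([0.5, 1.0, 1.5, 2.0], [0.55, 1.02, 1.62, 1.98])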

Core claim

TMD-Bench evaluates music-dance co-generation systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment by combining computable physical metrics with perceptual multimodal judgments on a curated rhythm-aligned music-dance dataset supported by a fine-grained Music Captioner. It finds that modern commercial audio-visual models such as Veo 3 and Sora 2 generate high-quality music and video, while their rhythmic coupling remains less consistently optimized. The unified baseline RhyJAM, trained on rhythm-aligned data, achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity.

What carries the argument

TMD-Bench, the multi-level evaluation paradigm that integrates physical beat-synchronization metrics with perceptual judgments on a rhythm-aligned dataset and a structured Music Captioner.

If this is right

  • Commercial models require additional optimization focused on rhythmic coupling between generated music and dance.
  • Training on explicitly rhythm-aligned data enables models to reach competitive synchronization without loss of single-modality quality.
  • Future music-dance systems should incorporate explicit objectives for rhythmic and kinetic coherence.
  • Evaluation protocols for audio-visual generation need to combine objective metrics with human judgments to assess fine temporal alignment (a small aggregation sketch follows this list).
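One way objective metrics and MOS-style human ratings can be folded into a single report is sketched below; the dimension names, the 1-5 rescaling, and the equal default weights are editorial assumptions, not the weighting TMD-Bench uses.

    # Hypothetical aggregation of objective metrics and human MOS ratings into one report.
    # The dimension names and weights are illustrative, not TMD-Bench's protocol.
    from statistics import mean

    def aggregate_scores(objective, human_ratings, weights=None):
        """objective: dict of metric name -> score already scaled to [0, 1].
        human_ratings: dict of dimension name -> list of 1-5 MOS ratings.
        weights: optional dict of name -> weight; defaults to equal weighting."""
        scores = dict(objective)
        # rescale 1-5 MOS ratings to [0, 1] so they are comparable with the metrics
        for dim, ratings in human_ratings.items():
            scores[dim] = (mean(ratings) - 1.0) / 4.0
        weights = weights or {name: 1.0 for name in scores}
        total = sum(weights.values())
        per_dim = {name: round(value, 3) for name, value in scores.items()}
        overall = sum(scores[name] * weights[name] for name in scores) / total
        return {"per_dimension": per_dim, "overall": round(overall, 3)}

    report = aggregate_scores(
        objective={"beat_alignment": 0.62},
        human_ratings={"visual_audio_alignment": [4, 3, 4, 5], "production_quality": [4, 4, 5, 4]},
    )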

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-level approach could be adapted to evaluate other paired generation tasks that require precise timing, such as speech-driven gesture or music-driven animation.
  • Wider use of rhythm-aligned training sets may raise performance in broader audio-visual synthesis domains.
  • Developers could incorporate TMD-Bench scores directly into model training loops to target synchronization improvements.

Load-bearing premise

The curated rhythm-aligned dataset together with the chosen physical metrics and perceptual judgments provide a reliable and unbiased measure of cross-modal rhythmic alignment.

What would settle it

An independent test in which models that score high on TMD-Bench rhythmic metrics receive low ratings from choreographers for actual music-dance synchronization, or vice versa, would show the benchmark does not capture the intended quality.
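One concrete shape such a test could take, sketched here with made-up numbers purely for illustration: collect choreographer ratings for a shared set of generated clips and check whether they track the benchmark's rhythmic metric, for instance with a rank correlation.

    # Hypothetical validation sketch: does the benchmark's rhythmic metric track
    # expert choreographer judgments? A low or negative rank correlation would be
    # evidence against the benchmark capturing the intended quality.
    from scipy.stats import spearmanr

    # one score per generated clip (illustrative numbers, not reported results)
    benchmark_rhythm_scores = [0.81, 0.42, 0.67, 0.30, 0.75, 0.55]
    choreographer_ratings   = [4.5,  2.0,  4.0,  2.5,  3.5,  3.0]  # e.g. mean of several raters

    rho, p_value = spearmanr(benchmark_rhythm_scores, choreographer_ratings)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")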

Figures

Figures reproduced from arXiv: 2605.01809 by Changhao Pan, Fan Zhuo, Jin Zhou, Majun Zhang, Miles Yang, Nick Huang, Pengfei Zhou, Shan Yang, Sizhe Shan, Xiaoda Yang, Yang You, Yang Yuguang, Zhou Zhao.

Figure 1
Figure 1. Overview of TMD-Bench, a benchmark for text-driven music–dance co-generation. The top-left panel illustrates the dataset construction pipeline. The dataset overview (top-right) shows the distribution of dance-related attributes in the 10k dataset; each concentric ring corresponds to an attribute—Performer Cardinality, Dance Style Category, Performer Attributes, and Scene Context (from inner to outer). The … view at source ↗
Figure 2
Figure 2. Overview of our evaluation framework for music–dance generation. The benchmark decomposes video, audio, and cross-modal alignment into complementary dimensions […] that have been widely used in prior work (Tjandra et al., 2025) and shown to be stable in practice: Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU). To further address aspects that are dif… view at source ↗
Figure 3
Figure 3. view at source ↗
Figure 4
Figure 4. Overview of the unified diffusion architecture for text-driven music–dance generation. view at source ↗
Figure 5
Figure 5. view at source ↗
Figure 6
Figure 6. Visual Audio Alignment prompt used for MLLM-based evaluation. The template defines key concepts, step-by-step reasoning, and a 1–5 alignment score for rhythmic synchronization. view at source ↗
Figure 7
Figure 7. Video Instruction Following prompt. The judge first describes the observed video, then constructs an ideal caption from the text instruction, and finally rates semantic compliance on a 1–5 scale with accompanying reasoning. view at source ↗
Figure 8
Figure 8. Video Visual Quality evaluation prompt. The template guides the model to score both imaging quality and aesthetic quality, capturing technical fidelity and overall visual appeal. view at source ↗
Figure 9
Figure 9. Prompt used to rate video motion and temporal consistency. view at source ↗
Figure 10
Figure 10. Prompt for auditory aesthetics. The judge outputs four MOS-style scores—production complexity, content enjoyment, production quality, and content usefulness. view at source ↗
Figure 11
Figure 11. Qualitative examples of text-driven music–dance co-generation (Case 1). Each row depicts sampled video frames, the corresponding audio waveform, and semantic tags, illustrating identity consistency and rhythm-aware motion under varying prompts. view at source ↗
Figure 12
Figure 12. Qualitative examples of text-driven music–dance co-generation (Case 2). Each row depicts sampled video frames, the corresponding audio waveform, and semantic tags, illustrating identity consistency and rhythm-aware motion under varying prompts. view at source ↗
Figure 13
Figure 13. Qualitative examples of text-driven music–dance co-generation (Case 3). Each row depicts sampled video frames, the corresponding audio waveform, and semantic tags, illustrating identity consistency and rhythm-aware motion under varying prompts. view at source ↗
read the original abstract

Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music-dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music-dance dataset and a fine-grained Music Captioner for structured music semantics. TMD-Bench further reveals that (i) modern commercial audio-visual models, such as Veo 3 and Sora 2, produce high-quality music and video, while rhythmic coupling remains less consistently optimized and leaves room for improvement, and (ii) our unified baseline RhyJAM trained on rhythm-aligned data achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity. This presents prospects for building next-generation music-dance models that explicitly optimize rhythmic and kinetic coherence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TMD-Bench, a benchmark for text-driven music-dance co-generation that evaluates systems on unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. It integrates computable physical metrics with perceptual multimodal judgments, supported by a curated rhythm-aligned music-dance dataset and a fine-grained Music Captioner. Using this benchmark, the authors evaluate commercial models such as Veo 3 and Sora 2, finding high unimodal quality but inconsistent rhythmic coupling, and present a unified baseline RhyJAM trained on rhythm-aligned data that achieves competitive beat-level synchronization while maintaining unimodal fidelity.

Significance. If the benchmark's dataset, captioner, and metrics are shown to be robust and free of systematic bias, TMD-Bench would provide a valuable contribution by addressing the gap in fine-grained rhythmic evaluation for music-dance co-generation, which existing unimodal or generic audiovisual metrics do not capture. The curated dataset and RhyJAM baseline could serve as useful resources for the community, enabling more targeted progress on kinetic coherence.

major comments (3)
  1. [§3] §3 (Dataset Curation): The central claims about inconsistent rhythmic coupling in commercial models and the competitiveness of RhyJAM rest on the assumption that the curated rhythm-aligned music-dance dataset consists of genuinely aligned pairs without curation or genre bias. The manuscript must provide explicit details on verification of alignment (e.g., beat detection algorithms, manual annotation protocols) and genre/style distribution statistics; without these, the reported 'room for improvement' cannot be reliably interpreted.
  2. [§4.2] §4.2 (Metrics and Evaluation): The physical metrics for beat synchronization and perceptual judgments are claimed to assess cross-modal rhythmic alignment at fine temporal resolution. However, no ablation studies, sensitivity analysis, or correlation with human judgments on specific elements (phrasing, accents, kinetic coherence) are described to rule out confounding by unimodal quality or gaps in coverage; this directly undermines the load-bearing conclusion that commercial models leave room for improvement.
  3. [§5] §5 (Baseline RhyJAM): The claim that RhyJAM achieves competitive beat-level synchronization while maintaining unimodal fidelity depends on full transparency of its training procedure, architecture, and how rhythm-aligned data is leveraged. Missing details on these aspects (e.g., loss terms for alignment, data preprocessing) make it impossible to determine whether results are attributable to the benchmark design or the model itself.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'fine temporal resolution' is used without specifying the exact temporal scales or beat granularity employed in the metrics; adding this would improve clarity.
  2. [Figures] Figure/Table captions: Ensure all figures comparing model outputs include explicit labels for rhythmic alignment scores and error bars where applicable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the insightful comments that will help improve the clarity and robustness of our work on TMD-Bench. We address each of the major comments below and commit to making the suggested revisions to enhance the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Curation): The central claims about inconsistent rhythmic coupling in commercial models and the competitiveness of RhyJAM rest on the assumption that the curated rhythm-aligned music-dance dataset consists of genuinely aligned pairs without curation or genre bias. The manuscript must provide explicit details on verification of alignment (e.g., beat detection algorithms, manual annotation protocols) and genre/style distribution statistics; without these, the reported 'room for improvement' cannot be reliably interpreted.

    Authors: We agree that providing more explicit details on the dataset curation is important for the community to interpret the benchmark results reliably. In the revised manuscript, we will expand Section 3 to include specific information on the alignment verification process, including the beat detection algorithms employed and the manual annotation protocols followed, as well as statistics on the genre and style distribution of the dataset. revision: yes

  2. Referee: [§4.2] §4.2 (Metrics and Evaluation): The physical metrics for beat synchronization and perceptual judgments are claimed to assess cross-modal rhythmic alignment at fine temporal resolution. However, no ablation studies, sensitivity analysis, or correlation with human judgments on specific elements (phrasing, accents, kinetic coherence) are described to rule out confounding by unimodal quality or gaps in coverage; this directly undermines the load-bearing conclusion that commercial models leave room for improvement.

    Authors: We thank the referee for highlighting this aspect. Our physical metrics are computed using independent feature extractors for audio beats and video motion onsets to minimize confounding with unimodal quality (a minimal illustrative sketch of such a two-extractor setup follows these responses). To further strengthen the evaluation, we will incorporate a sensitivity analysis and report correlations with human judgments on rhythmic elements in the revised version of the paper. revision: partial

  3. Referee: [§5] §5 (Baseline RhyJAM): The claim that RhyJAM achieves competitive beat-level synchronization while maintaining unimodal fidelity depends on full transparency of its training procedure, architecture, and how rhythm-aligned data is leveraged. Missing details on these aspects (e.g., loss terms for alignment, data preprocessing) make it impossible to determine whether results are attributable to the benchmark design or the model itself.

    Authors: We acknowledge the need for greater transparency in describing the RhyJAM baseline. In the revised manuscript, we will provide comprehensive details on the model architecture, training procedure, loss functions including those for rhythmic alignment, and data preprocessing steps in Section 5 and the appendix. revision: yes
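The second response above leans on the independence of the two feature extractors. A minimal sketch of such a setup, assuming librosa beat tracking for the audio and a simple frame-difference motion-energy peak picker for the video (neither is necessarily what the authors use), looks like this:

    # Minimal sketch of independent per-modality event extraction: audio beats from
    # the soundtrack alone, motion onsets from the frames alone. Illustrative only;
    # the paper's actual extractors may differ.
    import numpy as np
    import librosa

    def audio_beat_times(audio_path):
        """Estimate musical beat times (in seconds) from the audio track alone."""
        y, sr = librosa.load(audio_path, sr=None, mono=True)
        _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        return librosa.frames_to_time(beat_frames, sr=sr)

    def motion_onset_times(frames, fps):
        """Estimate motion onsets (in seconds) from grayscale video frames alone.
        frames: array of shape (T, H, W); fps: frames per second."""
        frames = np.asarray(frames, dtype=float)
        # per-frame motion energy as mean absolute difference between consecutive frames
        energy = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
        # treat local peaks in motion energy as onsets
        peaks = [i for i in range(1, len(energy) - 1)
                 if energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]]
        return (np.array(peaks, dtype=float) + 1.0) / fps

    # The resulting beat and onset times can then feed a beat-level score such as
    # the beat_alignment_score sketch shown earlier on this page.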

Circularity Check

0 steps flagged

No circularity: benchmark and dataset introduction with independent evaluation metrics

full rationale

The paper introduces TMD-Bench as a new multi-level evaluation framework for music-dance co-generation, supported by a curated rhythm-aligned dataset and a fine-grained Music Captioner. No equations, predictive derivations, or first-principles results are presented that reduce by construction to fitted parameters, self-citations, or the input data itself. Claims about commercial models (Veo 3, Sora 2) and the RhyJAM baseline are framed as empirical observations from applying the benchmark's physical metrics and perceptual judgments, which are defined independently rather than tautologically. The central contribution is the benchmark paradigm and supporting resources, not a closed-loop prediction or self-referential theorem. This is a standard non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that the chosen physical metrics and perceptual judgments capture rhythmic alignment, plus the quality of the curated dataset and captioner; no free parameters or invented entities are explicitly described in the abstract.

axioms (1)
  • domain assumption: Rhythmic coupling in music-dance can be reliably measured by a combination of computable physical metrics and perceptual multimodal judgments.
    Invoked in the description of the benchmark's evaluation paradigm.

pith-pipeline@v0.9.0 · 5556 in / 1342 out tokens · 25142 ms · 2026-05-09T16:07:17.367639+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 5 internal anchors

  1. [1]

     Cheng, H.

     URL https://arxiv.org/abs/2309.16429. Cheng, H. K., Schwing, A., Mitsufuji, Y., Shibuya, T., Hayakawa, A., and Ishii, M. Taming multimodal joint training for high-quality video-to-audio synthesis,

  2. [2]

     DeepMind, G.

     URL https://arxiv.org/abs/2412.15322. DeepMind, G. Veo 3,

  3. [3]

     ACE-Step: A step towards music generation foundation model. arXiv preprint arXiv:2506.00045,

     URL https://arxiv.org/abs/2506.00045. Group, A. T. Wan 2.5, 2025a. URL https://tongyi.aliyun.com/wan. Group, A. T. Wan 2.6, 2025b. URL https://tongyi.aliyun.com/wan. HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., Richardson, E., Shiran, G., Chachy, I., Chetboun, J., F...

  4. [4]

     LTX-2: Efficient Joint Audio-Visual Foundation Model

     URL https://arxiv.org/abs/2601.03233. Hoi, S., Zhu, J., Yang, R., Gan, Q., and Xue, S. OmniAvatar: Efficient audio-driven avatar video generation with adaptive body animation,

  5. [5]

     OmniAvatar: Efficient audio-driven avatar video generation with adaptive body animation. arXiv preprint arXiv:2506.18866, 2025

     URL https://arxiv.org/abs/2506.18866. Hua, D., Wang, X., Zeng, B., Huang, X., Liang, H., Niu, J., Chen, X., Xu, Q., and Zhang, W. VABench: A comprehensive benchmark for audio-video generation,

  6. [6]

     VABench: A Comprehensive Benchmark for Audio-Video Generation

     URL https://arxiv.org/abs/2512.09299. Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., and Liu, Z. VBench: Comprehensive benchmark suite for video generative models,

  7. [7]

     Kuaishou

     URL https://arxiv.org/abs/2502.01061. Kuaishou. kling2.6,

  8. [8]

     Li, R., Zhao, J., Zhang, Y., Su, M., Ren, Z., Zhang, H., Tang, Y., and Li, X.

     URL https://arxiv.org/abs/1911.02001. Li, R., Zhao, J., Zhang, Y., Su, M., Ren, Z., Zhang, H., Tang, Y., and Li, X. FineDance: A fine-grained choreography dataset for 3D full body dance generation,

  9. [9]

     Liang, Y., Chen, Z., Ding, C., and Di, X.

     URL https://arxiv.org/abs/2212.03741. Liang, Y., Chen, Z., Ding, C., and Di, X. DeepSound-v1: Start to think step-by-step in the audio generation from videos,

  10. [10]

     Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Jiang, R., Luo, J., Fei, H., and Chua, T.-S.

     URL https://arxiv.org/abs/2503.22208. Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Jiang, R., Luo, J., Fei, H., and Chua, T.-S. JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization,

  11. [11]

     JavisDiT: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377, 2025

     URL https://arxiv.org/abs/2503.23377. Low, C., Wang, W., and Katyal, C. Ovi: Twin backbone cross-modal fusion for audio-video generation,

  12. [12]

     URL https://arxiv.org/abs/2510.01284. OpenAI. Sora 2: Video generation model,

  13. [13]

     URL https://openai.com/sora. Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y., Chuang, C.-Y., Yan, D., Choudhary, D., Wang, D., Sethi, G., Pang, G., Ma, H., Misra, I., Hou, J., Wang, J., ran Jagadeesh, K., Li, K., Zhang, L., Singh, M., Williamson, M., Le, M., Yu, M., Singh, M. K., Zhang, P., Vajda, P., Duva...

  14. [14]

     Tjandra, A., Wu, Y.-C., Guo, B., Hoffman, J., Ellis, B., Vyas, A., Shi, B., Chen, S., Le, M., Zacharov, N., Wood, C., Lee, A., and Hsu, W.-N.

     URL https://arxiv.org/abs/2508.16930. Tjandra, A., Wu, Y.-C., Guo, B., Hoffman, J., Ellis, B., Vyas, A., Shi, B., Chen, S., Le, M., Zacharov, N., Wood, C., Lee, A., and Hsu, W.-N. Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound,

  15. [15]

     URL https://arxiv.org/abs/2502.05139. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T....

  16. [16]

     Wan: Open and Advanced Large-Scale Video Generative Models

     URL https://arxiv.org/abs/2503.20314. Wang, C., Zhang, C., Chen, X., Chen, Z., Xu, H., Song, G., Xie, Y., Luo, L., and Chang, D. X-Dancer: Expressive music to human dance video generation, 2025a. URL https://arxiv.org/abs/2502.17414. Wang, D., Zuo, W., Li, A., Chen, L.-H., Liao, X., Zhou, D., Yin, Z., Dai, X., Jiang, D., and Yu, G. Universe-1: Unified a...

  17. [17]

     InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

     URL https://arxiv.org/abs/2307.06942. Wang, Z., Zhang, P., Qi, J., Xu, G. W. S., Zhang, B., and Bo, L. OmniTalker: Real-time text-driven talking head generation with in-context audio-visual style replication, 2025c. URL https://arxiv.org/abs/2504.02433. Wu, Y., Chen, K., Zhang, T., Hui, Y., Nezhurina, M., Berg-Kirkpatrick, T., and Dubnov, S. Large-sca...

  18. [18]

     Large-scale contrastive language-audio pretraining (CLAP),

     URL https://arxiv.org/abs/2211.06687. Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., and Lin, J. Qwen2.5-Omni technical report,

  19. [19]

     Qwen2.5-Omni Technical Report

     URL https://arxiv.org/abs/2503.20215. You, F., Fang, M., Tang, L., Huang, R., Wang, Y., and Zhao, Z. MoMu-Diffusion: On learning long-term motion-music synchronization and correspondence,

  20. [20]

     Zhang, G., Zhou, Z., Hu, T., Peng, Z., Zhang, Y., Chen, Y., Zhou, Y., Lu, Q., and Wang, L.

     URL https://arxiv.org/abs/2411.01805. Zhang, G., Zhou, Z., Hu, T., Peng, Z., Zhang, Y., Chen, Y., Zhou, Y., Lu, Q., and Wang, L. UniAVGen: Unified audio and video generation with asymmetric cross-modal interactions, 2025a. URL https://arxiv.org/abs/2511.03334. Zhang, R., Yu, B., Min, J., Xin, Y., Wei, Z., Shi, J. N., Huang, M., Kong, X., Xin, N. L....

  21. [21]

    Raters were intentionally diverse, including five students with a computer-science background and five from non-CS majors. The 100 videos were selected to cover a wide range of dance styles, music genres, and scene contexts, and were sampled in a balanced manner with a similar number of clips from each baseline to avoid skew. To reduce potential informati...