CoMPAS3D: A Dataset and Benchmark for Interactive Motion

Angelica Lim; Bermet Burkanova; Chuxuan Zhang; Paige Tuttösí; Payam Jome Yazdian; Trinity Evans; Yasaman Etesam; Zoe Stanley

arXiv: 2507.19684 · v2 · pith:VS5DOXV3new · submitted 2025-07-25 · cs.LG · cs.AI · cs.CL · cs.CV

keywords: motion, move, metrics, proficiency, compas3d, covering, dataset, evaluation

Abstract

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control
cs.RO 2026-06 unverdicted novelty 5.0

OMG is a diffusion model for omni-modal whole-body humanoid motion generation that uses language, audio, and reference motions after large-scale data curation to achieve state-of-the-art performance and adaptation.