arxiv: 2604.23886 · v1 · submitted 2026-04-26 · 💻 cs.GR · cs.AI

MUSIC: Learning Muscle-Driven Dexterous Hand Control

Pei Xu, Yufei Ye, Shuchun Sun, Yu Ding, Elizabeth Schumann, C. Karen Liu

Pith reviewed 2026-05-08 04:46 UTC · model grok-4.3

keywords: dexterous control · musculoskeletal hand · piano playing · reinforcement learning · variational autoencoder · bimanual coordination · physics-based simulation

The pith

Hierarchical policies distill muscle tracking into latent spaces so hands can play novel piano pieces with accurate key presses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a control method for physics-based musculoskeletal hand models that generalizes piano playing to music outside the training data. Low-level reinforcement learning policies generate muscle-tendon activations to track reference motions from a large dataset. These policies are compressed into variational autoencoder latent spaces that remove fine-grained dynamics while preserving usable structure. High-level piece-specific policies then select sequences of latent actions to match note events extracted from new scores, producing coordinated two-handed performances. The work also introduces an improved hand model that increases tracking precision and biomechanical stability.

Core claim

Our approach combines high-frequency muscle-level control with low-frequency latent-space coordination in a hierarchical architecture. At the low level, general single-hand policies are trained via reinforcement learning to generate dynamic muscle-tendon activations while tracking trajectories from a large reference motion dataset. The resulting tracking policies are then distilled into variational autoencoder models, yielding smooth and structured latent spaces that abstract away low-level muscle dynamics. For the high level, we train piece-specific policies to operate in this latent space, coordinating bimanual motions based on specific goals denoted by note events extracted from given musical scores, to synthesize performances beyond the reference data.

What carries the argument

The hierarchical architecture in which reinforcement-learned muscle activation policies are distilled into variational autoencoder latent spaces that high-level policies query to coordinate bimanual actions according to musical note events.
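
The two-rate structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensions, the ratio `LOW_STEPS_PER_HIGH`, the random linear maps standing in for trained networks, and the placeholder dynamics in `toy_simulate` are all assumptions made for illustration. The point is only the control flow, in which a high-level policy emits one latent action per goal and a decoder turns that latent into muscle activations over several faster steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical; the source does not state them.
LATENT_DIM = 8
NUM_MUSCLES = 40
STATE_DIM = 16
GOAL_DIM = 4
LOW_STEPS_PER_HIGH = 4  # assumed ratio of low-level to high-level rates

# Random linear maps stand in for trained networks.
W_high = rng.normal(size=(LATENT_DIM, STATE_DIM + GOAL_DIM)) * 0.1
W_dec = rng.normal(size=(NUM_MUSCLES, STATE_DIM + LATENT_DIM)) * 0.1


def high_level_policy(state, goal):
    """Pick a latent action from the hand state and the upcoming note goal."""
    return np.tanh(W_high @ np.concatenate([state, goal]))


def low_level_decoder(state, latent):
    """Decode a latent action into muscle activations in (0, 1)."""
    logits = W_dec @ np.concatenate([state, latent])
    return 1.0 / (1.0 + np.exp(-logits))


def toy_simulate(state, activations):
    """Placeholder dynamics; a real system would step a physics engine."""
    return 0.9 * state + 0.1 * activations[:STATE_DIM]


def rollout(goals, state):
    """One latent per goal, several muscle-level steps per latent."""
    log = []
    for goal in goals:
        latent = high_level_policy(state, goal)
        for _ in range(LOW_STEPS_PER_HIGH):
            act = low_level_decoder(state, latent)
            state = toy_simulate(state, act)
            log.append(act)
    return np.array(log), state


goals = rng.normal(size=(3, GOAL_DIM))
acts, final_state = rollout(goals, np.zeros(STATE_DIM))
print(acts.shape)  # (12, 40)
```

Holding the latent fixed across the inner loop is one simple way to realize low-frequency coordination over high-frequency control; whether the paper holds, interpolates, or re-queries the latent is not stated in the material summarized here.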

If this is right

The method produces coordinated bimanual motions with accurate key presses on music outside the reference dataset.
It reaches state-of-the-art performance for piano playing among physics-based dexterous controllers.
The enhanced hand model achieves superior biomechanical stability and tracking precision compared with prior models.
Generated muscle activation patterns align with human electromyography recordings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation step could support other fine-motor skills, such as typing or playing other instruments, by reusing the latent space across tasks.
If the latent space proves reusable, it offers a route to controllers that combine precise physics simulation with creative, goal-directed behavior.
Extending the pipeline to real hardware would require closing the gap between simulated muscle dynamics and actual tendon behavior.

Load-bearing premise

That distilling low-level muscle tracking policies into a VAE latent space preserves enough dynamic information for high-level policies to generalize to novel music pieces without loss of precision or stability.

What would settle it

Measure the fraction of missed or mistimed key presses when the system performs a set of piano pieces never seen during low-level policy training or VAE distillation; if accuracy falls below simpler non-latent baselines or motions become unstable, the generalization claim is falsified.
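
One way to run such a check is to score played key presses against the note events of the score. The sketch below is an assumption-laden illustration, not the paper's metric: the function name `key_press_f1`, the `(key, onset_seconds)` event format, the 50 ms tolerance, and the greedy nearest-onset matching are all hypothetical choices. A press on the wrong key counts as both a miss and a false press, and a press outside the tolerance counts as mistimed.

```python
def key_press_f1(target_events, played_events, tolerance=0.05):
    """Score played (key, onset_seconds) events against target events.

    A target is hit if an unmatched played event on the same key lies
    within `tolerance` seconds of its onset. Returns (precision, recall, f1).
    """
    unmatched = list(played_events)
    hits = 0
    for key, onset in target_events:
        best = None
        for i, (pkey, ponset) in enumerate(unmatched):
            if pkey != key or abs(ponset - onset) > tolerance:
                continue
            # Prefer the played event closest in time to the target onset.
            if best is None or abs(ponset - onset) < abs(unmatched[best][1] - onset):
                best = i
        if best is not None:
            hits += 1
            unmatched.pop(best)
    precision = hits / len(played_events) if played_events else 0.0
    recall = hits / len(target_events) if target_events else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy example: one mistimed press (key 64) and one wrong key (71 instead of 72).
targets = [(60, 0.00), (64, 0.50), (67, 1.00), (72, 1.50)]
played = [(60, 0.01), (64, 0.58), (67, 1.02), (71, 1.50)]
p, r, f = key_press_f1(targets, played)
print(p, r, f)  # 0.5 0.5 0.5
```

Comparing this score between the latent-space controller and a non-latent baseline on the same held-out pieces would give the kind of evidence the test above calls for.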

Figures

Figures reproduced from arXiv: 2604.23886 by C. Karen Liu, Elizabeth Schumann, Pei Xu, Shuchun Sun, Yu Ding, Yufei Ye.

**Figure 1.** Our musculoskeletal hand model (left) and diverse hand poses (right) produced by our muscle-driven motion synthesis models during piano playing.

**Figure 2.** System Overview. The whole system of our framework is trained in three stages. First, we learn a single-hand tracking policy

**Figure 3.** A demonstration trajectory is divided into three overlapping chunks.

**Figure 4.** Demonstration of our music goal representation

**Figure 5.** Illustration of our enhanced musculoskeletal hand model (left hand)

**Figure 6.** Examples of our tracking policy reproducing long-tail distributed,

**Figure 8.** F1 scores of our muscle-driven control policy on the 15 testing

**Figure 9.** Diverse finger poses when multiple keys are pressed at the same time. Our testing repertoire challenges the system with diverse dexterous behaviors,

**Figure 10.** Our policy can infer fingering while taking into account finger occupancy for future key-pressing targets. Two screenshots in each group illustrate the

**Figure 11.** Finger poses during playing arpeggios of

**Figure 12.** Two-hand coordination with overlapping and crossover (the rightmost).

**Figure 13.** Demonstrations of large hand leaps. The left and middle screenshots show the left hand leaping over the right hand and back.

**Figure 14.** Comparison of EMG recordings collected from a human subject

**Figure 15.** Learning performance of the high-level controllers with adaptive

**Figure 16.** Learning performance of the high-level controllers while using dif

**Figure 17.** Network architectures with different latent spaces. From

**Figure 19.** Learning performance of the high-level controllers using a multi

Original abstract

We present a data-driven approach for physics-based, muscle-driven dexterous control that enables musculoskeletal hands to perform precise piano playing for novel pieces of music outside the reference dataset. Our approach combines high-frequency muscle-level control with low-frequency latent-space coordination in a hierarchical architecture. At the low level, general single-hand policies are trained via reinforcement learning to generate dynamic muscle-tendon activations while tracking trajectories from a large reference motion dataset. The resulting tracking policies are then distilled into variational autoencoder (VAE) models, yielding smooth and structured latent spaces that abstract away low-level muscle dynamics. For the high level, we train piece-specific policies to operate in this latent space, coordinating bimanual motions based on specific goals, denoted by note events extracted from given musical scores, to synthesize performances beyond the reference data. In addition, we present an enhanced musculoskeletal hand model that supports fine control of fingers for accurate low-level motion tracking and diverse high-level motion synthesis. We evaluate the control pipeline of our approach on a diverse piano repertoire spanning multiple musical styles and technical demands. Results demonstrate that our approach can synthesize coordinated bimanual motions with accurate key presses, and achieve the state-of-the-art performance of piano playing in physics-based dexterous control. We also show that our musculoskeletal hand model demonstrates superior biomechanical stability and tracking precision compared to the existing model, and validate that our musculoskeletal hand model and muscle-driven controller can generate physiologically plausible activation patterns that align with human electromyography (EMG) recordings.
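
For readers unfamiliar with the distillation step, a generic conditional VAE objective gives a sense of what "distilling a tracking policy into a VAE" can mean. This is the standard textbook form and not necessarily the loss used in the paper; the symbols $s$ (hand state), $a$ (teacher muscle activations), $z$ (latent action), the weight $\beta$, and the standard normal prior are all assumptions introduced here.

```latex
% Generic conditional beta-VAE objective (assumed form, not taken from the paper).
% q_phi : encoder over latents given state s and teacher action a
% p_theta : decoder producing activations from state s and latent z
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid s, a)}\!\left[ -\log p_\theta(a \mid s, z) \right]
  + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid s, a) \,\|\, p(z) \right),
\qquad p(z) = \mathcal{N}(0, I)
```

The first term rewards reproducing the teacher's activations; the second pulls the latent distribution toward the prior, which is what tends to make the latent space smooth but can also blur fine timing and force detail when $\beta$ is large.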

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gets a hierarchical muscle-driven controller working for bimanual piano on novel pieces, but the VAE distillation step is the part that still needs checking.

The main takeaway is that they built a working pipeline for muscle-driven dexterous hands to play new piano music. Low-level RL policies track reference motions with muscle-tendon activations, those policies get distilled into VAEs for a smoother latent space, and then piece-specific high-level policies map note events from scores to coordinated bimanual actions in that space. They also ship an updated musculoskeletal hand model that improves finger stability and tracking precision over earlier versions, plus some EMG checks showing the activations line up reasonably with human recordings.

The piano task forces real demands on timing, force, and bimanual coordination, so getting generalization beyond the motion dataset is the concrete advance here. The architecture itself is a straightforward extension of prior RL tracking and latent-space work, but applying it end-to-end to muscle-level bimanual control on this benchmark is new enough to matter for the subfield. The hand model upgrade and the EMG alignment are useful additions that make the outputs more believable.

The soft spot is exactly the one the stress test flags. Distilling the low-level trackers into a VAE assumes the latent space keeps the fine variations in timing and force that accurate key presses require. If rapid passages or high-force notes get smoothed out, the high-level policies cannot recover them, and the abstract's SOTA claim rests on that holding. Without reconstruction metrics, VAE ablations, or failure analysis on held-out technical repertoire, it is hard to tell how much precision is actually preserved. Training details like reward weights are also left as free parameters, which limits how easily someone could reproduce the numbers.

This is for people working on physics-based hand control, hierarchical RL, or musculoskeletal simulation in graphics and robotics. A reader already building dexterous controllers or looking for a demanding benchmark would get concrete ideas from the pipeline and the model changes. I would send it to referees. The task is hard, the hierarchy is coherent, and the results are presented clearly enough that a full review can sort out whether the VAE step delivers what the claims require.

Referee Report

3 major / 2 minor

Summary. The manuscript presents MUSIC, a hierarchical data-driven framework for physics-based muscle-driven dexterous hand control to synthesize piano performances on novel music pieces. Low-level RL policies are trained to produce muscle-tendon activations that track trajectories from a reference motion dataset; these policies are distilled into VAE models to produce structured latent spaces. High-level piece-specific policies then operate in the latent space to coordinate bimanual motions from extracted note events in musical scores. An enhanced musculoskeletal hand model is introduced, with evaluations on diverse repertoire claiming SOTA performance in physics-based dexterous control plus physiological plausibility via EMG alignment.

Significance. If the central claims hold with supporting evidence, this would constitute a meaningful advance in physics-based animation and control of high-DoF musculoskeletal systems. The hierarchical RL-plus-VAE approach for generalizing precise dexterous skills beyond reference data, combined with the enhanced hand model and EMG validation, could serve as a template for other complex motor tasks and improve biomechanical fidelity in simulation.

major comments (3)

[Abstract] Abstract: the claim of 'state-of-the-art performance of piano playing in physics-based dexterous control' and 'accurate key presses' is unsupported by any quantitative metrics, error bars, ablation results, or direct comparisons; this is load-bearing because the central claim cannot be verified without these data.
[Method (VAE distillation)] VAE distillation step (described in the method): no per-sequence reconstruction errors, ablation of the VAE component, or failure-case analysis on held-out rapid/high-force passages are reported. This directly undermines the weakest assumption that the latent space retains all necessary timing, force, and activation variations for stable generalization to novel music without loss of precision.
[Evaluation] Evaluation and training details: RL reward weights, hyperparameters, training data exclusion rules, and low-level policy performance metrics are absent. These omissions prevent assessment of whether the low-level tracking policies provide a sufficiently rich basis for the high-level policies to achieve the claimed coordination and accuracy.
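
The per-sequence reconstruction analysis requested in the second major comment could take roughly the following shape. This is a hedged sketch: the function names, the choice of RMSE over muscle activations, and the "twice the median" outlier rule are illustrative assumptions, and the toy arrays stand in for real teacher-policy and VAE-decoder outputs.

```python
import numpy as np


def per_sequence_reconstruction_error(teacher, student):
    """RMSE between teacher and reconstructed activations, one value per sequence.

    teacher, student: lists of arrays, each of shape (T_i, num_muscles).
    """
    errors = []
    for ref, rec in zip(teacher, student):
        if ref.shape != rec.shape:
            raise ValueError("teacher and student sequences must have equal shapes")
        errors.append(float(np.sqrt(np.mean((ref - rec) ** 2))))
    return errors


def flag_outliers(errors, factor=2.0):
    """Indices of sequences whose error exceeds `factor` times the median error."""
    median = float(np.median(errors))
    return [i for i, e in enumerate(errors) if e > factor * median]


# Toy data: three sequences of different lengths, the third poorly reconstructed.
teacher = [np.zeros((5, 3)), np.zeros((4, 3)), np.zeros((6, 3))]
student = [np.full((5, 3), 0.1), np.full((4, 3), 0.1), np.full((6, 3), 0.5)]

errors = per_sequence_reconstruction_error(teacher, student)
worst = flag_outliers(errors)
print(worst)  # [2]
```

Reporting the flagged sequences alongside their musical content (for example, rapid or high-force passages) would show directly whether the latent space loses the details the referee is worried about.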

minor comments (2)

[Method] Clarify the exact interaction frequencies between low-level muscle control and high-level latent policies, and how the VAE encoder/decoder is queried during high-level rollouts.
[Results] Add a table or figure explicitly comparing the enhanced hand model against the prior model on biomechanical stability and tracking precision metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments identify key areas where additional quantitative support and implementation details will improve the verifiability of our claims. We address each major comment point by point below and will incorporate the requested information in the revised manuscript.

Referee: [Abstract] Abstract: the claim of 'state-of-the-art performance of piano playing in physics-based dexterous control' and 'accurate key presses' is unsupported by any quantitative metrics, error bars, ablation results, or direct comparisons; this is load-bearing because the central claim cannot be verified without these data.

Authors: We agree that the abstract would be strengthened by explicitly summarizing the quantitative results. The evaluation section reports performance on a diverse piano repertoire with demonstrations of coordinated bimanual motions and accurate key presses. In the revision we will update the abstract to include specific metrics (e.g., key-press success rates on novel pieces, comparison scores against baselines) together with references to the figures and tables that contain error bars and ablation results.
Revision: yes

Referee: [Method (VAE distillation)] VAE distillation step (described in the method): no per-sequence reconstruction errors, ablation of the VAE component, or failure-case analysis on held-out rapid/high-force passages are reported. This directly undermines the weakest assumption that the latent space retains all necessary timing, force, and activation variations for stable generalization to novel music without loss of precision.

Authors: We acknowledge the value of these analyses for validating the VAE distillation. The revised manuscript will add per-sequence reconstruction error statistics for the VAE models, an ablation study that isolates the contribution of the VAE component, and a targeted failure-case examination of held-out rapid and high-force passages to confirm that timing, force, and activation details are preserved in the latent space.
Revision: yes

Referee: [Evaluation] Evaluation and training details: RL reward weights, hyperparameters, training data exclusion rules, and low-level policy performance metrics are absent. These omissions prevent assessment of whether the low-level tracking policies provide a sufficiently rich basis for the high-level policies to achieve the claimed coordination and accuracy.

Authors: We will supply the omitted details in the revised version. The updated manuscript will include the full set of RL reward weights, all training hyperparameters, the criteria used to exclude sequences from the reference motion dataset, and quantitative low-level policy metrics (tracking error, success rate, etc.). These will appear in a dedicated subsection or appendix so that readers can evaluate the foundation provided to the high-level policies.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity: hierarchical pipeline is self-contained

The paper's chain trains low-level RL policies on reference motion trajectories to produce muscle activations, distills them into a VAE for latent coordination, and trains high-level policies on independent note events extracted from musical scores to handle novel pieces. This structure uses external data sources and standard techniques without any claimed result reducing to its own inputs by definition, without renaming fitted quantities as predictions, and without load-bearing self-citations or imported uniqueness theorems. Evaluation against held-out repertoire, EMG recordings, and biomechanical comparisons supplies independent benchmarks, confirming the derivation remains non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the approach rests on the validity of the musculoskeletal hand model and standard RL convergence assumptions, with many training hyperparameters left unspecified.

free parameters (2)

RL reward weights and hyperparameters: standard in reinforcement learning training for tracking and coordination policies; not detailed in the abstract.
VAE latent dimension and training parameters: chosen to structure the latent space for high-level control; fitted during distillation.

axioms (2)

Domain assumption: The enhanced musculoskeletal hand model accurately captures human finger biomechanics and stability. Invoked to support fine control and EMG alignment claims.
Ad hoc to paper: Low-level tracking policies can be distilled into a VAE without losing critical control authority for novel tasks. Central to the hierarchical architecture.

pith-pipeline@v0.9.0 · 5577 in / 1292 out tokens · 25889 ms · 2026-05-08T04:46:37.649656+00:00 · methodology
