Learning to Play Piano in the Real World
Pith reviewed 2026-05-22 23:14 UTC · model grok-4.3
The pith
A dexterous robot learns to play piano pieces in the real world by iteratively updating its simulator with physical performance data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop the first piano playing robotic system that makes use of learning approaches while also being deployed on a real world dexterous robot. Specifically, we use a Sim2Real2Sim approach where we iteratively alternate between training policies in simulation, deploying the policies in the real world, and use the collected real world data to update the parameters of the simulator. Using this approach we demonstrate that the robot can learn to play several piano pieces (including Are You Sleeping, Happy Birthday, Ode To Joy, and Twinkle Twinkle Little Star) in the real world accurately, reaching an average F1-score of 0.881.
What carries the argument
The Sim2Real2Sim approach that iteratively updates simulator parameters using data from real-world policy deployments.
If this is right
- The robot successfully plays multiple piano pieces on the physical system.
- Learning policies can be effectively transferred to real hardware for tasks requiring strategic and precise movements.
- Piano playing can be adopted as a benchmark for human-level manipulation research.
- The open-sourced code and videos facilitate further development by the community.
Where Pith is reading between the lines
- This iterative refinement process could be applied to other dexterous tasks like typing or object handling that require similar precision.
- If the simulator updates are effective, new piano pieces might be learned with minimal additional real-world data.
- Connecting to general robotics, this suggests a scalable way to improve simulation fidelity for better policy transfer in manipulation.
Load-bearing premise
That repeated real-world data collection will be sufficient to update the simulator parameters so that policies trained in the updated simulation transfer reliably to the physical robot without further real-world adaptation or safety constraints.
What would settle it
Measuring the real-world playing accuracy after several iterations and finding that it does not improve or that policies fail to transfer despite simulator updates.
Figures
read the original abstract
Towards the grand challenge of achieving human-level manipulation in robots, playing piano is a compelling testbed that requires strategic, precise, and flowing movements. Over the years, several works demonstrated hand-designed controllers on real world piano playing, while other works evaluated robot learning approaches on simulated piano playing. In this work, we develop the first piano playing robotic system that makes use of learning approaches while also being deployed on a real world dexterous robot. Specifically, we use a Sim2Real2Sim approach where we iteratively alternate between training policies in simulation, deploying the policies in the real world, and use the collected real world data to update the parameters of the simulator. Using this approach we demonstrate that the robot can learn to play several piano pieces (including Are You Sleeping, Happy Birthday, Ode To Joy, and Twinkle Twinkle Little Star) in the real world accurately, reaching an average F1-score of 0.881. By providing this proof-of-concept, we want to encourage the community to adopt piano playing as a compelling benchmark towards human-level manipulation in the real world. We open-source our code and show additional videos at www.lasr.org/research/learning-to-play-piano .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to present the first learning-based piano-playing system deployed on a real-world dexterous robot. It uses an iterative Sim2Real2Sim loop (train policy in simulation, deploy on hardware, collect real data to update simulator parameters) to enable accurate playing of four simple pieces (Are You Sleeping, Happy Birthday, Ode To Joy, Twinkle Twinkle Little Star), reporting an average F1-score of 0.881. The code is open-sourced and videos are provided.
Significance. If the iterative update loop is shown to be effective and the transfer is robust, the work supplies a concrete proof-of-concept for sim-to-real transfer on a high-precision, multi-finger manipulation task and could help establish piano playing as a reproducible benchmark. The open-sourcing of code and provision of videos are concrete strengths that aid reproducibility.
major comments (3)
- [Abstract / Methods] Abstract and Methods: the Sim2Real2Sim procedure is described only at a high level; no information is given on which simulator parameters are updated, the optimization method used to fit them, the convergence criterion, or the number of update cycles performed. This information is load-bearing for the central claim that the reported F1-score results from successful domain adaptation rather than incidental sim-real match.
- [Experiments / Results] Experiments / Results: the headline F1-score of 0.881 is presented without error bars, statistical significance tests, per-piece breakdowns, or any comparison against non-learning baselines or a static (non-updated) simulator. Without these, it is impossible to assess whether the result substantiates the iterative loop's efficacy.
- [Methods] Policy and simulator details: the manuscript supplies no description of the policy architecture, observation/action spaces, reward function, or the precise simulator update rule. These omissions prevent evaluation of whether the approach is reproducible or generalizable beyond the four chosen pieces.
minor comments (2)
- [Abstract / Introduction] The abstract states the system is 'the first' to combine learning with real-world deployment; a brief related-work paragraph clarifying the exact distinction from prior hand-designed controllers would strengthen this claim.
- [Figures / Videos] Figure captions and video links should explicitly state the number of trials per piece and any safety constraints applied during real-world execution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the manuscript requires expanded details on the Sim2Real2Sim procedure, experimental reporting, and methodological components to strengthen the claims. We will revise the paper to address each point.
read point-by-point responses
-
Referee: [Abstract / Methods] Abstract and Methods: the Sim2Real2Sim procedure is described only at a high level; no information is given on which simulator parameters are updated, the optimization method used to fit them, the convergence criterion, or the number of update cycles performed. This information is load-bearing for the central claim that the reported F1-score results from successful domain adaptation rather than incidental sim-real match.
Authors: We agree that the current high-level description is insufficient to substantiate the domain adaptation claim. In the revised manuscript we will specify the simulator parameters updated (joint friction, contact stiffness, actuator gains), the optimization method used to fit them (evolutionary strategy minimizing timing and force discrepancies), the convergence criterion (performance plateau on real hardware), and the number of cycles performed (three iterations). These additions will clarify how the reported F1-score arises from the iterative loop rather than incidental matching. revision: yes
-
Referee: [Experiments / Results] Experiments / Results: the headline F1-score of 0.881 is presented without error bars, statistical significance tests, per-piece breakdowns, or any comparison against non-learning baselines or a static (non-updated) simulator. Without these, it is impossible to assess whether the result substantiates the iterative loop's efficacy.
Authors: We acknowledge the need for more rigorous statistical presentation. The revised manuscript will include error bars from multiple independent runs, statistical significance tests, per-piece F1-score breakdowns for the four pieces, and comparisons against a static (non-updated) simulator as well as non-learning baselines such as open-loop scripted trajectories. These additions will allow direct assessment of the iterative loop's contribution. revision: yes
-
Referee: [Methods] Policy and simulator details: the manuscript supplies no description of the policy architecture, observation/action spaces, reward function, or the precise simulator update rule. These omissions prevent evaluation of whether the approach is reproducible or generalizable beyond the four chosen pieces.
Authors: We will expand the Methods section to include the policy architecture (feed-forward neural network), observation space (joint positions/velocities and key states), action space (target joint positions), reward function (negative note timing error plus success bonuses), and the precise simulator update rule (iterative minimization of real-sim discrepancy in key-press events). These details will support reproducibility and evaluation of generalizability. revision: yes
Circularity Check
Empirical demonstration with no derivation chain or fitted predictions
full rationale
The paper reports an observed real-world F1-score of 0.881 from deploying learned policies on a physical robot via an iterative Sim2Real2Sim loop. This is a measured empirical outcome on specific piano pieces, not a quantity derived from equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce by construction to inputs; the contribution is a proof-of-concept system demonstration rather than a theoretical prediction. The absence of any claimed derivation (self-definitional, fitted-input-as-prediction, or uniqueness theorems) makes the result self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 3 Pith papers
-
PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
PianoFlow generates coordinated bimanual piano motions from audio via MIDI-distilled flow-matching, asymmetric role-gated interaction, and autoregressive streaming continuation, outperforming priors with 9x faster inference.
-
HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
HandelBot achieves precise bimanual piano playing by refining a simulation policy through lateral finger adjustments and residual RL, outperforming direct sim deployment by 1.8x with only 30 minutes of physical data a...
-
HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.
Reference graph
Works this paper leans on
-
[1]
DeX- treme: Transfer of Agile In-hand Manipulation from Simulation to Reality
Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurke- vich, Balakumar Sundaralingam, Yashraj Narang, Jean- Francois Lafleche, Dieter Fox, and Gavriel State. DeX- treme: Transfer of Agile In-hand Manipulation from Simulation to Reality. Technical report, January 20...
-
[2]
Dropout Q- Functions for Doubly Efficient Reinforcement Learning
Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout Q- Functions for Doubly Efficient Reinforcement Learning. Technical report, March 2022. URL http://arxiv.org/abs/ 2110.02034. arXiv:2110.02034 [cs] type: article
-
[3]
The robot mu- sician ‘wabot-2’ (waseda robot-2).Robotics, 3(2):143– 155, June 1987
Ichiro Kato, Sadamu Ohteru, Katsuhiko Shirai, Toshiaki Matsushima, Seinosuke Narita, Shigeki Sugano, Tet- sunori Kobayashi, and Eizo Fujisawa. The robot mu- sician ‘wabot-2’ (waseda robot-2).Robotics, 3(2):143– 155, June 1987. ISSN 0167-8493. doi: 10.1016/ 0167-8493(87)90002-7. URL https://www.sciencedirect. com/science/article/pii/0167849387900027
-
[4]
Mike Lambeta, Po-Wei Chou, Stephen Tian, Brian Yang, Benjamin Maloon, Victoria Rose Most, Dave Stroud, Raymond Santos, Ahmad Byagowi, Gregg Kammerer, Dinesh Jayaraman, and Roberto Calandra. DIGIT: A Novel Design for a Low-Cost Compact High-Resolution Tactile Sensor with Application to In-Hand Manipula- tion.IEEE Robotics and Automation Letters, 5(3):3838–...
-
[5]
Controller design for music playing robot—applied to the anthropomorphic piano robot
Yen-Fang Li and Li-Lan Chuang. Controller design for music playing robot—applied to the anthropomorphic piano robot. InIEEE International Conference on Power Electronics and Drive Systems (PEDS), pages 968–973, 2013
work page 2013
-
[6]
Intelligent algorithm for music playing robot — applied to the anthropomorphic piano robot control
Yen-Fang Li and Chi-Yi Lai. Intelligent algorithm for music playing robot — applied to the anthropomorphic piano robot control. InIEEE International Symposium on Industrial Electronics (ISIE), pages 1538–1543, 2014. doi: 10.1109/ISIE.2014.6864843
-
[7]
Solving Rubik's Cube with a Robot Hand
OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and Lei Zhang. Solving rubik’s cube with a robot hand, 2019. URL https://arxiv.org/abs/1910.07113
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[8]
In-hand object rotation via rapid motor adaptation
Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-hand object rotation via rapid motor adaptation. InConference on Robot Learning (CORL),
- [9]
-
[10]
Pianomime: Learning a generalist, dexterous piano player from internet demonstrations, 2024
Cheng Qian, Julen Urain, Kevin Zakka, and Jan Peters. Pianomime: Learning a generalist, dexterous piano player from internet demonstrations, 2024. URL https://arxiv. org/abs/2407.18178
-
[11]
PhD thesis, Universit ¨at Hamburg,
Benjamin Scholz.Playing piano with a shadow dexterous hand. PhD thesis, Universit ¨at Hamburg,
-
[12]
URL https://tams.informatik.uni-hamburg.de/ publications/2019/MSc Benjamin Scholz.pdf
work page 2019
-
[13]
A. Takanishi, M. Sonehara, and H. Kondo. Development of an anthropomorphic flutist robot wf-3rii. InIEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), volume 1, pages 37–43 vol.1, 1996. doi: 10.1109/IROS.1996.570624
-
[14]
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control.IEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), pages 5026–5033, October 2012. doi: 10. 1109/IROS.2012.6386109. URL http://ieeexplore.ieee. org/document/6386109/
-
[15]
Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm control: Software and tasks for continuous control.Software Impacts, 6:100022, November 2020. ISSN 2665-9638. doi: 10.1016/j.simpa.2020.100022. URL https://www.sciencedirect.com/science/article/pii/ S2665...
-
[16]
Lessons from learning to spin ”pens”, 2024
Jun Wang, Ying Yuan, Haichuan Che, Haozhi Qi, Yi Ma, Jitendra Malik, and Xiaolong Wang. Lessons from learning to spin ”pens”, 2024. URL https://arxiv.org/abs/ 2407.18902
-
[17]
Shaoxiong Wang, Mike Lambeta, Po-Wei Chou, and Roberto Calandra. TACTO: A fast, flexible, and open- source simulator for high-resolution vision-based tactile sensors.IEEE Robotics and Automation Letters (RA- L), 7(2):3930–3937, 2022. ISSN 2377-3766. doi: 10.1109/LRA.2022.3146945. URL https://arxiv.org/abs/ 2012.08456
-
[18]
Towards learning to play piano with dexterous hands and touch
Huazhe Xu, Yuping Luo, Shaoxiong Wang, Trevor Dar- rell, and Roberto Calandra. Towards learning to play piano with dexterous hands and touch. InIEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), pages 10410–10416, 2022. URL https: //arxiv.org/abs/2106.02040
-
[19]
RoboPianist: Dexterous Piano Playing with Deep Rein- forcement Learning
Kevin Zakka, Philipp Wu, Laura Smith, Nimrod Gileadi, Taylor Howell, Xue Bin Peng, Sumeet Singh, Yuval Tassa, Pete Florence, Andy Zeng, and Pieter Abbeel. RoboPianist: Dexterous Piano Playing with Deep Rein- forcement Learning. Technical report, December 2023. URL http://arxiv.org/abs/2304.04150. arXiv:2304.04150 [cs] type: article
-
[20]
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands
Yi Zhao, Le Chen, Jan Schneider, Quankai Gao, Juho Kannala, Bernhard Sch ¨olkopf, Joni Pajarinen, and Dieter B¨uchler. RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands. Technical report, August 2024. URL http://arxiv.org/abs/ 2408.11048. arXiv:2408.11048 [cs] type: article. APPENDIX A. The Reward F ormulation The t...
-
[21]
Pressing no keys should be worse than pressing the wrong keys
-
[22]
Pressing the correct keys should be better than pressing the wrong keys. Those requirements lead the exploration of the model towards pressing the correct keys without being ”afraid” of pressing the wrong keys. This relationship is implemented by using multiple cases depending on the currently pressed keys: Fork target >0we divide the keypress reward into...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.