Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids
Pith reviewed 2026-07-01 08:07 UTC · model grok-4.3
The pith
Motion retargeting from infant videos to humanoid robots generates simulated multisensory streams; on the best-matching embodiment the retargeting reaches sub-centimeter accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From a single video, the method extracts the infant's skeletal structure and estimates the full 3D pose in each frame, then maps the reconstructed motion onto the iCub, pyCub, EMFANT and MIMo embodiments. Replaying the retargeted motions on these platforms yields multisensory streams of proprioception (joints and muscles), touch and vision. For the best-matching embodiment the retargeting reaches sub-centimeter accuracy, which the authors say enables multimodal analysis of infant development and enhanced automated behavior annotation.
What carries the argument
The motion retargeting pipeline: it reconstructs skeletal structure and 3D pose from video, maps the pose to humanoid joint angles, and replays the motion through each platform's sensor models to produce proprioceptive, tactile and visual streams.
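As a rough illustration of the pose-to-joint-angle step, the sketch below computes one hinge angle (an elbow) from three 3D keypoints and clamps it to a joint range. The keypoint coordinates, the function names and the joint limits are hypothetical placeholders, not values from the paper or from any of the listed platforms, and the actual pipeline may well solve for all joints jointly rather than one at a time.

```python
import math


def joint_angle(a, b, c):
    """Interior angle at point b (radians) between segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("degenerate segment: coincident keypoints")
    # Guard against rounding pushing the cosine slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)


def clamp_to_limits(angle, lower, upper):
    """Restrict a commanded angle to the target joint's admissible range."""
    return max(lower, min(upper, angle))


# Hypothetical reconstructed keypoints for one frame, in meters.
shoulder = (0.0, 0.0, 0.0)
elbow = (0.10, 0.0, 0.0)
wrist = (0.10, 0.08, 0.0)

# Flexion is taken here as the deviation from a fully extended arm.
interior = joint_angle(shoulder, elbow, wrist)
flexion = math.pi - interior

# Assumed limits for illustration only; real limits differ per platform.
HYPOTHETICAL_ELBOW_LIMITS = (0.0, math.radians(100.0))
command = clamp_to_limits(flexion, *HYPOTHETICAL_ELBOW_LIMITS)
```

Clamping is the simplest way to respect joint limits; whenever it activates, the robot's pose departs from the infant's, which is one place retargeting error can enter.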
If this is right
- Multimodal analysis of infant development becomes possible from ordinary video recordings alone.
- Automated annotation of infant behaviors gains an additional layer of proprioceptive and tactile context.
- Robotics, developmental science and early neurodevelopmental screening each receive a new source of synthetic first-person data.
Where Pith is reading between the lines
- Different humanoid platforms could be ranked by how faithfully their generated sensor streams reproduce patterns seen in real infant data.
- The same retargeting pipeline might be applied to videos of older children or adults once embodiment parameters are adjusted accordingly.
- Synthetic datasets produced this way could serve as training material for models that learn to predict infant actions from partial observations.
Load-bearing premise
The retargeted motion on a robot body will produce sensor streams that meaningfully approximate an infant's own experience only when the robot's proportions, joint limits and sensor placement are close enough to a baby's.
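One way to make "close enough" concrete is to compare body proportions after normalizing by a common segment. The sketch below is an assumption-laden illustration: the segment names, the choice of torso length as the normalizer, and all lengths are made up for the example and are not measurements of any infant or platform.

```python
def segment_ratios(lengths):
    """Express each segment length as a fraction of torso length."""
    torso = lengths["torso"]
    if torso <= 0:
        raise ValueError("torso length must be positive")
    return {name: value / torso for name, value in lengths.items() if name != "torso"}


def proportion_mismatch(source, target):
    """Relative difference in torso-normalized proportions, per shared segment."""
    src = segment_ratios(source)
    tgt = segment_ratios(target)
    shared = sorted(set(src) & set(tgt))
    return {name: abs(tgt[name] - src[name]) / src[name] for name in shared}


# Made-up segment lengths in meters, for illustration only.
infant = {"torso": 0.20, "upper_arm": 0.10, "forearm": 0.08, "thigh": 0.10}
robot = {"torso": 0.40, "upper_arm": 0.24, "forearm": 0.16, "thigh": 0.25}

mismatch = proportion_mismatch(infant, robot)
worst_segment = max(mismatch, key=mismatch.get)
```

Because the ratios are scale-free, a robot that is simply a larger copy of the infant would score zero on every segment; only differences in shape register.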
What would settle it
Direct comparison of the generated sensor streams against simultaneous physiological or behavioral recordings from real infants performing the same movements would show whether the simulated data match actual infant experience.
Original abstract
Motion retargeting from humans to human-like artificial agents is becoming increasingly important as humanoid robots grow more capable. However, most existing approaches focus only on reproducing kinematics and ignore the rich sensorimotor experience associated with human movement. In this work, we present a framework for simulating the multimodal sensorimotor experiences of infants using physical and virtual humanoids. From a single video, our method reconstructs the infant's body configuration by extracting its skeletal structure and estimating the full 3D pose from each frame. Then we map the reconstructed motion onto several developmental platforms: the physical iCub robot and the virtual simulators pyCub, EMFANT and MIMo. Replaying the retargeted motions on these embodiments produces simulated multisensory streams including proprioception (joints and muscles), touch, and vision. For the best-matching embodiment, the retargeting achieves sub-centimeter accuracy and enables a rich multimodal analysis of infant development as well as enhanced automated annotation of behaviors. This framework provides a unique window into the infant's sensorimotor experience, offering new tools for robotics, developmental science, and early detection of neurodevelopmental disorders. The code is available at https://github.com/ctu-vras/motion-retargeting/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a pipeline that extracts 3D skeletal poses from infant video, retargets the kinematics onto the physical iCub and the virtual platforms pyCub, EMFANT and MIMo, and replays the motions to generate simulated proprioceptive, tactile, and visual streams. It reports sub-centimeter end-effector accuracy on the best-matching embodiment and claims this supplies a window into infant first-person sensorimotor experience for developmental analysis and behavior annotation.
Significance. If the retargeted streams were shown to approximate infant multisensory experience, the framework would supply otherwise inaccessible longitudinal data for developmental science and robotics; the public code release is a clear strength that supports reproducibility.
major comments (2)
- [Abstract and §3] Abstract and §3 (method): the central claim that retargeted motions 'simulate the multimodal sensorimotor experiences of infants' and provide 'a unique window into the infant's sensorimotor experience' rests on the untested assumption that kinematic mapping to iCub-scale embodiments produces proprioceptive/tactile/visual streams that meaningfully match an infant's; large differences in head-to-torso ratio, limb lengths, joint limits, and sensor density are not quantified or compensated, so sub-centimeter kinematic error on the target robot does not establish correspondence of the generated sensor streams.
- [§4] §4 (results): the reported sub-centimeter accuracy is stated only for the best-matching embodiment, yet no error distributions, per-joint breakdowns, or comparison against infant ground-truth sensor data (or even against a same-scale infant model) are provided; without these, it is impossible to judge whether post-processing choices preserve the claimed multimodal fidelity.
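The per-joint breakdown and error distribution requested above could be reported along the following lines. This is a minimal sketch under stated assumptions: the trajectories are invented, the keypoint names and helper functions are hypothetical, and the nearest-rank 95th percentile is just one of several common percentile conventions.

```python
import math
import statistics


def position_errors(reference, achieved):
    """Per-frame Euclidean distance between two equal-length 3D trajectories."""
    if len(reference) != len(achieved):
        raise ValueError("trajectories must have the same number of frames")
    return [math.dist(r, a) for r, a in zip(reference, achieved)]


def summarize(errors):
    """Mean, median, nearest-rank 95th percentile and maximum of the errors."""
    if not errors:
        raise ValueError("no errors to summarize")
    ordered = sorted(errors)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Invented trajectories in meters: scaled source keypoints vs. robot keypoints.
reference = {
    "wrist": [(0.00, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0), (0.03, 0.0, 0.0)],
    "ankle": [(0.0, 0.0, 0.0)] * 4,
}
achieved = {
    "wrist": [(x, 0.003, 0.004) for x in (0.00, 0.01, 0.02, 0.03)],
    "ankle": [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.0, 0.0, 0.0), (0.02, 0.0, 0.0)],
}

report = {
    joint: summarize(position_errors(reference[joint], achieved[joint]))
    for joint in reference
}
# A keypoint counts as sub-centimeter here only if its worst frame is below 1 cm.
sub_centimeter = {joint: stats["max"] < 0.01 for joint, stats in report.items()}
```

Reporting the maximum and a high percentile alongside the mean matters because a keypoint can have a small average error while still drifting well past a centimeter on individual frames, as the invented ankle trajectory does.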
minor comments (2)
- The abstract states that code is available at the cited GitHub link; this should be repeated with a precise commit hash or release tag in the main text.
- Notation for the retargeting mapping (e.g., how joint angles and muscle lengths are scaled) is introduced without an explicit equation or pseudocode block; adding one would improve clarity.
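For orientation, one generic way such a mapping is often written is as a per-frame constrained least-squares problem. The formulation below is an assumption offered for illustration, not the equation used in the paper; every symbol is introduced here. It takes $p_{i,t}$ as the infant's keypoint $i$ at frame $t$ in a root-centered frame, $f_i(q)$ as the forward kinematics of the target embodiment for the matching keypoint, $s$ as a uniform scale between the two bodies, $w_i$ as per-keypoint weights, and $q_{\min}, q_{\max}$ as the target's joint limits.

```latex
\[
  q_t^{*} \;=\; \arg\min_{q_{\min} \le q \le q_{\max}}
  \sum_{i=1}^{K} w_i \,\bigl\lVert s\,p_{i,t} - f_i(q) \bigr\rVert^{2},
  \qquad
  s \;=\; \frac{L_{\mathrm{robot}}}{L_{\mathrm{infant}}}
\]
\[
  e_{i,t} \;=\; \bigl\lVert s\,p_{i,t} - f_i(q_t^{*}) \bigr\rVert
\]
```

Under this reading, a "sub-centimeter" result would be a statement about the residuals $e_{i,t}$ on the robot's own scale. How muscle lengths are scaled is not covered by this sketch.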
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, clarifying the scope of our claims and indicating planned revisions where appropriate.
- Referee: [Abstract and §3] Abstract and §3 (method): the central claim that retargeted motions 'simulate the multimodal sensorimotor experiences of infants' and provide 'a unique window into the infant's sensorimotor experience' rests on the untested assumption that kinematic mapping to iCub-scale embodiments produces proprioceptive/tactile/visual streams that meaningfully match an infant's; large differences in head-to-torso ratio, limb lengths, joint limits, and sensor density are not quantified or compensated, so sub-centimeter kinematic error on the target robot does not establish correspondence of the generated sensor streams.
  Authors: The manuscript presents a kinematic retargeting pipeline that generates simulated sensor streams on available embodiments; it does not assert that these streams are identical to an infant's due to inherent morphological mismatches. Sub-centimeter accuracy on the best-matching platform demonstrates faithful reproduction of the input motion, which in turn drives the simulated proprioception, touch, and vision. We agree the language in the abstract and §3 overstates the degree of correspondence and will revise it to describe the output as an approximation suitable for developmental analysis. We will also add explicit discussion of unquantified differences (e.g., limb proportions, sensor density) and their implications for sensor-stream fidelity.
  revision: partial
- Referee: [§4] §4 (results): the reported sub-centimeter accuracy is stated only for the best-matching embodiment, yet no error distributions, per-joint breakdowns, or comparison against infant ground-truth sensor data (or even against a same-scale infant model) are provided; without these, it is impossible to judge whether post-processing choices preserve the claimed multimodal fidelity.
  Authors: We will expand §4 to report full error distributions and per-joint breakdowns for all tested embodiments. However, no infant ground-truth multimodal sensor recordings exist, which is the central motivation for the simulation framework; a same-scale infant model comparison is likewise outside the present scope. We will add text stating these limitations explicitly and note that the reported kinematic accuracy is the best available proxy for assessing post-processing effects on the generated streams.
  revision: partial
- Empirical comparison against real infant multimodal sensor data, which does not exist.
Circularity Check
No circularity: pipeline of reconstruction and retargeting with no fitted parameters or self-referential definitions
full rationale
The paper describes a forward pipeline: video-based skeletal extraction, 3D pose estimation, kinematic mapping to target embodiments (iCub, pyCub, etc.), and replay to generate simulated sensor streams. No equations, fitted parameters, or predictions are defined in terms of themselves. Sub-centimeter accuracy is reported as an empirical outcome of the mapping on the best-matching body, not used to define or justify the method. No self-citations are invoked as load-bearing uniqueness theorems. The central claim (that retargeted streams enable multimodal analysis) rests on the described engineering steps rather than reducing to its own inputs by construction. This is a standard non-circular methodological contribution.
Forward citations
Cited by 1 Pith paper
- Embodiment Shapes Rolling Behavior in a Multimodal Infant Model
A reinforcement learning model of a multimodal virtual infant produces rolling behaviors that reproduce age-related improvements and coordination patterns observed in human infants, shaped by changing body morphology.