Infant Spontaneous Movement Noise Improves Exploration in Deep RL
Pith reviewed 2026-06-27 04:15 UTC · model grok-4.3
The pith
Progressively increasing the correlation of action noise to match infant movement statistics improves exploration efficiency in deep RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By calibrating the power spectral density exponent of action noise to increase over the course of training in the same manner observed in infants' end-effector velocities, the resulting temporally correlated noise generates exploratory behavior with improved state-space coverage and higher learning efficiency than temporally uncorrelated white noise or static colored noise in several standard RL benchmarks.
What carries the argument
A training-time schedule that raises the spectral exponent of the exploration noise process to replicate the developmental increase measured in infant end-effector velocity power spectra.
If this is right
- Structured exploratory trajectories emerge from the time-varying colored noise.
- Learning curves improve relative to conventional white-noise and fixed colored-noise strategies.
- The same noise schedule works across multiple distinct RL environments.
- Human developmental statistics can supply concrete design choices for agent learning mechanisms.
Where Pith is reading between the lines
- The same developmental-noise principle could be tested in continuous-control robotic tasks where state-space coverage is costly.
- Alternative developmental trajectories drawn from other motor or cognitive milestones might yield further efficiency gains.
- If the benefit scales with task complexity, the approach would be most valuable in high-dimensional or sparse-reward settings.
Load-bearing premise
The specific age-dependent increase in spectral exponent seen in human infant movements supplies a useful inductive bias for reinforcement-learning exploration rather than being incidental to biology.
What would settle it
Re-running the RL experiments with a constant spectral exponent (no developmental increase) and obtaining equal or worse sample efficiency than white-noise baselines would falsify the central claim.
Figures
read the original abstract
Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies' end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is available at https://github.com/trieschlab/baby-noise-rl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an exploration mechanism for deep RL in which action noise is generated as colored noise whose spectral exponent (and thus temporal auto-correlation) is progressively increased during training to match the developmental trajectory observed in the power spectral densities of infant end-effector velocities. Experiments across several RL environments are reported to demonstrate that this infant-inspired schedule yields structured exploratory trajectories and improved sample efficiency relative to standard white-noise or fixed colored-noise baselines.
Significance. If the reported gains are robust and the infant trajectory is shown to be necessary, the work would indicate that developmental statistics from human motor behavior can supply a useful inductive bias for exploration design in artificial agents. The public release of code supports reproducibility.
major comments (1)
- [Experiments] Experiments section: the central claim that matching the specific infant developmental increase in spectral exponent supplies a useful inductive bias requires evidence that this trajectory outperforms simpler monotonic schedules. No comparison is presented against a linear ramp of correlation, a fixed colored noise with the same average exponent, or a random schedule with matched average correlation; without these controls the infant data are not shown to be load-bearing.
minor comments (1)
- Abstract: quantitative performance deltas, statistical tests, and environment details are absent; these should be added so readers can immediately assess the magnitude of the reported improvement.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential significance of the work. We address the major comment below.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the central claim that matching the specific infant developmental increase in spectral exponent supplies a useful inductive bias requires evidence that this trajectory outperforms simpler monotonic schedules. No comparison is presented against a linear ramp of correlation, a fixed colored noise with the same average exponent, or a random schedule with matched average correlation; without these controls the infant data are not shown to be load-bearing.
Authors: We agree that the manuscript would be strengthened by direct comparisons demonstrating that the specific infant developmental trajectory is necessary for the reported benefits, rather than any monotonic increase or matched-average schedule. In the revised version we will add these controls: a linear ramp of the spectral exponent, fixed colored noise using the average exponent from the infant data, and a randomized schedule with matched average correlation. The new results will be reported in the Experiments section. revision: yes
Circularity Check
No significant circularity; empirical evaluation stands independent of biological inspiration
full rationale
The paper introduces a noise schedule whose parameters are taken directly from external infant movement measurements and then tests the resulting agent performance against standard baselines in multiple RL environments. No derivation, fitted parameter, or uniqueness theorem is presented whose output is definitionally identical to its input. The central claim (improved exploration efficiency) is evaluated by direct comparison to conventional strategies rather than by any self-referential construction. Self-citations, if present, are not load-bearing for the result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Infant spontaneous movements produce end-effector velocities whose power spectral density follows a colored noise process with age-dependent spectral exponent
Reference graph
Works this paper leans on
-
[1]
Helpless infants are learning a foundation model,
R. Cusack, M. Ranzato, and C. J. Charvet, “Helpless infants are learning a foundation model,”Trends in Cognitive Sciences, vol. 28, no. 8, pp. 726–738, 2024
2024
-
[2]
Continuous control with deep reinforcement learning,
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. M. O. Heess, T. Erez, Y . Tassa, D. Silver, and D. P. Wierstra, “Continuous control with deep reinforcement learning,” in4th International Conference on Learning Representation, 2016. TABLE I: Average AUC scores for different algorithms and environments, z-scored across multiple noises and seeds. Noise types com...
2016
-
[3]
Addressing function approxi- mation error in actor-critic methods,
S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approxi- mation error in actor-critic methods,” inInternational conference on machine learning. PMLR, 2018, pp. 1587–1596
2018
-
[4]
Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational conference on machine learning. PMLR, 2018, pp. 1861–1870
2018
-
[5]
Colored noise in dynamical systems,
P. H ¨aunggi and P. Jung, “Colored noise in dynamical systems,” Advances in chemical physics, vol. 89, pp. 239–326, 1994
1994
-
[6]
Sample-efficient cross-entropy method for real-time planning,
C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius, “Sample-efficient cross-entropy method for real-time planning,” inConference on Robot Learning. PMLR, 2021, pp. 1049– 1065
2021
-
[7]
Action noise in off-policy deep reinforcement learning: Impact on exploration and performance,
J. Hollenstein, S. Auddy, M. Saveriano, E. Renaudo, and J. Piater, “Action noise in off-policy deep reinforcement learning: Impact on exploration and performance,”Transactions on Machine Learning Research, 2022
2022
-
[8]
Pink noise is all you need: Colored noise exploration in deep reinforcement learning,
O. Eberhard, J. Hollenstein, C. Pinneri, and G. Martius, “Pink noise is all you need: Colored noise exploration in deep reinforcement learning,” inThe Eleventh International Conference on Learning Representations (ICLR), 2023
2023
-
[9]
Colored noise in ppo: Improved exploration and performance through correlated action sampling,
J. J. Hollenstein, G. Martius, and J. H. Piater, “Colored noise in ppo: Improved exploration and performance through correlated action sampling,” inProceedings of the AAAI Conference on Artificial Intelligence, 2024, pp. 12 466–12 472
2024
-
[10]
Pink noise LQR: How does colored noise affect the optimal policy in RL?
J. Hollenstein, M. Zaric, S. Tosatto, and J. Piater, “Pink noise LQR: How does colored noise affect the optimal policy in RL?” inICML 2024 Workshop: Foundations of Reinforcement Learning and Control– Connections and Perspectives, 2024
2024
-
[11]
Coordination, control and skill,
K. M. Newell, “Coordination, control and skill,” inAdvances in psychology. Elsevier, 1985, vol. 27, pp. 295–317
1985
-
[12]
Stride- to-stride time intervals are independently affected by the temporal pattern and probability distribution of visual cues,
P. C. Raffalt, J. H. Sommerfeld, N. Stergiou, and A. D. Likens, “Stride- to-stride time intervals are independently affected by the temporal pattern and probability distribution of visual cues,”Neuroscience letters, vol. 792, p. 136909, 2023
2023
-
[13]
Frequency-specific fractal analysis of postural control accounts for control strategies,
P. Gilfriche, V . Deschodt-Arsac, E. Blons, and L. M. Arsac, “Frequency-specific fractal analysis of postural control accounts for control strategies,”Frontiers in physiology, vol. 9, p. 293, 2018
2018
-
[14]
Scale-free brain activity: past, present, and future,
B. J. He, “Scale-free brain activity: past, present, and future,”Trends in cognitive sciences, vol. 18, no. 9, pp. 480–487, 2014
2014
-
[15]
Thelen and L
E. Thelen and L. B. Smith,A dynamic systems approach to the development of cognition and action. MIT press, 1994
1994
-
[16]
Development as a dynamic system,
L. B. Smith and E. Thelen, “Development as a dynamic system,” Trends in cognitive sciences, vol. 7, no. 8, pp. 343–348, 2003
2003
-
[17]
Early human motor development: From variation to the ability to vary and adapt,
M. Hadders-Algra, “Early human motor development: From variation to the ability to vary and adapt,”Neuroscience & Biobehavioral Reviews, vol. 90, pp. 411–427, 2018
2018
-
[18]
Intrinsic motivation systems for autonomous mental development,
P.-Y . Oudeyer, F. Kaplan, and V . V . Hafner, “Intrinsic motivation systems for autonomous mental development,”IEEE transactions on evolutionary computation, vol. 11, no. 2, pp. 265–286, 2007
2007
-
[19]
Lessons from infant learning for unsupervised machine learning,
L. Zaadnoordijk, T. R. Besold, and R. Cusack, “Lessons from infant learning for unsupervised machine learning,”Nature Machine Intelligence, vol. 4, no. 6, pp. 510–520, 2022
2022
-
[20]
Time to aug- ment self-supervised visual representation learning,
A. Aubret, M. R. Ernst, C. Teuli `ere, and J. Triesch, “Time to aug- ment self-supervised visual representation learning,” inThe Eleventh International Conference on Learning Representations (ICLR), 2023
2023
-
[21]
Openpose: Realtime multi-person 2d pose estimation using part affinity fields,
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y . A. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019
2019
-
[22]
On generating power law noise
J. Timmer and M. Koenig, “On generating power law noise.”Astron- omy and Astrophysics, v. 300, p. 707, vol. 300, p. 707, 1995
1995
-
[23]
The nature and perception of fluctuations in human musical rhythms,
H. Hennig, R. Fleischmann, A. Fredebohm, Y . Hagmayer, J. Nagler, A. Witt, F. J. Theis, and T. Geisel, “The nature and perception of fluctuations in human musical rhythms,”PloS one, vol. 6, no. 10, p. e26457, 2011
2011
-
[24]
Long-range correlations in human standing,
M. Duarte and V . M. Zatsiorsky, “Long-range correlations in human standing,”Physics Letters A, vol. 283, no. 1-2, pp. 124–128, 2001
2001
-
[25]
Stable-baselines3: Reliable reinforcement learning im- plementations,
A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning im- plementations,”Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021
2021
-
[26]
Gymnasium: A standard interface for reinforcement learning environments,
M. Towers, A. Kwiatkowski, J. U. Balis, G. D. Cola, T. Deleu, M. Goul ˜ao, K. Andreas, M. Krimmel, A. KG, R. D. L. Perez- Vicente, J. K. Terry, A. Pierr ´e, S. V . Schulhoff, J. J. Tai, H. Tan, and O. G. Younis, “Gymnasium: A standard interface for reinforcement learning environments,” inThe Thirty-ninth Annual Conference on Neural Information Processing ...
2025
-
[27]
‘Helpless’ infants are active, goal-directed agents: response to Cusack et al
M. Zettersten, R. Foushee, and M. K. Goddu, “‘Helpless’ infants are active, goal-directed agents: response to Cusack et al.”Trends in Cognitive Sciences, 2025
2025
-
[28]
MIMo: A multimodal infant model for studying cognitive development,
D. Mattern, P. Schumacher, F. M. L ´opez, M. C. Raabe, M. R. Ernst, A. Aubret, and J. Triesch, “MIMo: A multimodal infant model for studying cognitive development,”IEEE Transactions on Cognitive and Developmental Systems, vol. 16, no. 4, pp. 1291–1301, 2024
2024
-
[29]
MIMo grows! simulating body and sensory development in a mul- timodal infant model,
F. M. L ´opez, M. Lenz, M. G. Fedozzi, A. Aubret, and J. Triesch, “MIMo grows! simulating body and sensory development in a mul- timodal infant model,” in2025 IEEE International Conference on Development and Learning (ICDL). IEEE, 2025, pp. 1–6
2025
-
[30]
Embodiment shapes rolling behavior in a multimodal infant model,
L. Philipp, F. M. L ´opez, and J. Triesch, “Embodiment shapes rolling behavior in a multimodal infant model,” in2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1–7
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.