Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain
Pith reviewed 2026-05-12 04:18 UTC · model grok-4.3
The pith
Equilibrium propagation trains a reinforcement learning controller for quadruped locomotion on uneven terrain that matches backpropagation performance while using far less memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces an equilibrium-propagation-based proximal policy optimization algorithm for training continuous-control policies in a neuromorphic-compatible way. It combines this with a CPG-based policy for basic locomotion and a residual policy for terrain adaptation, deriving a nudging signal and clipping rule to stabilize the updates. On a 12-DoF A1 quadruped in a two-stage uneven terrain task, the resulting controller matches a backpropagation-trained PPO baseline in success rate, velocity tracking, power use, and stability, while requiring 4.3× less GPU memory than BPTT.
What carries the argument
The EP-compatible PPO output-nudging signal combined with a two-sided ratio clipping mechanism that stabilizes policy updates during the relaxation phase of equilibrium propagation.
If this is right
- The controller achieves stable policy convergence in a two-stage uneven terrain locomotion task.
- Locomotion performance matches a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability.
- GPU memory efficiency improves by 4.3 times compared with backpropagation through time.
- Local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation.
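The memory claim follows from the structure of EP itself: BPTT must cache activations for every timestep of the unrolled graph, while EP's contrastive update needs only the equilibrium states of a free and a weakly nudged relaxation. A toy one-neuron sketch of that two-phase update (illustrative only: the hard-sigmoid energy, nudging strength `beta`, and learning rates are assumptions, not the paper's formulation):

```python
def rho(s):
    """Hard-sigmoid activation, as commonly used in EP papers."""
    return min(max(s, 0.0), 1.0)

def relax(w, x, target, beta, steps=200, lr=0.05):
    """Relax the state s to an equilibrium of the total energy
    F(s) = 0.5*s^2 - rho(s)*w*x + beta*0.5*(rho(s) - target)^2
    by gradient descent. Only the final scalar state is kept."""
    F = lambda u: 0.5*u*u - rho(u)*w*x + beta*0.5*(rho(u) - target)**2
    s, eps = 0.0, 1e-5
    for _ in range(steps):
        g = (F(s + eps) - F(s - eps)) / (2*eps)  # numerical dF/ds
        s -= lr * g
    return s

def ep_update(w, x, target, beta=0.5, lr_w=0.1):
    s_free = relax(w, x, target, beta=0.0)    # free phase
    s_nudge = relax(w, x, target, beta=beta)  # weakly nudged phase
    # Contrastive local rule: only the two equilibrium states are needed,
    # no matter how many relaxation steps were taken.
    dw = (rho(s_nudge) - rho(s_free)) * x / beta
    return w + lr_w * dw

w = 0.1
for _ in range(100):
    w = ep_update(w, x=1.0, target=0.8)
print(round(rho(relax(w, 1.0, 0.8, beta=0.0)), 2))  # -> 0.8
```

Only `s_free` and `s_nudge` survive the relaxation loops, so memory is independent of the number of relaxation steps; this is the structural reason one would expect savings over BPTT, which must store every intermediate state of the unrolled trajectory.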
Where Pith is reading between the lines
- This approach could allow robots to adapt their locomotion policies on the fly to changes like payload shifts or actuator wear without needing to return to a full simulation environment.
- Integration with actual neuromorphic chips might further reduce power consumption for continuous learning on physical hardware.
- Similar local-learning techniques might extend to other embodied control problems such as manipulation or navigation where backpropagation is costly.
Load-bearing premise
The assumption that the derived EP-compatible signals and clipping mechanism will transfer stably from simulation to real hardware despite sensor noise, actuator delays, and unmodeled dynamics.
What would settle it
A physical experiment on the A1 quadruped robot showing either unstable policy updates, significant performance degradation, or memory efficiency not translating when sensor noise and delays are present.
Original abstract
Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by 4.3× compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for quadruped locomotion control. It combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, derives an EP-compatible output-nudging signal and two-sided ratio clipping mechanism to enable local learning for continuous-control policies, and reports simulation results on a 12-DoF A1 quadruped showing stable convergence on a two-stage uneven-terrain task with performance metrics comparable to a backpropagation-trained PPO baseline and 4.3× lower GPU memory usage than BPTT.
Significance. If the local EP updates remain stable, the work could support energy-efficient on-device adaptation for robotic systems by making RL compatible with neuromorphic substrates. Credit is due for the first-principles derivation of the nudging signal and clipping rule that allows EP to handle stochastic policies, as well as the concrete memory-efficiency demonstration in a high-dimensional locomotion task.
Major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: the claim of 'comparable' performance in success rate, velocity tracking, actuator power, and body stability is presented without error bars, standard deviations across random seeds, or any statistical tests. This leaves the quantitative support for equivalence to the BPTT baseline moderate at best and weakens the central empirical claim.
- [Introduction and Conclusion] Introduction and Conclusion: the manuscript states that the results 'provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.' However, all validation occurs in clean simulation; no analysis or experiments examine whether the derived EP-compatible PPO nudging signal and two-sided clipping remain stable under sensor noise, actuator delays, or unmodeled dynamics. This gap is load-bearing for the neuromorphic and on-robot positioning.
Minor comments (1)
- [Methods] The two-sided ratio clipping mechanism is described in the methods but would benefit from an explicit equation or pseudocode block to improve reproducibility and clarity of the stabilization rule.
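For illustration, one plausible shape such a rule could take (hypothetical: the manuscript's exact equation is not reproduced here, and `eps` and the hard ratio clip are assumptions), contrasted with the standard PPO surrogate:

```python
def ppo_clip(ratio, adv, eps=0.2):
    """Standard PPO surrogate: the pessimistic min() bounds the
    objective on one side only -- for adv < 0 it is unbounded below."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)

def two_sided_clip(ratio, adv, eps=0.2):
    """Hypothetical two-sided variant: the ratio itself is hard-clipped
    on both sides before forming the surrogate, so any output-nudging
    signal derived from it stays bounded for either advantage sign."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return clipped * adv

# For an over-large ratio with a negative advantage, standard PPO
# leaves the surrogate unbounded while the two-sided form caps it:
print(round(ppo_clip(1.5, -1.0), 2), round(two_sided_clip(1.5, -1.0), 2))  # -> -1.5 -1.2
```

Whatever the paper's precise rule, a block in roughly this form would let readers verify the boundedness property the stabilization argument rests on.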
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We appreciate the positive assessment of the work's significance and the credit given for the derivation of the EP-compatible nudging signal and clipping mechanism. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: the claim of 'comparable' performance in success rate, velocity tracking, actuator power, and body stability is presented without error bars, standard deviations across random seeds, or any statistical tests. This leaves the quantitative support for equivalence to the BPTT baseline moderate at best and weakens the central empirical claim.
Authors: We agree that the absence of error bars, standard deviations, and statistical tests weakens the support for the comparability claims. In the revised manuscript, we will conduct additional runs with multiple random seeds (at least 5 per condition), report all metrics as mean ± standard deviation, and include statistical comparisons (e.g., paired t-tests or Mann-Whitney U tests with p-values) between the EP-PPO and BPTT-PPO results to quantify the degree of equivalence. revision: yes
-
Referee: [Introduction and Conclusion] Introduction and Conclusion: the manuscript states that the results 'provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.' However, all validation occurs in clean simulation; no analysis or experiments examine whether the derived EP-compatible PPO nudging signal and two-sided clipping remain stable under sensor noise, actuator delays, or unmodeled dynamics. This gap is load-bearing for the neuromorphic and on-robot positioning.
Authors: We acknowledge that the experiments are confined to idealized simulation without explicit modeling of sensor noise, actuator delays, or unmodeled dynamics. The manuscript's core contribution is the first-principles derivation of EP-compatible mechanisms for stochastic continuous-control policies and their empirical validation in a 12-DoF locomotion task. While we maintain that these mechanisms provide an algorithmic foundation, we agree the on-robot and neuromorphic positioning would be strengthened by robustness analysis. In revision we will (i) add a Limitations section that explicitly states the simulation-only scope and (ii) moderate the language in the Introduction and Conclusion to frame the results as a necessary first step toward on-robot adaptation rather than a direct enabler. Full noise-robustness studies remain future work. revision: partial
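The seed-aggregation protocol promised in the first response is simple to mechanize. A minimal sketch (the per-seed metric values below are invented placeholders, not results from the paper):

```python
import statistics

def summarize(name, runs):
    """Report a per-seed metric as mean ± sample standard deviation."""
    m = statistics.mean(runs)
    sd = statistics.stdev(runs)  # sample std dev, n-1 denominator
    print(f"{name}: {m:.3f} ± {sd:.3f} (n={len(runs)})")
    return m, sd

# hypothetical success rates over 5 random seeds per method
ep_ppo   = [0.91, 0.88, 0.93, 0.90, 0.89]
bptt_ppo = [0.92, 0.90, 0.91, 0.93, 0.88]
summarize("EP-PPO success rate", ep_ppo)
summarize("BPTT-PPO success rate", bptt_ppo)
```

A paired t-test or Mann-Whitney U test across the seed-wise values, as the authors propose, would then quantify whether the two distributions are statistically indistinguishable.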
Circularity Check
No circularity: EP-PPO nudging signal and clipping derived independently
Full rationale
The paper derives an EP-compatible PPO output-nudging signal and two-sided ratio clipping mechanism as first-principles constructions to adapt PPO for equilibrium propagation in continuous control. No equations or steps reduce the claimed results to fitted parameters, self-citations, or input data by construction. Performance equivalence to BPTT-PPO is shown via simulation experiments on the A1 quadruped, not by re-deriving the same quantities. The sim-to-real robustness concern is a generalization issue, not a circularity in the derivation chain.
Reference graph
Works this paper leans on
- [1] Guillaume Bellegarda and Auke Ijspeert. CPG-RL: Learning central pattern generators for quadruped locomotion. IEEE Robotics and Automation Letters, 7(4): 12547--12554, 2022.
- [2] Guillaume Bellegarda, Milad Shafiee, and Auke Ijspeert. Visual CPG-RL: Learning central pattern generators for visually-guided quadruped locomotion. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 1420--1427. IEEE, 2024.
- [3] Michael Bloesch, Jan Humplik, Viorica Patraucean, Roland Hafner, Tuomas Haarnoja, Arunkumar Byravan, Noah Yamamoto Siegel, Saran Tunyasuvunakool, Federico Casarini, Nathan Batchelor, Francesco Romano, Stefano Saliceti, Martin Riedmiller, S. M. Ali Eslami, and Nicolas Heess. Towards real robot learning in the wild: A case study in bipedal locomotion. In Al..., 2022.
- [4] Shuxiao Chen, Bike Zhang, Mark W. Mueller, Akshara Rai, and Koushil Sreenath. Learning torque control for quadrupedal locomotion, March 2023.
- [5] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553): 503--507, May 2015. doi:10.1038/nature14422.
- [6] Sten Grillner and Abdeljabbar El Manira. Current principles of motor control, with special reference to vertebrate locomotion. Physiological Reviews, 100(1): 271--320, January 2020. doi:10.1152/physrev.00015.2019.
- [7] Sten Grillner and Alexander Kozlov. The CPGs for limbed locomotion -- facts and fiction. International Journal of Molecular Sciences, 22(11): 5882, May 2021. doi:10.3390/ijms22115882.
- [8] Sehoon Ha, Peng Xu, Zhenyu Tan, Sergey Levine, and Jie Tan. Learning to walk in the real world with minimal human effort. arXiv preprint arXiv:2002.08550, 2020.
- [9] Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker, and Sergey Levine. Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103, 2018.
- [10] Xinyu Han and Mingguo Zhao. Learning quadrupedal high-speed running on uneven terrain. Biomimetics, 9(1): 37, 2024.
- [11] David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Hutter. ANYmal parkour: Learning agile navigation for quadrupedal robots. Science Robotics, 9(88): eadi7566, 2024.
- [12] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26): eaau5872, 2019.
- [13] Auke Jan Ijspeert. Central pattern generators for locomotion control in animals and robots: a review. Neural Networks, 21(4): 642--653, 2008.
- [14] Ole Kiehn. Locomotor circuits in the mammalian spinal cord. Annual Review of Neuroscience, 29(1): 279--306, July 2006. doi:10.1146/annurev.neuro.29.051605.112910.
- [15] Ole Kiehn. Decoding the organization of spinal circuits that control locomotion. Nature Reviews Neuroscience, 17(4): 224--238, April 2016. doi:10.1038/nrn.2016.9.
- [16] Donghyun Kim, Jared Di Carlo, Benjamin Katz, Gerardo Bledt, and Sangbae Kim. Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control. arXiv preprint arXiv:1909.06586, 2019.
- [17] Hiroshi Kimura, Yasuhiro Fukuoka, and Avis H. Cohen. Biologically inspired adaptive walking of a quadruped robot. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 365(1850): 153--170, 2007.
- [18] Yoshimasa Kubo, Eric Chalmers, and Artur Luczak. Combining backpropagation with equilibrium propagation to improve an actor-critic reinforcement learning framework. Frontiers in Computational Neuroscience, 16: 980613, August 2022. doi:10.3389/fncom.2022.980613.
- [19] Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. RMA: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021.
- [20] Axel Laborieux and Friedemann Zenke. Holomorphic equilibrium propagation computes exact gradients through finite size oscillations. Advances in Neural Information Processing Systems, 35: 12950--12963, 2022.
- [21] Axel Laborieux and Friedemann Zenke. Improving equilibrium propagation without weight symmetry through Jacobian homeostasis. arXiv preprint arXiv:2309.02214, 2023.
- [22] Axel Laborieux, Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Julie Grollier, and Damien Querlioz. Scaling equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias. Frontiers in Neuroscience, 15: 633674, 2021.
- [23] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Annalisa Appice, Pedro Pereira Rodrigues, Vítor Santos Costa, Carlos Soares, João Gama, and Alípio Jorge, editors, Machine Learning and Knowledge Discovery in Databases, volume 9284, pages 498--515. Springer International Publishing, Cham, 2015.
- [24] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5(47): eabc5986, October 2020. doi:10.1126/scirobotics.abc5986.
- [25] Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1): 13276, November 2016. doi:10.1038/ncomms13276.
- [26] Poramate Manoonpong, Ulrich Parlitz, and Florentin Wörgötter. Neural control and adaptive neural forward models for insect-like, energy-efficient, and adaptable locomotion of walking machines. Frontiers in Neural Circuits, 7: 12, 2013.
- [27] Gabriel B. Margolis, Ge Yang, Kartik Paigwar, Tao Chen, and Pulkit Agrawal. Rapid locomotion via reinforcement learning. The International Journal of Robotics Research, 43(4): 572--587, 2024.
- [28] Erwann Martin, Maxence Ernoult, Jérémie Laydevant, Shuai Li, Damien Querlioz, Teodora Petrisor, and Julie Grollier. EqSpike: spike-driven equilibrium propagation for neuromorphic implementations. iScience, 24(3), 2021.
- [29] Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62): eabk2822, 2022.
- [30] Wenjuan Ouyang, Haozhen Chi, Jiangnan Pang, Wenyu Liang, and Qinyuan Ren. Adaptive locomotion control of a hexapod robot via bio-inspired learning. Frontiers in Neurorobotics, 15: 627157, 2021.
- [31] Ludovic Righetti and Auke Jan Ijspeert. Pattern generators with sensory feedback for the control of quadruped locomotion. In 2008 IEEE International Conference on Robotics and Automation, pages 819--824. IEEE, 2008.
- [32] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11: 24, 2017.
- [33] Benjamin Scellier, Maxence Ernoult, Jack Kendall, and Suhas Kumar. Energy-based learning algorithms for analog computing: a comparative study. Advances in Neural Information Processing Systems, 36: 52705--52731, 2023.
- [34] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2015.
- [35] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
- [36] Ryosei Seto, Guanda Li, Kyo Kutsuzawa, Dai Owaki, and Mitsuhiro Hayashibe. Two-stage learning of CPG and postural reflex towards quadruped locomotion on uneven terrain with simple reward. IEEE Access, 2025.
- [37] Laura Smith, Ilya Kostrikov, and Sergey Levine. A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. arXiv preprint arXiv:2208.07860, 2022.
- [38] Shura Suzuki, Kosuke Matayoshi, Mitsuhiro Hayashibe, and Dai Owaki. Foot trajectory as a key factor for diverse gait patterns in quadruped robot locomotion. Scientific Reports, 15(1): 1861, January 2025. doi:10.1038/s41598-024-84060-5.
- [39] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018.
- [40] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026--5033. IEEE, 2012. doi:10.1109/IROS.2012.6386109.
- [41] P. J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10): 1550--1560, October 1990. doi:10.1109/5.58337.
- [42] James C. R. Whittington and Rafal Bogacz. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5): 1229--1262, May 2017. doi:10.1162/NECO_a_00949.