Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
Pith reviewed 2026-06-29 17:32 UTC · model grok-4.3
The pith
Stochastic decoupled policy gradient estimates gradients from random perturbations of trajectory rollouts to train visual robot policies end-to-end on one GPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead while still supporting stable end-to-end training of diverse visuomotor policies on visual MuJoCo benchmarks, where it outperforms baseline methods in training time, memory usage, and rewards.
What carries the argument
Stochastic decoupled policy gradient, which computes gradient estimates by applying random perturbations directly to sampled trajectory rollouts rather than requiring full batches of independent environment renders.
If this is right
- Visuomotor policies for dexterous manipulation and challenging locomotion can be trained end-to-end in a few hours on a single consumer GPU.
- Memory and compute overhead drop enough to allow larger batch sizes or longer horizons without additional hardware.
- The same perturbation-based estimation supports effective sim-to-real transfer on physical robot hardware.
- A new suite of realistic visual robotics benchmarks becomes available for standardized evaluation.
Where Pith is reading between the lines
- The reduced environment count could make visual RL experiments practical for labs without large GPU clusters.
- The perturbation approach might generalize to other high-dimensional observation spaces where rendering dominates cost.
- If variance remains controlled, the method could shorten iteration cycles in sim-to-real robotics pipelines.
Load-bearing premise
Random perturbations of trajectory rollouts produce sufficiently low-variance and unbiased policy gradient estimates to support stable end-to-end training of visuomotor policies.
What would settle it
A controlled comparison in which SDPG gradient estimates exhibit variance high enough to cause training divergence or final rewards statistically below those of standard on-policy methods on the same visual MuJoCo tasks.
Figures
read the original abstract
We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation, challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Stochastic Decoupled Policy Gradient (SDPG), an on-policy visual RL algorithm that estimates policy gradients through random perturbations of trajectory rollouts. It claims this yields orders-of-magnitude reductions in batch-rendered environments and memory, enabling end-to-end training of diverse visuomotor policies in a few hours on a single RTX 4080 GPU. Experiments on visual MuJoCo benchmarks report consistent gains in training time, memory, and rewards over baselines; the paper also contributes new realistic visual robotics benchmarks and demonstrates sim-to-real transfer.
Significance. If the perturbation-based estimator is shown to be unbiased with controlled variance, the approach could meaningfully lower the computational barrier for visual RL, particularly for high-dimensional visuomotor tasks. The introduction of new benchmarks and hardware validation would also provide reusable assets for the community.
major comments (3)
- [§3] §3 (SDPG estimator definition): no derivation or expectation calculation is supplied showing that the random-perturbation gradient estimator is unbiased (i.e., E[ĝ] = ∇J(π)). The central efficiency and performance claims rest on this property; without it the reported gains cannot be attributed to a correct on-policy update.
- [§3] §3 or Appendix A (variance analysis): no bound or empirical variance analysis is given for the perturbation estimator. The abstract asserts “low-variance” estimates sufficient for stable end-to-end training, yet the manuscript supplies neither analytic variance nor ablation on perturbation scale.
- [§5] §5 (experimental results): the reported outperformance lacks statistical tests, number of seeds, or confidence intervals. Table or figure captions do not indicate whether differences are significant, undermining the “consistently outperforms” claim.
minor comments (2)
- [§3] Notation for the perturbation distribution and the decoupled update rule is introduced without a compact equation reference; readers must reconstruct the estimator from prose.
- [§4] The new benchmark suite is described at a high level; missing are precise task definitions, observation spaces, and success metrics that would allow direct replication.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the theoretical justification and experimental reporting of SDPG.
read point-by-point responses
-
Referee: [§3] §3 (SDPG estimator definition): no derivation or expectation calculation is supplied showing that the random-perturbation gradient estimator is unbiased (i.e., E[ĝ] = ∇J(π)). The central efficiency and performance claims rest on this property; without it the reported gains cannot be attributed to a correct on-policy update.
Authors: We agree that an explicit derivation is required. The revised manuscript will include a full derivation in Section 3 (or a new Appendix) proving that the expectation of the stochastic perturbation estimator equals the true policy gradient ∇J(π), confirming that SDPG yields unbiased on-policy updates. revision: yes
-
Referee: [§3] §3 or Appendix A (variance analysis): no bound or empirical variance analysis is given for the perturbation estimator. The abstract asserts “low-variance” estimates sufficient for stable end-to-end training, yet the manuscript supplies neither analytic variance nor ablation on perturbation scale.
Authors: We acknowledge the missing analysis. The revision will add both an analytic variance bound (parameterized by perturbation scale) in Appendix A and empirical ablations of the scale hyperparameter on the visual MuJoCo tasks to support the low-variance claim and training stability. revision: yes
-
Referee: [§5] §5 (experimental results): the reported outperformance lacks statistical tests, number of seeds, or confidence intervals. Table or figure captions do not indicate whether differences are significant, undermining the “consistently outperforms” claim.
Authors: We agree that statistical rigor is needed. The revised paper will report results over a minimum of five random seeds, include standard deviations or confidence intervals in all tables and figures, and apply statistical tests (such as t-tests) with significance indicated in captions. revision: yes
Circularity Check
No circularity: method claims rest on unshown estimator without self-referential reduction
full rationale
The abstract and reader's summary introduce SDPG via random perturbations of trajectory rollouts but supply no equations, variance derivations, or parameter-fitting steps that could reduce by construction to the method's own inputs. No self-citations, uniqueness theorems, or ansatzes are referenced in the provided text, and the central efficiency claim does not rename or refit a known result. The derivation chain therefore remains self-contained against external benchmarks; absence of supporting math is a correctness concern, not evidence of circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Mastering visual continuous control: Improved data-augmented reinforcement learning
D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Im- proved data-augmented reinforcement learning, 2021. URLhttps://arxiv.org/abs/ 2107.09645
-
[2]
Hansen, H
N. Hansen, H. Su, and X. Wang. Td-mpc2: Scalable, robust world models for continuous control, 2024
2024
-
[3]
Mastering Diverse Domains through World Models
D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models, 2024. URLhttps://arxiv.org/abs/2301.04104
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
S. Tao, F. Xiang, A. Shukla, Y . Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y . Liu, T. kai Chan, Y . Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V . N. Rajesh, Y . W. Choi, Y .-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai, 2025. URLhttps://arxiv.org/ab...
2025
- [5]
-
[6]
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
NVIDIA, :, M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Zurbr ¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G....
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [7]
-
[8]
J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning quadrupedal locomo- tion over challenging terrain.Science Robotics, 5(47), Oct. 2020. ISSN 2470-9476. doi:10. 1126/scirobotics.abc5986. URLhttp://dx.doi.org/10.1126/scirobotics.abc5986
-
[9]
T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter. Learning robust per- ceptive locomotion for quadrupedal robots in the wild.Science Robotics, 7(62), Jan. 2022. ISSN 2470-9476. doi:10.1126/scirobotics.abk2822. URLhttp://dx.doi.org/10.1126/ scirobotics.abk2822
-
[10]
A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision, 2022. URLhttps://arxiv.org/abs/2211.07638
- [11]
- [12]
-
[13]
Levine and P
S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. InNeural Information Processing Systems (NIPS), 2014
2014
- [14]
-
[15]
H. You, Y . Liu, and I. Abraham. Accelerating visual-policy learning through parallel differen- tiable simulation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=4frj038M6W
2025
-
[16]
J. Pan, J. Xing, R. Reiter, Y . Zhai, E. Aljalbout, and D. Scaramuzza. Learning on the fly: Rapid policy adaptation via differentiable simulation, 2026. URLhttps://arxiv.org/abs/2508. 21065
2026
-
[17]
Y . Zhang, Y . Hu, Y . Song, D. Zou, and W. Lin. Learning vision-based agile flight via differentiable physics.Nature Machine Intelligence, 7(6):954–966, June 2025. ISSN 2522-5839. doi:10.1038/s42256-025-01048-0. URLhttp://dx.doi.org/10.1038/ s42256-025-01048-0
-
[18]
C. Schwarke, V . Klemm, J. Bagajo, J.-P. Sleiman, I. Georgiev, J. Tordesillas, and M. Hutter. Learning deployable locomotion control via differentiable simulation, 2025. URLhttps: //arxiv.org/abs/2404.02887
- [19]
-
[20]
J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V . Tsounis, V . Koltun, and M. Hutter. Learn- ing agile and dynamic motor skills for legged robots.Science Robotics, 4(26), Jan. 2019. ISSN 2470-9476. doi:10.1126/scirobotics.aau5872. URLhttp://dx.doi.org/10.1126/ scirobotics.aau5872
- [21]
-
[22]
Nvidia isaac sim.https://developer.nvidia.com/isaac-sim, 2023
NVIDIA Corporation. Nvidia isaac sim.https://developer.nvidia.com/isaac-sim, 2023
2023
-
[23]
Mujoco warp (mjwarp): Gpu-accelerated mujoco via nvidia warp.https://github.com/google-deepmind/mujoco_warp, 2025
Google DeepMind and NVIDIA. Mujoco warp (mjwarp): Gpu-accelerated mujoco via nvidia warp.https://github.com/google-deepmind/mujoco_warp, 2025. Software, beta re- lease
2025
-
[24]
G. Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. URLhttps://github.com/Genesis-Embodied-AI/Genesis
2024
-
[25]
L. Metz, C. D. Freeman, S. S. Schoenholz, and T. Kachman. Gradients are not all you need,
- [26]
- [27]
-
[28]
E. Xing, V . Luk, and J. Oh. Stabilizing reinforcement learning in differentiable multiphysics simulation. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=DRiLWb8bJg. 11
2025
-
[29]
Levine and V
S. Levine and V . Koltun. Guided policy search. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 ofPro- ceedings of Machine Learning Research, pages 1–9, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR
2013
-
[30]
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration, 2016. URLhttps://arxiv.org/abs/1603.00748
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[31]
Path Integral Guided Policy Search
Y . Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine. Path integral guided policy search, 2018. URLhttps://arxiv.org/abs/1610.00529
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [32]
-
[33]
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8(3):229–256, 1992
1992
-
[34]
V . Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning, 2016. URL https://arxiv.org/abs/1602.01783
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [35]
-
[36]
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
A. Wagenmaker, M. Nakamoto, Y . Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
R. S. Sutton, A. G. Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
1998
-
[38]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[39]
V . Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Ried- miller. Playing atari with deep reinforcement learning, 2013. URLhttps://arxiv.org/ abs/1312.5602
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[40]
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y . Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning, 2019. URLhttps://arxiv.org/ abs/1509.02971
work page internal anchor Pith review Pith/arXiv arXiv 2019
- [41]
-
[42]
Makoviichuk and V
D. Makoviichuk and V . Makoviychuk. rl-games: A high-performance framework for rein- forcement learning.https://github.com/Denys88/rl_games, May 2021
2021
-
[43]
Rsl-rl: A learning library for robotics research,
C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025
-
[44]
O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Cou- prie, J. Mairal, H. J ´egou, P. Labatut, and P. Bojanowski. Dinov3, 2025. URLhttps: //arxiv.org/...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Learning Transferable Visual Models From Natural Language Supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[46]
V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep rein- forcement learning.nature, 518(7540):529–533, 2015
2015
-
[47]
Asymmetric Actor Critic for Image-Based Robot Learning
L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel. Asymmetric actor critic for image-based robot learning, 2017. URLhttps://arxiv.org/abs/1710.06542
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[48]
A. Srinivas, M. Laskin, and P. Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning, 2020. URLhttps://arxiv.org/abs/2004.04136
-
[49]
Yarats, I
D. Yarats, I. Kostrikov, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. InInternational Conference on Learning Representations,
-
[50]
URLhttps://openreview.net/forum?id=GY6-6sTvGaf
-
[51]
Hansen, X
N. Hansen, X. Wang, and H. Su. Temporal difference learning for model predictive control,
- [52]
-
[53]
Dream to Control: Learning Behaviors by Latent Imagination
D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination, 2020. URLhttps://arxiv.org/abs/1912.01603
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[54]
Hafner, T
D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering atari with discrete world models,
-
[55]
URLhttps://arxiv.org/abs/2010.02193
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[56]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas. V-jepa 2: Self-super...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Q. Garrido, M. Assran, N. Ballas, A. Bardes, L. Najman, and Y . LeCun. Learning and leverag- ing world models in visual representation learning, 2024. URLhttps://arxiv.org/abs/ 2403.00504
-
[58]
Revisiting Feature Prediction for Learning Visual Representations from Video
A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
A. Huning. Evolutionsstrategie. optimierung technischer systeme nach prinzipien der biolo- gischen evolution, 1976
1976
-
[60]
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017. URLhttps://arxiv.org/abs/1703.03864
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[61]
Simple random search provides a competitive approach to reinforcement learning
H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning, 2018. URLhttps://arxiv.org/abs/1803.07055
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [62]
-
[63]
Williams, P
G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Aggressive driving with model predictive path integral control. In2016 IEEE international conference on robotics and automation (ICRA), pages 1433–1440. IEEE, 2016. 13
2016
-
[64]
R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms.Connection Science, 3(3):241–268, 1991
1991
-
[65]
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. URLhttps://arxiv.org/abs/ 1801.01290
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[66]
Soft Actor-Critic Algorithms and Applications
T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V . Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications, 2019. URLhttps: //arxiv.org/abs/1812.05905
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[67]
Rudin, D
N. Rudin, D. Hoeller, P. Reist, and M. Hutter. Learning to walk in minutes using mas- sively parallel deep reinforcement learning. In A. Faust, D. Hsu, and G. Neumann, ed- itors,Proceedings of the 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Research, pages 91–100. PMLR, 08–11 Nov 2022. URLhttps: //proceedings.mlr.press/v...
2022
-
[68]
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State. Isaac gym: High performance gpu-based physics simula- tion for robot learning, 2021. URLhttps://arxiv.org/abs/2108.10470. 14 A Related work A.1 Visual-RL Early works such as [38, 45] demonstrated that RL can operate directly on...
work page internal anchor Pith review Pith/arXiv arXiv 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.