ANO: A Principled Approach to Robust Policy Optimization
Jonathan Ho and Stefano Ermon
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, and shows better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.
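The summary does not spell out ANO's actual estimator, but the contrast it draws can be sketched. Below, `ppo_grad_weight` reproduces the gradient multiplier implied by PPO's clipped surrogate (a hard cutoff outside the trust region), while `redescending_grad_weight` is a stand-in for a smooth redescending weight that tapers to zero as the importance ratio drifts from 1. The Gaussian taper and the sigma value are our assumptions for illustration, not ANO's derivation.

```python
import math

def ppo_grad_weight(ratio, adv, eps=0.2):
    """Gradient multiplier implied by PPO's clipped surrogate.

    The policy-gradient term is zeroed as soon as the importance ratio
    leaves [1 - eps, 1 + eps] in the direction the advantage pushes it,
    which is the hard cutoff ANO is said to replace.
    """
    if adv >= 0:
        return ratio if ratio <= 1 + eps else 0.0
    return ratio if ratio >= 1 - eps else 0.0

def redescending_grad_weight(ratio, sigma=0.5):
    """Hypothetical smooth redescending multiplier (illustration only).

    Instead of a hard cutoff, the weight decays smoothly toward zero as
    the ratio drifts from 1, so far-off-policy samples are down-weighted
    rather than abruptly ignored. The Gaussian form is our assumption.
    """
    return ratio * math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))
```

Near the on-policy point (ratio = 1) both behave like the vanilla policy gradient; the difference is only in how they kill the gradient for far-off-policy samples: abruptly versus smoothly.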
3 Pith papers cite this work.
representative citing papers
-
Remote Action Generation: Remote Control with Minimal Communication
GRASP reduces communication in remote control by 12-fold on average (50-fold for continuous actions) by having actors generate actions via guided sampling and local policy learning instead of receiving full actions or rewards.
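GRASP's actual protocol is not detailed in this summary; the sketch below only illustrates the general idea of generating actions locally and transmitting guidance instead of full actions. The candidate count, the 6-D action space, and the pick-an-index guidance scheme are all our assumptions, not GRASP's method.

```python
import numpy as np

def sample_candidates(rng, k=16, action_dim=6):
    """Actor side: draw k candidate actions from a local policy.
    (A Gaussian stands in for the learned local policy here.)"""
    return rng.normal(size=(k, action_dim))

def guidance_index(candidates, intended_action):
    """Remote side: transmit only the index of the candidate closest to
    the intended action, instead of the full continuous action vector."""
    dists = np.linalg.norm(candidates - intended_action, axis=1)
    return int(np.argmin(dists))

# Per-step payload: log2(16) = 4 bits for an index vs 6 * 32 = 192 bits
# for a raw float32 action -- a reduction of roughly the order the
# summary cites for continuous actions (these numbers are illustrative).
```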
-
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.
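The advantage-weighted update the AWR summary describes is a weighted maximum-likelihood step: regress onto observed actions with weights exp(A/beta). That exponentiated-advantage weighting is the published objective; the weight cap below is a common practical stabilization choice, and the specific values are ours.

```python
import numpy as np

def awr_weights(advantages, beta=1.0, w_max=10.0):
    """Exponentiated advantages exp(A / beta), capped for numerical
    stability (the cap value is a practical choice, not fundamental)."""
    a = np.asarray(advantages, dtype=float)
    return np.minimum(np.exp(a / beta), w_max)

def awr_policy_loss(log_probs, advantages, beta=1.0):
    """Advantage-weighted negative log-likelihood: plain supervised
    regression onto actions, up-weighting better-than-average ones."""
    w = awr_weights(advantages, beta)
    return float(-np.mean(w * np.asarray(log_probs, dtype=float)))
```

Because the loss is just a weighted log-likelihood, it can be minimized with any supervised-learning machinery, which is what makes the method simple and scalable off-policy.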