Jonathan Ho and Stefano Ermon

arXiv:1707.02286 · 2017

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ANO: A Principled Approach to Robust Policy Optimization

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.

Remote Action Generation: Remote Control with Minimal Communication

cs.IT · 2026-05-03 · unverdicted · novelty 6.0

GRASP reduces communication in remote control by 12-fold on average (50-fold for continuous actions) by having actors generate actions via guided sampling and local policy learning instead of receiving full actions or rewards.

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

cs.LG · 2019-10-01 · conditional · novelty 6.0

AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.

citing papers explorer

Showing 3 of 3 citing papers.

ANO: A Principled Approach to Robust Policy Optimization · cs.AI · 2026-05-04 · unverdicted · none · ref 9
Remote Action Generation: Remote Control with Minimal Communication · cs.IT · 2026-05-03 · unverdicted · none · ref 7
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning · cs.LG · 2019-10-01 · conditional · none · ref 5
