Provably convergent policy optimization via metric-aware trust region methods. arXiv preprint arXiv:2306.14133,
1 Pith paper cites this work.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Ratio-Variance Regularized Policy Optimization
R²VPO uses ratio-variance regularization as a distributional soft brake on policy updates, claiming better performance than PPO on math reasoning and robotic control without hard clipping.
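
To make the contrast between hard clipping and a variance-based soft brake concrete, here is a minimal sketch. It is not the R²VPO objective, which this page does not state. It assumes the penalty is simply the sample variance of the importance ratio, weighted by a hypothetical coefficient `beta`. The function names, `beta`, and `eps` are illustrative assumptions.

```python
import math

def ratios(logp_new, logp_old):
    # Importance ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def ppo_clip_objective(r, adv, eps=0.2):
    # Standard PPO clipped surrogate: each ratio is hard-clipped to [1 - eps, 1 + eps].
    terms = []
    for ri, a in zip(r, adv):
        clipped = max(1.0 - eps, min(ri, 1.0 + eps))
        terms.append(min(ri * a, clipped * a))
    return sum(terms) / len(terms)

def ratio_variance_objective(r, adv, beta=1.0):
    # ASSUMPTION: unclipped surrogate minus beta times the batch variance of the ratio.
    # The penalty acts on the whole ratio distribution rather than on each sample.
    n = len(r)
    surrogate = sum(ri * a for ri, a in zip(r, adv)) / n
    mean_r = sum(r) / n
    var_r = sum((ri - mean_r) ** 2 for ri in r) / n
    return surrogate - beta * var_r

# Hypothetical batch of three samples.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.7, -2.4, -0.5]
adv = [1.0, -1.0, 0.5]

r = ratios(logp_new, logp_old)
print(ppo_clip_objective(r, adv))
print(ratio_variance_objective(r, adv, beta=1.0))
```

When the new and old policies coincide, every ratio is 1, the variance term vanishes, and both objectives reduce to the mean advantage. As the ratios spread out, the variance penalty grows smoothly, whereas clipping caps each sample's contribution abruptly at the trust-region boundary.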