4 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (4)
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers: 4
Citing papers explorer
- Delightful Gradients Accelerate Corner Escape (Journal of Machine Learning Research)
  Delightful Policy Gradient removes the exponential corner trapping that slows softmax policy optimization in bandits and tabular MDPs, achieving logarithmic escape times and a global O(1/t) convergence rate. (A sketch of the vanilla softmax baseline it improves on follows this list.)
- Policy Gradient Methods for Non-Markovian Reinforcement Learning
  Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes, jointly optimizing the agent's state dynamics and its control policy. (A generic sketch of that joint parameterization follows this list.)
- Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning
  MARS replaces the additive clipping and soft penalties of existing multi-agent trust-region methods with a symmetric geometric barrier on the probability ratio, matching or exceeding MAPPO and MASPO across 47 tasks in eight environments. (An illustrative clipping-versus-barrier sketch follows this list.)
- Actor-Critic Algorithm for Dynamic Expectile and CVaR
  Constructs a model-free, off-policy actor-critic for dynamic expectile and CVaR objectives, pairing a surrogate policy gradient that avoids transition perturbation with elicitability-based value learning, and reports empirical gains in risk-averse domains. (An expectile-regression sketch of the value-learning piece follows this list.)
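For the corner-escape entry: a minimal sketch of vanilla softmax policy gradient on a two-armed bandit, initialized near the suboptimal deterministic corner. This is the baseline setting the paper targets, not the Delightful Policy Gradient algorithm itself; the rewards, step size, and initialization are illustrative choices.

```python
# Vanilla softmax policy gradient on a 2-armed bandit (illustrative
# baseline, not the paper's algorithm). Arm 0 is optimal, but the
# logits start deep in the suboptimal corner where pi(arm 1) ~ 1.
import numpy as np

rewards = np.array([1.0, 0.0])   # expected reward per arm; arm 0 is best
theta = np.array([-8.0, 0.0])    # logits: policy starts near arm 1's corner
eta = 0.4                        # constant step size (illustrative)

for t in range(1, 5001):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                        # softmax policy over arms
    grad = pi * (rewards - pi @ rewards)  # exact gradient: pi_a * (r_a - E[r])
    theta += eta * grad
    if t in (1, 100, 1000, 5000):
        # pi(arm 0) barely moves: the exponentially slow corner escape
        # that the paper's method is claimed to reduce to logarithmic time
        print(f"t={t:5d}  pi(optimal arm)={pi[0]:.5f}")
```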
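For the ASMPG entry: a generic sketch of the joint parameterization the summary describes, a learned agent-state recurrence feeding a softmax control policy. The linear-tanh forms, dimensions, and names are my assumptions; this shows the interface being jointly optimized, not the ASMPG update rule.

```python
# Generic agent-state + policy parameterization (interface sketch only;
# the forms and names are assumptions, not the ASMPG algorithm).
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM, N_ACTIONS = 4, 3, 2
W_state = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM + OBS_DIM + N_ACTIONS))
W_policy = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))

def agent_state_step(s, o, a_onehot):
    # learned recurrence s_{t+1} = f_theta(s_t, o_t, a_t): the agent's
    # internal memory, itself trainable in the non-Markovian setting
    return np.tanh(W_state @ np.concatenate([s, o, a_onehot]))

def policy(s):
    # control policy pi_phi(a | agent state): softmax over action logits
    logits = W_policy @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

# one rollout step; a return-based loss would differentiate through BOTH
# W_state (state dynamics) and W_policy (control), the joint optimization
# the summary refers to
s = np.zeros(STATE_DIM)
o = rng.normal(size=OBS_DIM)
pi = policy(s)
a = rng.choice(N_ACTIONS, p=pi)
s = agent_state_step(s, o, np.eye(N_ACTIONS)[a])
```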
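For the MARS entry: an illustrative contrast between standard PPO-style ratio clipping and one plausible reading of a "symmetric geometric barrier", here a log-barrier that is finite only inside a multiplicative interval (1/kappa, kappa) around ratio 1 and symmetric under rho ↔ 1/rho. The barrier's exact form and the value of kappa are my assumptions, not the paper's objective.

```python
# PPO-style additive clipping vs. one hypothetical "symmetric geometric
# barrier" on the ratio rho = pi_new / pi_old. Barrier shape and kappa
# are assumptions for illustration, not the MARS objective.
import numpy as np

def clipped_surrogate(rho, adv, eps=0.2):
    # PPO: flat (zero gradient) once rho leaves the additive band 1 +/- eps
    return np.minimum(rho * adv, np.clip(rho, 1.0 - eps, 1.0 + eps) * adv)

def barrier_surrogate(rho, adv, kappa=1.5):
    # hypothetical log-barrier: finite inside the multiplicative band
    # (1/kappa, kappa), diverging at its edges; symmetric in log(rho)
    inside = (rho > 1.0 / kappa) & (rho < kappa)
    with np.errstate(invalid="ignore", divide="ignore"):
        barrier = np.where(
            inside,
            np.log(kappa * rho - 1.0) + np.log(kappa / rho - 1.0),
            -np.inf,
        )
    return rho * adv + barrier

rho = np.linspace(0.5, 2.0, 7)
print(clipped_surrogate(rho, adv=1.0))  # flat outside [0.8, 1.2]
print(barrier_surrogate(rho, adv=1.0))  # -inf outside (2/3, 3/2)
```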
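For the expectile/CVaR entry: a sketch of elicitability-based value learning via expectile regression, the standard elicitable loss family for expectiles. The tau level and synthetic return sample are placeholders, and this covers only the value-learning piece, not the paper's full dynamic-risk actor-critic.

```python
# Expectile regression: the elicitable loss behind expectile value
# learning (value-learning piece only; not the paper's actor-critic).
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    # asymmetric squared error: underestimates (u > 0) weighted by tau,
    # overestimates weighted by 1 - tau; tau = 0.5 recovers plain MSE
    u = target - pred
    return np.mean(np.where(u >= 0, tau, 1.0 - tau) * u ** 2)

# gradient descent on a scalar value estimate v: the unique minimizer
# of the loss is the tau-expectile of the return distribution
returns = np.random.default_rng(1).normal(loc=0.0, scale=1.0, size=10_000)
v, lr = 0.0, 0.05
for _ in range(2_000):
    u = returns - v
    v += lr * np.mean(np.where(u >= 0, 0.9, 0.1) * u)  # -(1/2) d loss / dv
print(f"0.9-expectile estimate: {v:.3f}")
```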