Introduces priced face-crossing via normal-fan geometry on occupancy polytopes to decompose dynamic regret into intrinsic motion cost plus within-face error in non-stationary adversarial MDPs.
Mirror descent policy optimization
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles: method 3
polarities: use method 3
representative citing papers
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Establishes O(N^{-1/2}) convergence for simultaneous MDA and O(N^{-2/3}) for alternating MDA to mixed Nash equilibria in mean-field convex-concave min-max problems via dual-space Bregman analysis.
The paper introduces Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO), which use resets to enable more precise credit assignment in RL for language model reasoning; SRPO outperforms GRPO and RRPO across benchmarks.
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
A new posterior sampling algorithm for (ε, δ)-PAC policy identification in tabular MDPs achieves asymptotic optimality in sample complexity and posterior contraction rate with O(S²AH) runtime per episode.
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
citing papers explorer
- Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
- Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
A new posterior sampling algorithm for (ε, δ)-PAC policy identification in tabular MDPs achieves asymptotic optimality in sample complexity and posterior contraction rate with O(S²AH) runtime per episode.
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.