RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 3polarities
use method 3representative citing papers
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Establishes O(N^{-1/2}) convergence for simultaneous MDA and O(N^{-2/3}) for alternating MDA to mixed Nash equilibria in mean-field convex-concave min-max problems via dual-space Bregman analysis.
The paper introduces Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO) that use resets to enable more precise credit assignment in RL for language model reasoning, with SRPO outperforming GRPO and RRPO across benchmarks.
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
A new posterior sampling algorithm for (ε, δ)-PAC policy identification in tabular MDPs achieves asymptotic optimality in sample complexity and posterior contraction rate with O(S²AH) runtime per episode.
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
citing papers explorer
-
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Mirror Descent-Ascent for mean-field min-max problems
Establishes O(N^{-1/2}) convergence for simultaneous MDA and O(N^{-2/3}) for alternating MDA to mixed Nash equilibria in mean-field convex-concave min-max problems via dual-space Bregman analysis.
-
Credit Assignment with Resets in Language Model Reasoning
The paper introduces Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO) that use resets to enable more precise credit assignment in RL for language model reasoning, with SRPO outperforming GRPO and RRPO across benchmarks.
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
-
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces
A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.
-
Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
A new posterior sampling algorithm for (ε, δ)-PAC policy identification in tabular MDPs achieves asymptotic optimality in sample complexity and posterior contraction rate with O(S²AH) runtime per episode.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.