BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
Mastering the game of go without human knowledge.Nature, 550(7676):354–359
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
A shallow dense Transformer achieves uniform epsilon-approximation of alpha-Holder functions with O(epsilon^{-d/alpha}) parameters and near-minimax generalization error O(n^{-2alpha/(2alpha+d)} log n).
A new method decomposes property differences between weakly related molecules into minimal chemical edits to train a directional evaluator that guides multi-step optimization with less oracle querying.
citing papers explorer
-
Bounded Ratio Reinforcement Learning
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
-
Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
A shallow dense Transformer achieves uniform epsilon-approximation of alpha-Holder functions with O(epsilon^{-d/alpha}) parameters and near-minimax generalization error O(n^{-2alpha/(2alpha+d)} log n).
-
From Single-Step Edit Response to Multi-Step Molecular Optimization
A new method decomposes property differences between weakly related molecules into minimal chemical edits to train a directional evaluator that guides multi-step optimization with less oracle querying.