Recognition: 1 theorem link
High-Dimensional Continuous Control Using Generalized Advantage Estimation
Pith reviewed 2026-05-11 04:15 UTC · model grok-4.3
The pith
Generalized advantage estimation reduces variance in policy gradients for high-dimensional continuous control by exponentially weighting temporal difference residuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques.
What carries the argument
The generalized advantage estimator: an exponentially-weighted sum of temporal difference residuals analogous to TD(lambda), which trades bias for lower variance in advantage estimates used by policy gradients.
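As a concrete sketch of that estimator (assuming a finite trajectory with a bootstrap value for the state after the last step; the function name and default parameters are illustrative, not taken from the paper), the exponentially weighted sum of TD residuals can be computed in a single backward pass:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted sum of TD residuals (a GAE-style sketch).

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (length T+1; values[-1] bootstraps
             the value of the state reached after the final step)
    Returns [A_0, ..., A_{T-1}] where
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = sum over l >= 0 of (gamma * lam)**l * delta_{t+l}
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running  # accumulate weighted residuals
        advantages[t] = running
    return advantages
```

With lam=0 this reduces to the one-step TD residual (more bias, least variance); with lam=1 it becomes the discounted return minus the value baseline (least bias, most variance).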
If this is right
- Neural network policies map directly from raw kinematics to joint torques without hand-crafted representations.
- Trust region optimization stabilizes improvement for both policy and value functions despite nonstationary incoming data.
- Model-free learning succeeds on running gaits for simulated bipeds and quadrupeds plus standing-up tasks.
- The amount of simulated experience needed corresponds to 1-2 weeks of real time for the biped tasks.
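The trust-region idea in the bullets above can be illustrated with the closed-form KL divergence between scalar Gaussian action distributions. This accept/reject check is a deliberate simplification (the paper's TRPO machinery solves a KL-constrained optimization rather than filtering candidate updates), and the names here are hypothetical:

```python
import math

def gaussian_kl(mu0, sigma0, mu1, sigma1):
    # KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ) for 1-D Gaussians
    return (math.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2.0 * sigma1**2)
            - 0.5)

def within_trust_region(old_policy, new_policy, delta=0.01):
    # Accept a candidate update only if the new action distribution
    # stays within KL radius delta of the old one.
    return gaussian_kl(*old_policy, *new_policy) <= delta
```

The point of the constraint is the same in the full method: successive policies must stay close in distribution space, which stabilizes learning under nonstationary data.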
Where Pith is reading between the lines
- GAE may apply to other high-variance policy optimization settings such as robotic manipulation or game playing with continuous actions.
- The bias-variance tradeoff in the estimator could be tuned per-task to optimize sample efficiency beyond the fixed lambda used here.
- Success in simulation raises the question of whether the same direct mapping from kinematics to torques would transfer to physical robots, though sim-to-real gaps are outside the paper's scope.
Load-bearing premise
A neural network value function approximator can be trained sufficiently accurately to deliver useful advantage estimates without introducing bias that negates the variance reduction.
What would settle it
If learning the 3D locomotion tasks required sample counts comparable to or higher than those of high-variance Monte Carlo policy gradients, or failed to produce stable gaits, this would show that the variance reduction is not effective in practice.
read the original abstract
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Generalized Advantage Estimation (GAE), an exponentially-weighted estimator of the advantage function (analogous to TD(λ)) derived from standard returns and value functions, to reduce variance in policy gradient estimates at the cost of bias. It combines GAE with trust-region optimization applied to both policy and value function neural networks for stable learning. Empirical results demonstrate success on challenging 3D locomotion tasks, including learning running gaits for bipedal and quadrupedal robots and standing up from a lying position, using model-free policies that map raw kinematics directly to joint torques, with simulated experience equivalent to 1-2 weeks of real time.
Significance. If the results hold, this provides a practical method for high-dimensional continuous control with neural network policies in model-free RL, addressing variance and non-stationarity issues. Strengths include the first-principles derivation of GAE from RL quantities (returns, value functions) independent of the final performance metric, the combination with trust-region constraints, and the demonstration of complex behaviors without hand-crafted representations. The work advances empirical RL for robotics-like tasks.
major comments (1)
- [Experiments] Experiments section: The central claim is that GAE reduces policy-gradient variance enough for learning on 3D locomotion while the bias from the neural-network value function approximator remains tolerable. However, the manuscript provides no direct measurement of advantage-estimate bias or variance on the learned policies, nor an ablation isolating value-function accuracy from the trust-region updates. This leaves unaddressed whether approximation error in the value function negates the variance reduction.
minor comments (2)
- [Abstract] The claim that simulated experience corresponds to '1-2 weeks of real time' should be supported by the exact number of timesteps or episodes in the main text or a table for reproducibility.
- [Method] The GAE(λ) estimator would benefit from an explicit equation and notation definition in the early sections before the empirical results.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. The positive assessment of GAE combined with trust-region optimization for high-dimensional continuous control is appreciated. We address the single major comment below.
read point-by-point responses
-
Referee: Experiments section: The central claim is that GAE reduces policy-gradient variance enough for learning on 3D locomotion while the bias from the neural-network value function approximator remains tolerable. However, the manuscript provides no direct measurement of advantage-estimate bias or variance on the learned policies, nor an ablation isolating value-function accuracy from the trust-region updates. This leaves unaddressed whether approximation error in the value function negates the variance reduction.
Authors: We agree that the manuscript does not include direct empirical measurements of bias or variance for the advantage estimates under the learned policies, nor an explicit ablation separating value-function approximation quality from the trust-region mechanism. Computing ground-truth advantages is intractable for these tasks without an optimal value function. Our defense of the central claim rests on the observed outcomes: the algorithm learns stable running gaits and stand-up behaviors on 3D bipeds and quadrupeds from raw kinematics, using only model-free experience equivalent to 1-2 weeks of real time. Such complex, high-dimensional policies would be unlikely to emerge if value-function bias dominated or if variance reduction were ineffective. The trust-region updates on both policy and value networks are presented as a joint mechanism for stability rather than isolated components. We will add a clarifying paragraph in the discussion section noting the reliance on end-to-end empirical success and the practical difficulty of direct bias/variance diagnostics in this setting. revision: partial
Circularity Check
GAE derivation is self-contained from standard RL definitions
full rationale
The paper derives the exponentially-weighted advantage estimator directly from the definitions of the advantage function A_t = Q_t - V_t and the TD residual delta_t = r_t + gamma V(s_{t+1}) - V(s_t), yielding the standard GAE(lambda) sum without any reduction to fitted parameters, self-citations, or input data by construction. Trust-region policy optimization is referenced separately and does not enter the estimator derivation. No quoted step equates a claimed prediction or result to its own inputs; the method remains falsifiable via external benchmarks on variance reduction and bias in continuous control.
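Written out to match the definitions quoted above, the residual, the estimator, and its two limiting cases are:

```latex
\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}
  = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}^{V}

\hat{A}_t^{\mathrm{GAE}(\gamma,0)} = \delta_t^{V},
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,1)}
  = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t)
```

The λ = 0 case is the one-step actor-critic estimate (most bias, least variance); λ = 1 recovers the discounted return minus a baseline (unbiased given the true value function, most variance).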
Axiom & Free-Parameter Ledger
free parameters (2)
- lambda (controls the bias-variance trade-off of the advantage estimator)
- gamma (discount factor, which also introduces bias in exchange for variance reduction)
axioms (1)
- domain assumption The environment is a Markov decision process with stationary dynamics.
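A rough intuition for the lambda parameter in the ledger: the geometric weights (gamma * lambda)**l sum to 1 / (1 - gamma * lambda), which can be read as an effective horizon over which TD residuals contribute to each advantage estimate. A minimal illustration (the function name is ours, not the paper's):

```python
def effective_horizon(gamma, lam):
    # Sum of the geometric weight series (gamma * lam)**l for l = 0, 1, 2, ...
    # Valid only when gamma * lam < 1.
    return 1.0 / (1.0 - gamma * lam)
```

At gamma = 0.99 and lambda = 0.95 this is about 17 steps; at lambda = 0 it collapses to a single step.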
Forward citations
Cited by 60 Pith papers
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
-
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
-
Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization
An adaptive smooth Tchebycheff controller for multi-objective RL lets agents reach non-convex Pareto regions in robotic tasks while avoiding the instability of static non-linear scalarizations.
-
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning
ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
Controllability in preference-conditioned multi-objective reinforcement learning
Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.
-
Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems
OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-...
-
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Learning Visual Feature-Based World Models via Residual Latent Action
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
-
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...
-
Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks
Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.
-
Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting
LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.
-
Financial Market as a Self-Organized Ecosystem: Simulation via Learning with Heterogeneous Preferences
Multi-agent reinforcement learning with heterogeneous preferences leads to emergent role specialization whose interactions produce fat-tailed returns and volatility clustering, offering a computational realization of ...
-
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
-
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
-
Bounded Ratio Reinforcement Learning
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
-
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
-
Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...
-
Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies
CayleyTopo uses reinforcement learning to optimize Cayley graph generators for lower diameter, yielding faster and more resilient information flow in multi-agent systems than hand-crafted sparse topologies.
-
Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters
A hybrid neural policy operating in impulse space enables physics-based characters to track exaggerated, dynamically infeasible motions that standard DRL methods cannot stabilize.
-
A semicontinuous relaxation of Saito's criterion and freeness as angular minimization
A new functional S vanishes precisely on free line arrangements and enables discovery of verified free examples for every admissible exponent pair with up to 20 lines.
-
Mastering Atari with Discrete World Models
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
-
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
-
Concrete Problems in AI Safety
The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.
-
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
-
What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models
PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...
-
Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion
Explicit conditioning of a PPO policy on interpretable stair parameters (height, depth, yaw) yields improved generalization to unseen stairs and reliable real-world traversal on the Unitree G1, including 33 consecutiv...
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks p...
-
Actor-Critic Algorithm for Dynamic Expectile and CVaR
A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical out...
-
Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning
An amortized reinforcement learning method enables immediate, observation-driven sequential optimization of genetic circuits while accounting for both intrinsic stochasticity and cross-laboratory variability without r...
-
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
OpenG2G is a new extensible simulation platform that lets users implement and compare classic, optimization, and learning-based controllers for AI datacenter power flexibility coordinated with the grid.
-
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Hidden states in recurrent RL policies correspond to PMP co-states, so a derived co-state loss structures the dynamics and yields robust performance on partially observable continuous control tasks.
-
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems
SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.
-
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
SeqLight maps music to multi-light HSV control via SkipBART for global color prediction followed by hybrid imitation learning in a goal-conditioned MDP to decompose colors across lights.
-
ANO: A Principled Approach to Robust Policy Optimization
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
-
Reinforcement Learning for Public Safety Power Shutoffs Under Decision-Dependent Uncertainty and Nonlinear Wildfire Ignition Models
Reinforcement learning learns optimal PSPS topology adjustments via simulation of any nonlinear line failure model, reducing costs versus MIP baselines on 54-bus and 138-bus systems.
-
Sample-efficient Neuro-symbolic Proximal Policy Optimization
H-PPO-Product and H-PPO-SymLoss achieve faster learning and higher final returns than standard PPO and Reward Machine baselines on OfficeWorld, WaterWorld, and DoorKey by transferring imperfect logical policy specific...
-
Compute Aligned Training: Optimizing for Test Time Inference
Compute Aligned Training derives new loss functions by modeling test-time strategies as operators on the base policy, yielding empirical gains in test-time compute scaling over standard SFT and RL.
-
K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.
-
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
-
Temporally Extended Mixture-of-Experts Models
Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...
-
Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
CMAT uses a transformer decoder to produce a high-level consensus vector in latent space, enabling simultaneous order-independent actions by all agents and optimization via single-agent PPO, with superior results on S...
-
Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms
A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.
Reference graph
Works this paper leans on
-
[1]
Neuronlike adaptive elements that can solve difficult learning control problems
Barto, Andrew G., Sutton, Richard S., and Anderson, Charles W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5): 834--846, 1983
work page 1983
-
[2]
Reinforcement learning in POMDPs via direct gradient ascent
Baxter, Jonathan and Bartlett, Peter L. Reinforcement learning in POMDPs via direct gradient ascent. In ICML, pp. 41--48, 2000
work page 2000
-
[3]
Dynamic programming and optimal control, volume 2
Bertsekas, Dimitri P. Dynamic programming and optimal control, volume 2. Athena Scientific, 2012
work page 2012
-
[4]
Convergent temporal-difference learning with arbitrary smooth function approximation
Bhatnagar, Shalabh, Precup, Doina, Silver, David, Sutton, Richard S., Maei, Hamid R., and Szepesvári, Csaba. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pp. 1204--1212, 2009
work page 2009
-
[5]
Variance reduction techniques for gradient estimates in reinforcement learning
Greensmith, Evan, Bartlett, Peter L., and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5: 1471--1530, 2004
work page 2004
-
[6]
Reinforcement learning in feedback control
Hafner, Roland and Riedmiller, Martin. Reinforcement learning in feedback control. Machine Learning, 84(1-2): 137--169, 2011
work page 2011
-
[7]
Learning continuous control policies by stochastic value gradients
Heess, Nicolas, Wayne, Greg, Silver, David, Lillicrap, Timothy, Tassa, Yuval, and Erez, Tom. Learning continuous control policies by stochastic value gradients. arXiv preprint arXiv:1510.09142, 2015
- [8]
-
[9]
Kakade, Sham. A natural policy gradient. In NIPS, volume 14, pp. 1531--1538, 2001a
work page 2001
-
[10]
Optimizing average reward using discounted rewards
Kakade, Sham. Optimizing average reward using discounted rewards. In Computational Learning Theory, pp. 605--615. Springer, 2001b
work page 2001
-
[11]
Kimura, Hajime and Kobayashi, Shigenobu. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In ICML, pp. 278--286, 1998
work page 1998
-
[12]
Konda, Vijay R. and Tsitsiklis, John N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4): 1143--1166, 2003
work page 2003
-
[13]
Continuous control with deep reinforcement learning
Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015
work page Pith review arXiv 2015
-
[14]
Approximate gradient methods in policy-space optimization of Markov reward processes
Marbach, Peter and Tsitsiklis, John N. Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems, 13(1-2): 111--148, 2003
work page 2003
-
[15]
Steps toward artificial intelligence
Minsky, Marvin. Steps toward artificial intelligence. Proceedings of the IRE, 49(1): 8--30, 1961
work page 1961
-
[16]
Policy invariance under reward transformations: Theory and application to reward shaping
Ng, Andrew Y., Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278--287, 1999
work page 1999
-
[17]
Peters, Jan and Schaal, Stefan. Natural actor-critic. Neurocomputing, 71(7): 1180--1190, 2008
work page 2008
-
[18]
Trust Region Policy Optimization
Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015
work page Pith review arXiv 2015
-
[19]
Introduction to reinforcement learning
Sutton, Richard S and Barto, Andrew G. Introduction to reinforcement learning. MIT Press, 1998
work page 1998
-
[20]
Policy gradient methods for reinforcement learning with function approximation
Sutton, Richard S., McAllester, David A., Singh, Satinder P., and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057--1063. Citeseer, 1999
work page 1999
-
[21]
Bias in natural actor-critic algorithms
Thomas, Philip. Bias in natural actor-critic algorithms. In Proceedings of The 31st International Conference on Machine Learning, pp. 441--448, 2014
work page 2014
-
[22]
Mujoco: A physics engine for model-based control
Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026--5033. IEEE, 2012
work page 2012
-
[23]
Real-time reinforcement learning by sequential actor--critics and experience replay
Wawrzyński, Paweł. Real-time reinforcement learning by sequential actor--critics and experience replay. Neural Networks, 22(10): 1484--1497, 2009
work page 2009
-
[24]
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4): 229--256, 1992
work page 1992
-
[25]
Wright, Stephen J and Nocedal, Jorge. Numerical optimization. Springer New York, 1999
work page 1999