Recognition: 1 theorem link
High-Dimensional Continuous Control Using Generalized Advantage Estimation
Pith reviewed 2026-05-11 04:15 UTC · model grok-4.3
The pith
Generalized advantage estimation reduces variance in policy gradients for high-dimensional continuous control by exponentially weighting temporal difference residuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques.
What carries the argument
The generalized advantage estimator: an exponentially-weighted sum of temporal difference residuals analogous to TD(lambda), which trades bias for lower variance in advantage estimates used by policy gradients.
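As a concrete sketch of that estimator (assuming a finite trajectory with a bootstrap value for the state after the last step; the function name and default parameters are illustrative, not taken from the paper), the exponentially weighted sum of TD residuals can be computed in a single backward pass:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted sum of TD residuals (a GAE-style sketch).

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (length T+1; values[-1] bootstraps
             the value of the state reached after the final step)
    Returns [A_0, ..., A_{T-1}] where
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = sum over l >= 0 of (gamma * lam)**l * delta_{t+l}
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running  # accumulate weighted residuals
        advantages[t] = running
    return advantages
```

With lam=0 this reduces to the one-step TD residual (more bias, least variance); with lam=1 it becomes the discounted return minus the value baseline (least bias, most variance).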
If this is right
- Neural network policies map directly from raw kinematics to joint torques without hand-crafted representations.
- Trust region optimization stabilizes improvement for both policy and value functions despite nonstationary incoming data.
- Model-free learning succeeds on running gaits for simulated bipeds and quadrupeds plus standing-up tasks.
- The amount of simulated experience needed corresponds to 1-2 weeks of real time for the biped tasks.
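The trust-region idea in the bullets above can be illustrated with the closed-form KL divergence between scalar Gaussian action distributions. This accept/reject check is a deliberate simplification (the paper's TRPO machinery solves a KL-constrained optimization rather than filtering candidate updates), and the names here are hypothetical:

```python
import math

def gaussian_kl(mu0, sigma0, mu1, sigma1):
    # KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ) for 1-D Gaussians
    return (math.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2.0 * sigma1**2)
            - 0.5)

def within_trust_region(old_policy, new_policy, delta=0.01):
    # Accept a candidate update only if the new action distribution
    # stays within KL radius delta of the old one.
    return gaussian_kl(*old_policy, *new_policy) <= delta
```

The point of the constraint is the same in the full method: successive policies must stay close in distribution space, which stabilizes learning under nonstationary data.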
Where Pith is reading between the lines
- GAE may apply to other high-variance policy optimization settings such as robotic manipulation or game playing with continuous actions.
- The bias-variance tradeoff in the estimator could be tuned per-task to optimize sample efficiency beyond the fixed lambda used here.
- Success in simulation raises the question of whether the same direct mapping from kinematics to torques would transfer to physical robots, though sim-to-real gaps are outside the paper's scope.
Load-bearing premise
A neural network value function approximator can be trained sufficiently accurately to deliver useful advantage estimates without introducing bias that negates the variance reduction.
What would settle it
If learning the 3D locomotion tasks required sample counts comparable to or higher than those of high-variance Monte Carlo policy gradients, or failed to produce stable gaits, this would show that the variance reduction is not effective in practice.
read the original abstract
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Generalized Advantage Estimation (GAE), an exponentially-weighted estimator of the advantage function (analogous to TD(λ)) derived from standard returns and value functions, to reduce variance in policy gradient estimates at the cost of bias. It combines GAE with trust-region optimization applied to both policy and value function neural networks for stable learning. Empirical results demonstrate success on challenging 3D locomotion tasks, including learning running gaits for bipedal and quadrupedal robots and standing up from a lying position, using model-free policies that map raw kinematics directly to joint torques, with simulated experience equivalent to 1-2 weeks of real time.
Significance. If the results hold, this provides a practical method for high-dimensional continuous control with neural network policies in model-free RL, addressing variance and non-stationarity issues. Strengths include the first-principles derivation of GAE from RL quantities (returns, value functions) independent of the final performance metric, the combination with trust-region constraints, and the demonstration of complex behaviors without hand-crafted representations. The work advances empirical RL for robotics-like tasks.
major comments (1)
- [Experiments] Experiments section: The central claim is that GAE reduces policy-gradient variance enough for learning on 3D locomotion while the bias from the neural-network value function approximator remains tolerable. However, the manuscript provides no direct measurement of advantage-estimate bias or variance on the learned policies, nor an ablation isolating value-function accuracy from the trust-region updates. This leaves unaddressed whether approximation error in the value function negates the variance reduction.
minor comments (2)
- [Abstract] The claim that simulated experience corresponds to '1-2 weeks of real time' should be supported by the exact number of timesteps or episodes in the main text or a table for reproducibility.
- [Method] The GAE(λ) estimator would benefit from an explicit equation and notation definition in the early sections before the empirical results.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. The positive assessment of GAE combined with trust-region optimization for high-dimensional continuous control is appreciated. We address the single major comment below.
read point-by-point responses
-
Referee: Experiments section: The central claim is that GAE reduces policy-gradient variance enough for learning on 3D locomotion while the bias from the neural-network value function approximator remains tolerable. However, the manuscript provides no direct measurement of advantage-estimate bias or variance on the learned policies, nor an ablation isolating value-function accuracy from the trust-region updates. This leaves unaddressed whether approximation error in the value function negates the variance reduction.
Authors: We agree that the manuscript does not include direct empirical measurements of bias or variance for the advantage estimates under the learned policies, nor an explicit ablation separating value-function approximation quality from the trust-region mechanism. Computing ground-truth advantages is intractable for these tasks without an optimal value function. Our defense of the central claim rests on the observed outcomes: the algorithm learns stable running gaits and stand-up behaviors on 3D bipeds and quadrupeds from raw kinematics, using only model-free experience equivalent to 1-2 weeks of real time. Such complex, high-dimensional policies would be unlikely to emerge if value-function bias dominated or if variance reduction were ineffective. The trust-region updates on both policy and value networks are presented as a joint mechanism for stability rather than isolated components. We will add a clarifying paragraph in the discussion section noting the reliance on end-to-end empirical success and the practical difficulty of direct bias/variance diagnostics in this setting. revision: partial
Circularity Check
GAE derivation is self-contained from standard RL definitions
full rationale
The paper derives the exponentially-weighted advantage estimator directly from the definitions of the advantage function A_t = Q_t - V_t and the TD residual delta_t = r_t + gamma V(s_{t+1}) - V(s_t), yielding the standard GAE(lambda) sum without any reduction to fitted parameters, self-citations, or input data by construction. Trust-region policy optimization is referenced separately and does not enter the estimator derivation. No quoted step equates a claimed prediction or result to its own inputs; the method remains falsifiable via external benchmarks on variance reduction and bias in continuous control.
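Written out to match the definitions quoted above, the residual, the estimator, and its two limiting cases are:

```latex
\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}
  = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}^{V}

\hat{A}_t^{\mathrm{GAE}(\gamma,0)} = \delta_t^{V},
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,1)}
  = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t)
```

The λ = 0 case is the one-step actor-critic estimate (most bias, least variance); λ = 1 recovers the discounted return minus a baseline (unbiased given the true value function, most variance).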
Axiom & Free-Parameter Ledger
free parameters (2)
- lambda (controls the bias-variance trade-off of the advantage estimator)
- gamma (discount factor, which also introduces bias in exchange for variance reduction)
axioms (1)
- domain assumption The environment is a Markov decision process with stationary dynamics.
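A rough intuition for the lambda parameter in the ledger: the geometric weights (gamma * lambda)**l sum to 1 / (1 - gamma * lambda), which can be read as an effective horizon over which TD residuals contribute to each advantage estimate. A minimal illustration (the function name is ours, not the paper's):

```python
def effective_horizon(gamma, lam):
    # Sum of the geometric weight series (gamma * lam)**l for l = 0, 1, 2, ...
    # Valid only when gamma * lam < 1.
    return 1.0 / (1.0 - gamma * lam)
```

At gamma = 0.99 and lambda = 0.95 this is about 17 steps; at lambda = 0 it collapses to a single step.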
Forward citations
Cited by 60 Pith papers
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
-
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
-
Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization
An adaptive smooth Tchebycheff controller for multi-objective RL lets agents reach non-convex Pareto regions in robotic tasks while avoiding the instability of static non-linear scalarizations.
-
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning
ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
Controllability in preference-conditioned multi-objective reinforcement learning
Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.
-
Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems
OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-...
-
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
CaTR applies value-decomposed RL with hierarchical conflict-aware observations to achieve better safety-efficiency trade-offs than planning, optimization, and standard RL baselines in a realistic airport taxiway simulation.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Learning Visual Feature-Based World Models via Residual Latent Action
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
-
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...
-
Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks
Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.
-
Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting
LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.
-
Financial Market as a Self-Organized Ecosystem: Simulation via Learning with Heterogeneous Preferences
Multi-agent reinforcement learning with heterogeneous preferences leads to emergent role specialization whose interactions produce fat-tailed returns and volatility clustering, offering a computational realization of ...
-
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
-
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
-
Bounded Ratio Reinforcement Learning
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
-
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
-
Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks
PE-MAMoE combines sparsely gated mixture-of-experts actors with a non-parametric phase controller in MAPPO to maintain plasticity under dynamic user mobility and traffic, yielding 26.3% higher normalized IQM return in...
-
Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies
CayleyTopo uses reinforcement learning to optimize Cayley graph generators for lower diameter, yielding faster and more resilient information flow in multi-agent systems than hand-crafted sparse topologies.
-
Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters
A hybrid neural policy operating in impulse space enables physics-based characters to track exaggerated, dynamically infeasible motions that standard DRL methods cannot stabilize.
-
A semicontinuous relaxation of Saito's criterion and freeness as angular minimization
A new functional S vanishes precisely on free line arrangements and enables discovery of verified free examples for every admissible exponent pair with up to 20 lines.
-
Mastering Atari with Discrete World Models
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
-
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
-
Concrete Problems in AI Safety
The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.
-
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
-
What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models
PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...
-
Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion
Explicit conditioning of a PPO policy on interpretable stair parameters (height, depth, yaw) yields improved generalization to unseen stairs and reliable real-world traversal on the Unitree G1, including 33 consecutiv...
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks p...
-
Actor-Critic Algorithm for Dynamic Expectile and CVaR
A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical out...
-
Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning
An amortized reinforcement learning method enables immediate, observation-driven sequential optimization of genetic circuits while accounting for both intrinsic stochasticity and cross-laboratory variability without r...
-
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination
OpenG2G is a new extensible simulation platform that lets users implement and compare classic, optimization, and learning-based controllers for AI datacenter power flexibility coordinated with the grid.
-
Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
Hidden states in recurrent RL policies correspond to PMP co-states, so a derived co-state loss structures the dynamics and yields robust performance on partially observable continuous control tasks.
-
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems
SOAR is a unified DRL method using soft allocations, event-driven MDP, and heterogeneous graph transformers that cuts global makespan by 7.5% and average order completion time by 15.4% at sub-100ms latency in RMFS.
-
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
SeqLight maps music to multi-light HSV control via SkipBART for global color prediction followed by hybrid imitation learning in a goal-conditioned MDP to decompose colors across lights.
-
ANO: A Principled Approach to Robust Policy Optimization
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
-
Reinforcement Learning for Public Safety Power Shutoffs Under Decision-Dependent Uncertainty and Nonlinear Wildfire Ignition Models
Reinforcement learning learns optimal PSPS topology adjustments via simulation of any nonlinear line failure model, reducing costs versus MIP baselines on 54-bus and 138-bus systems.
-
Sample-efficient Neuro-symbolic Proximal Policy Optimization
H-PPO-Product and H-PPO-SymLoss achieve faster learning and higher final returns than standard PPO and Reward Machine baselines on OfficeWorld, WaterWorld, and DoorKey by transferring imperfect logical policy specific...
-
Compute Aligned Training: Optimizing for Test Time Inference
Compute Aligned Training derives new loss functions by modeling test-time strategies as operators on the base policy, yielding empirical gains in test-time compute scaling over standard SFT and RL.
-
K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.
-
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
-
Temporally Extended Mixture-of-Experts Models
Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...
-
Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
CMAT uses a transformer decoder to produce a high-level consensus vector in latent space, enabling simultaneous order-independent actions by all agents and optimization via single-agent PPO, with superior results on S...
-
Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms
A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.
Reference graph
Works this paper leans on
-
[1]
Neuronlike adaptive elements that can solve difficult learning control problems
Barto, Andrew G., Sutton, Richard S., and Anderson, Charles W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5): 834--846, 1983
work page 1983
-
[2]
Reinforcement learning in POMDPs via direct gradient ascent
Baxter, Jonathan and Bartlett, Peter L. Reinforcement learning in POMDPs via direct gradient ascent. In ICML, pp. 41--48, 2000
work page 2000
-
[3]
Dynamic programming and optimal control, volume 2
Bertsekas, Dimitri P. Dynamic programming and optimal control, volume 2. Athena Scientific, 2012
work page 2012
-
[4]
Convergent temporal-difference learning with arbitrary smooth function approximation
Bhatnagar, Shalabh, Precup, Doina, Silver, David, Sutton, Richard S., Maei, Hamid R., and Szepesvári, Csaba. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pp. 1204--1212, 2009
work page 2009
-
[5]
Variance reduction techniques for gradient estimates in reinforcement learning
Greensmith, Evan, Bartlett, Peter L., and Baxter, Jonathan. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5: 1471--1530, 2004
work page 2004
-
[6]
Reinforcement learning in feedback control
Hafner, Roland and Riedmiller, Martin. Reinforcement learning in feedback control. Machine Learning, 84(1-2): 137--169, 2011
work page 2011
-
[7]
Learning continuous control policies by stochastic value gradients
Heess, Nicolas, Wayne, Greg, Silver, David, Lillicrap, Timothy, Tassa, Yuval, and Erez, Tom. Learning continuous control policies by stochastic value gradients. arXiv preprint arXiv:1510.09142, 2015
- [8]
-
[9]
Kakade, Sham. A natural policy gradient. In NIPS, volume 14, pp. 1531--1538, 2001a
work page 2001
-
[10]
Optimizing average reward using discounted rewards
Kakade, Sham. Optimizing average reward using discounted rewards. In Computational Learning Theory, pp. 605--615. Springer, 2001b
work page 2001
-
[11]
Kimura, Hajime and Kobayashi, Shigenobu. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In ICML, pp. 278--286, 1998
work page 1998
-
[12]
Konda, Vijay R. and Tsitsiklis, John N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4): 1143--1166, 2003
work page 2003
-
[13]
Continuous control with deep reinforcement learning
Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015
work page Pith review arXiv 2015
-
[14]
Approximate gradient methods in policy-space optimization of Markov reward processes
Marbach, Peter and Tsitsiklis, John N. Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems, 13(1-2): 111--148, 2003
work page 2003
-
[15]
Steps toward artificial intelligence
Minsky, Marvin. Steps toward artificial intelligence. Proceedings of the IRE, 49(1): 8--30, 1961
work page 1961
-
[16]
Policy invariance under reward transformations: Theory and application to reward shaping
Ng, Andrew Y., Harada, Daishi, and Russell, Stuart. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278--287, 1999
work page 1999
-
[17]
Peters, Jan and Schaal, Stefan. Natural actor-critic. Neurocomputing, 71(7): 1180--1190, 2008
work page 2008
-
[18]
Trust Region Policy Optimization
Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015
work page Pith review arXiv 2015
-
[19]
Introduction to reinforcement learning
Sutton, Richard S and Barto, Andrew G. Introduction to reinforcement learning. MIT Press, 1998
work page 1998
-
[20]
Policy gradient methods for reinforcement learning with function approximation
Sutton, Richard S., McAllester, David A., Singh, Satinder P., and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057--1063. Citeseer, 1999
work page 1999
-
[21]
Bias in natural actor-critic algorithms
Thomas, Philip. Bias in natural actor-critic algorithms. In Proceedings of The 31st International Conference on Machine Learning, pp. 441--448, 2014
work page 2014
-
[22]
Mujoco: A physics engine for model-based control
Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026--5033. IEEE, 2012
work page 2012
-
[23]
Real-time reinforcement learning by sequential actor--critics and experience replay
Wawrzyński, Paweł. Real-time reinforcement learning by sequential actor--critics and experience replay. Neural Networks, 22(10): 1484--1497, 2009
work page 2009
-
[24]
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4): 229--256, 1992
work page 1992
-
[25]
Wright, Stephen J and Nocedal, Jorge. Numerical optimization. Springer New York, 1999
work page 1999