Evolution Strategies as a Scalable Alternative to Reinforcement Learning
We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
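The communication idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a toy quadratic `objective` in place of an episode return, a mean baseline for variance reduction, and hypothetical helper names (`perturbation`, `es_step`, `run`) and hyperparameters (`sigma`, `lr`). The point it shows is that, when every worker knows the random seeds, each worker can regenerate every perturbation locally, so only scalar returns would need to be exchanged.

```python
import numpy as np

def objective(theta):
    # Hypothetical stand-in for an episode return (higher is better, optimum at 3).
    return -float(np.sum((theta - 3.0) ** 2))

def perturbation(seed, dim):
    # Common random numbers: the same seed always yields the same noise vector,
    # so any worker can rebuild another worker's perturbation without receiving it.
    return np.random.default_rng(seed).standard_normal(dim)

def es_step(theta, seeds, sigma=0.1, lr=0.05):
    dim = theta.size
    # Evaluation phase: each "worker" scores one perturbed parameter vector.
    # In a distributed run, only these scalars would be communicated.
    returns = np.array(
        [objective(theta + sigma * perturbation(s, dim)) for s in seeds]
    )
    # Assumption: subtract the mean return as a simple baseline.
    weights = returns - returns.mean()
    # Reconstruction phase: rebuild each perturbation from its seed and
    # combine it with the corresponding scalar to estimate the gradient.
    grad = np.zeros(dim)
    for s, w in zip(seeds, weights):
        grad += w * perturbation(s, dim)
    grad /= len(seeds) * sigma
    # Gradient ascent on the expected return.
    return theta + lr * grad

def run(iterations=300, workers=50, dim=5, seed=0):
    master = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iterations):
        # Seeds are assumed to be known to all workers in advance.
        seeds = [int(s) for s in master.integers(0, 2**31 - 1, size=workers)]
        theta = es_step(theta, seeds)
    return theta

if __name__ == "__main__":
    print(run())
```

In this sketch the per-iteration message from each worker is a single float, regardless of the dimension of `theta`. A real policy search would replace `objective` with an environment rollout.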
Forward citations
Cited by 20 Pith papers
-
Training Non-Differentiable Networks via Optimal Transport
PolyStep optimizes non-differentiable networks via forward-only polytope evaluations and optimal-transport barycentric updates, reaching 93.4% accuracy on hard-LIF spiking networks while outperforming gradient-free baselines.
-
Convergence of zeroth-order proximal point algorithms in the high-temperature regime
ZOPPA at fixed positive temperature converges under minimal assumptions by acting as an exact proximal point method on a smoothed objective, with explicit connections back to the original function and convergence for ...
-
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution
QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles
SGD, approximations of Newton's method, natural gradient descent, and Adam are proven compatible with evolutionary dynamics when augmented with DLS noise, turning them into valid in silico simulations of asexual Darwi...
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs
HyCOP learns policies over compositions of hybrid modules to produce interpretable programs for parametric PDE solution operators with order-of-magnitude OOD gains over monolithic neural operators.
-
Zeroth-Order Optimization at the Edge of Stability
Zeroth-order methods achieve mean-square stability when the step size satisfies a condition involving the entire Hessian spectrum, with full-batch ZO optimizers operating at the edge of stability and large steps regul...
-
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
-
Importance Sampling Optimization with Laplace Principle
A Laplace-inspired importance sampling scheme for averaging random search points achieves error of order n to the power -2/(d+2) after n evaluations, improving on the n to the power -1/d rate of standard random search...
-
From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability
Zeroth-order methods achieve the same expected convergence rate as first-order methods without extra dimension dependence by treating them as input-to-state stable systems with controllable perturbations.
-
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.
-
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
Tempered sequential Monte Carlo samples efficiently from a temperature-annealed distribution over controller parameters to solve trajectory and policy optimization under differentiable dynamics.
-
Soft Deterministic Policy Gradient with Gaussian Smoothing
Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...
-
Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles
Darwinian Lineage Simulations unify Fisher's and Wright's evolutionary models and prove that SGD, Natural Gradient Descent, Damped Newton, and Adam become evolutionarily faithful simulations when augmented with DLS noise.
-
Neural Control: Adjoint Learning Through Equilibrium Constraints
Neural Control introduces adjoint-based differentiation through implicit equilibrium constraints to enable memory-efficient gradient computation and robust receding-horizon MPC for multi-stable deformable object manip...
-
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
Stable-GFlowNet improves training stability and attack diversity in LLM red-teaming by eliminating Z estimation via contrastive trajectory balance while preserving GFN optimality.
-
Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?
TFM-S3 uses a tabular foundation model to predict returns and guide intermittent global exploration within an SVD-derived policy subspace, yielding faster early convergence and better final performance than TD3 and po...
-
Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models
HDET lets data-parallel replicas explore a spread of learning rates independently before averaging parameters, with an auto-LR controller driven by inter-replica loss differences to produce a self-adapting schedule wi...
-
On the importance of hyperparameters in initializing parameterized quantum circuits
An evolutionary algorithm optimizes initialization hyperparameters for quantum circuits, leading to faster convergence without worsening barren plateaus.