Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Ilya Sutskever, Jonathan Ho, Szymon Sidor, Tim Salimans, Xi Chen

classification stat.ML cs.AI cs.LG cs.NE

keywords alternative atari black evolution extremely optimization strategies strategy

abstract

We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
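The scalar-only communication described in the abstract can be illustrated with a minimal single-process sketch. It assumes each worker's perturbation is fully determined by an integer seed known to every worker, so only the scalar return needs to be shared and each perturbation can be regenerated locally. The toy `objective`, the mean-return baseline, and all function names and hyperparameters here are hypothetical choices for illustration, not the authors' implementation.

```python
import random

def noise(seed, dim):
    """Regenerate the Gaussian perturbation identified by `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def objective(theta):
    """Toy black-box return (hypothetical): maximal at theta = [3, -2]."""
    target = [3.0, -2.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

def worker_evaluate(theta, seed, sigma):
    """One worker's job: perturb the parameters and report a single scalar."""
    eps = noise(seed, len(theta))
    return objective([t + sigma * e for t, e in zip(theta, eps)])

def es_update(theta, seeds, returns, sigma, lr):
    """Rebuild every perturbation from its seed and take one ES ascent step."""
    n = len(seeds)
    baseline = sum(returns) / n  # mean baseline, an assumed variance-reduction choice
    grad = [0.0] * len(theta)
    for seed, ret in zip(seeds, returns):
        eps = noise(seed, len(theta))  # no parameter vectors are exchanged
        for k, e in enumerate(eps):
            grad[k] += (ret - baseline) * e
    return [t + lr * g / (n * sigma) for t, g in zip(theta, grad)]

def run_es(iterations=200, n_workers=50, sigma=0.1, lr=0.05, master_seed=0):
    seed_rng = random.Random(master_seed)
    theta = [0.0, 0.0]
    for _ in range(iterations):
        # Seeds are agreed on up front; only the scalar returns would travel
        # between workers in a distributed setting.
        seeds = [seed_rng.randrange(2**31) for _ in range(n_workers)]
        returns = [worker_evaluate(theta, s, sigma) for s in seeds]
        theta = es_update(theta, seeds, returns, sigma, lr)
    return theta

final_theta = run_es()
print(final_theta)
```

In a real distributed run, each `worker_evaluate` call would execute on a separate machine, and every worker would apply `es_update` itself after receiving the other workers' scalar returns. The per-iteration message size is then independent of the number of policy parameters.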

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training Non-Differentiable Networks via Optimal Transport
cs.LG 2026-05 unverdicted novelty 8.0

PolyStep optimizes non-differentiable networks via forward-only polytope evaluations and optimal-transport barycentric updates, reaching 93.4% accuracy on hard-LIF spiking networks while outperforming gradient-free baselines.
Convergence of zeroth-order proximal point algorithms in the high-temperature regime
math.OC 2026-05 unverdicted novelty 7.0

ZOPPA at fixed positive temperature converges under minimal assumptions by acting as an exact proximal point method on a smoothed objective, with explicit connections back to the original function and convergence for ...
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution
cs.NE 2026-05 unverdicted novelty 7.0

QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR 2026-05 unverdicted novelty 7.0

PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles
cs.NE 2026-05 unverdicted novelty 7.0

SGD, approximations of Newton's method, natural gradient descent, and Adam are proven compatible with evolutionary dynamics when augmented with DLS noise, turning them into valid in silico simulations of asexual Darwi...
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs
cs.CE 2026-05 unverdicted novelty 7.0

HyCOP learns policies over compositions of hybrid modules to produce interpretable programs for parametric PDE solution operators with order-of-magnitude OOD gains over monolithic neural operators.
Zeroth-Order Optimization at the Edge of Stability
cs.LG 2026-04 unverdicted novelty 7.0

Zeroth-order methods achieve mean-square stability when the step size satisfies a condition involving the entire Hessian spectrum, with full-batch ZO optimizers operating at the edge of stability and large steps regul...
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
Importance Sampling Optimization with Laplace Principle
math.OC 2026-04 unverdicted novelty 7.0

A Laplace-inspired importance sampling scheme for averaging random search points achieves error of order n to the power -2/(d+2) after n evaluations, improving on the n to the power -1/d rate of standard random search...
From Cursed to Competitive: Closing the ZO-FO Gap via Input-to-State Stability
math.OC 2026-04 unverdicted novelty 6.0

Zeroth-order methods achieve the same expected convergence rate as first-order methods without extra dimension dependence by treating them as input-to-state stable systems with controllable perturbations.
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
cs.LG 2026-04 unverdicted novelty 6.0

Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
cs.LG 2026-04 unverdicted novelty 6.0

Tempered sequential Monte Carlo samples efficiently from a temperature-annealed distribution over controller parameters to solve trajectory and policy optimization under differentiable dynamics.
Soft Deterministic Policy Gradient with Gaussian Smoothing
cs.LG 2026-05 unverdicted novelty 5.0

Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...
Direct From Darwin: Deriving Advanced Optimizers From Evolutionary First Principles
cs.NE 2026-05 unverdicted novelty 5.0

Darwinian Lineage Simulations unify Fisher's and Wright's evolutionary models and prove that SGD, Natural Gradient Descent, Damped Newton, and Adam become evolutionarily faithful simulations when augmented with DLS noise.
Neural Control: Adjoint Learning Through Equilibrium Constraints
cs.RO 2026-05 unverdicted novelty 5.0

Neural Control introduces adjoint-based differentiation through implicit equilibrium constraints to enable memory-efficient gradient computation and robust receding-horizon MPC for multi-stable deformable object manip...
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
cs.LG 2026-05 unverdicted novelty 5.0

Stable-GFlowNet improves training stability and attack diversity in LLM red-teaming by eliminating Z estimation via contrastive trajectory balance while preserving GFN optimality.
Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?
cs.RO 2026-04 unverdicted novelty 5.0

TFM-S3 uses a tabular foundation model to predict returns and guide intermittent global exploration within an SVD-derived policy subspace, yielding faster early convergence and better final performance than TD3 and po...
Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models
cs.LG 2026-04 unverdicted novelty 5.0

HDET lets data-parallel replicas explore a spread of learning rates independently before averaging parameters, with an auto-LR controller driven by inter-replica loss differences to produce a self-adapting schedule wi...
On the importance of hyperparameters in initializing parameterized quantum circuits
quant-ph 2026-04 unverdicted novelty 5.0

An evolutionary algorithm optimizes initialization hyperparameters for quantum circuits, leading to faster convergence without worsening barren plateaus.