pith. machine review for the scientific record. sign in

arxiv: 1611.02779 · v2 · submitted 2016-11-09 · 💻 cs.AI · cs.LG· cs.NE· stat.ML

Recognition: unknown

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

Authors on Pith no claims yet
classification 💻 cs.AI cs.LGcs.NEstat.ML
keywords learningalgorithmreinforcementfastproblemsdeeplarge-scalelearn
0
0 comments X
read the original abstract

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  3. Zero-shot Imitation Learning by Latent Topology Mapping

    cs.LG 2026-05 unverdicted novelty 7.0

    ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

  4. Searching for Activation Functions

    cs.NE 2017-10 conditional novelty 7.0

    Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

  5. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  6. Dual-Timescale Memory in a Spiking Neuron-Astrocyte Network for Efficient Navigation

    q-bio.QM 2026-04 unverdicted novelty 6.0

    A neuron-astrocyte network with dual-timescale memory reduces median path lengths up to sixfold in partially observable grid-world navigation tasks.

  7. Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 5.0

    Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...

  8. Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation

    cs.LG 2026-05 unverdicted novelty 5.0

    A learned context-energy term in port-Hamiltonian policies creates selective risk navigation that activates evasive forces only when safer paths are available.

  9. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.