Meta reinforcement learning as task inference
read the original abstract
Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent's observations with an estimate of the belief about the task based on past experience. However estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
-
Solving Rubik's Cube with a Robot Hand
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Neural Operators for Multi-Task Control and Adaptation
Neural operators approximate the solution operator for multi-task optimal control, generalizing to new tasks and enabling efficient adaptation via branch-trunk structure and meta-training.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-wor...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.