Policy Distillation
Original abstract
Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
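The core objective the abstract describes — training a small student to match a DQN teacher — can be sketched as a KL divergence between the teacher's temperature-sharpened Q-value softmax and the student's policy. The minimal NumPy sketch below is illustrative only: the toy Q-values, network-free "logits", and batch are invented, and the temperature value is an assumption rather than a claim about the paper's exact hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the action dimension.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(teacher_q, student_logits, tau=0.01):
    """KL(softmax(q_T / tau) || softmax(q_S)), averaged over a batch.

    A low temperature tau sharpens the teacher's action distribution,
    in the spirit of the paper's KL-based distillation objective.
    tau=0.01 here is illustrative.
    """
    p_teacher = softmax(teacher_q / tau)   # sharpened teacher policy targets
    p_student = softmax(student_logits)    # student's action distribution
    kl = (p_teacher * (np.log(p_teacher + 1e-12)
                       - np.log(p_student + 1e-12))).sum(axis=-1)
    return kl.mean()

# Toy batch: 4 states, 3 actions. The teacher strongly prefers action 0.
teacher_q = np.array([[1.0, 0.2, 0.1]] * 4)
good_student = np.array([[5.0, 0.0, 0.0]] * 4)  # agrees with the teacher
bad_student = np.array([[0.0, 5.0, 0.0]] * 4)   # disagrees with the teacher

assert distillation_loss(teacher_q, good_student) < \
       distillation_loss(teacher_q, bad_student)
```

Minimizing this loss over states drawn from the teacher's experience is what lets a much smaller student network recover expert-level behavior; for multi-task consolidation, the same loss is applied with a different teacher per task, all distilled into one student.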
Forward citations
Cited by 16 Pith papers
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
- SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
  SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
  OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
- Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation
  DeRAN converts black-box DRL policies into interpretable symbolic representations for O-RAN automation, retaining 78-87% of original performance while adding built-in transparency.
- Precise Aggressive Aerial Maneuvers with Sensorimotor Policies
  Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.
- MiniLLM: On-Policy Distillation of Large Language Models
  MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
- Progressive Neural Networks
  Progressive neural networks learn sequences of RL tasks without catastrophic forgetting by freezing prior columns and adding lateral connections for knowledge transfer.
- Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels
  A teacher-student RL policy distillation approach combined with procedural tunnel generation enables quadruped robots to traverse narrow tunnels consistently in both simulation and real-world tests.
- LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks
  LANTERN improves RL sample efficiency by 40-60% via LLM-generated task automata, semantic multi-source policy aggregation, and experience-gated adaptive transfer.
- Combining Trained Models in Reinforcement Learning
  A review of 15 studies finds positive transfer in DRL mainly when source and target tasks share structure or include alignment mechanisms, but compute-matched comparisons against from-scratch baselines remain rare.
- Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
  JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
- ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
  ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
- Digital Guardians: The Past and The Future of Cyber-Physical Resilience
  A survey frames CPS resilience through five themes and illustrates them in connected transportation and medical systems to provide a roadmap for real-world resilience.