Policy Distillation
Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
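The abstract does not say which loss is used to transfer the teacher's policy to the student. As a minimal sketch, assume one common choice: a KL divergence between a temperature-sharpened softmax over the teacher's Q-values and the student's softmax output, computed for a single state. The function names, the example values, and the default temperature are illustrative assumptions, not details taken from the abstract.

```python
import math


def log_softmax(values, temperature=1.0):
    """Numerically stable log-softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - peak - log_total for s in scaled]


def distillation_loss(teacher_q, student_logits, temperature=0.01):
    """KL(teacher || student) for one state.

    teacher_q:      Q-values from the (assumed) DQN teacher, one per action.
    student_logits: raw outputs of the student network, one per action.
    temperature:    hypothetical default; values below 1 sharpen the teacher
                    distribution toward its greedy action.
    """
    log_p = log_softmax(teacher_q, temperature)  # teacher target
    log_q = log_softmax(student_logits)          # student prediction
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher_q = [1.0, 2.0, 0.5]  # made-up Q-values for three actions
    agreeing_student = [0.0, 5.0, 0.0]     # prefers the teacher's best action
    disagreeing_student = [5.0, 0.0, 0.0]  # prefers a different action
    print(distillation_loss(teacher_q, agreeing_student))
    print(distillation_loss(teacher_q, disagreeing_student))
```

In a training loop this per-state loss would be averaged over states sampled from the teacher's experience and minimized with respect to the student's parameters. For the multi-task case, the same loss could be applied with batches drawn from each teacher's data in turn, though the abstract does not describe that procedure.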
Forward citations
Cited by 17 Pith papers
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
-
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering
Policy-DRIFT combines conditional flow matching with terminal reward guidance and decoupled DRL to achieve 49% drag reduction in Re_tau=180 channel flow, 16% above DRL benchmarks and with 37 times less actuation energy.
-
Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation
DeRAN converts black-box DRL policies into interpretable symbolic representations for O-RAN automation, retaining 78-87% of original performance while adding built-in transparency.
-
Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation
DeRAN converts opaque DRL policies for O-RAN tasks into interpretable symbolic policies via concept abstraction, deep symbolic regression, and neurally guided logic, retaining 78-87% of DRL performance on a live 5G testbed.
-
Precise Aggressive Aerial Maneuvers with Sensorimotor Policies
Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.
-
MiniLLM: On-Policy Distillation of Large Language Models
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
-
Progressive Neural Networks
Progressive neural networks learn sequences of RL tasks without catastrophic forgetting by freezing prior columns and adding lateral connections for knowledge transfer.
-
Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels
A teacher-student RL policy distillation approach combined with procedural tunnel generation enables quadruped robots to traverse narrow tunnels consistently in both simulation and real-world tests.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
-
LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks
LANTERN improves RL sample efficiency by 40-60% via LLM-generated task automata, semantic multi-source policy aggregation, and experience-gated adaptive transfer.
-
Combining Trained Models in Reinforcement Learning
A review of 15 studies finds positive transfer in DRL mainly when source and target tasks share structure or include alignment mechanisms, but compute-matched comparisons against from-scratch baselines remain rare.
-
Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
Digital Guardians: The Past and The Future of Cyber-Physical Resilience
A survey frames CPS resilience through five themes and illustrates them in connected transportation and medical systems to provide a roadmap for real-world resilience.