From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning

Shuran Song; Zhanyi Sun

arxiv: 2603.10263 · v2 · pith:ZO3TO2T6new · submitted 2026-03-10 · 💻 cs.RO · cs.LG

keywords: dice-rl, distribution, behavior, contractive, framework, learning, mastery, policy

We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: https://zhanyisun.github.io/dice.rl.2026/.
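
The abstract's "residual" and "value-guided action selection" ideas can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the function `select_action`, its parameters (`num_candidates`, `residual_scale`), and the stand-in models `toy_prior`, `toy_residual`, and `toy_q` are all hypothetical. The sketch assumes one plausible reading, in which several candidate actions are drawn from a frozen pretrained prior, each is nudged by a small learned residual, and a critic picks the highest-valued one. The selective behavior regularization mentioned in the abstract is not modeled here.

```python
import numpy as np


def select_action(obs, sample_prior, residual, q_value,
                  num_candidates=8, residual_scale=0.1, rng=None):
    """Return the highest-value action among residual-corrected prior samples.

    All arguments are illustrative assumptions, not the paper's API.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw candidate actions from the (frozen) pretrained behavior prior.
    base = np.stack([sample_prior(obs, rng) for _ in range(num_candidates)])
    # Add a small learned correction to each candidate and keep actions in [-1, 1].
    deltas = np.stack([residual(obs, a) for a in base])
    candidates = np.clip(base + residual_scale * deltas, -1.0, 1.0)
    # Score every candidate with the critic and keep the best one.
    scores = np.array([q_value(obs, a) for a in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores[best]


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_prior(obs, rng):
    # Broad-coverage prior: uniform over a 2-D action box.
    return rng.uniform(-1.0, 1.0, size=2)


def toy_residual(obs, action):
    # Bounded correction that points from the action toward the observation.
    return np.tanh(obs - action)


def toy_q(obs, action):
    # Critic that prefers actions close to the observation.
    return -float(np.sum((action - obs) ** 2))


if __name__ == "__main__":
    obs = np.array([0.3, -0.2])
    action, score = select_action(obs, toy_prior, toy_residual, toy_q,
                                  num_candidates=64,
                                  rng=np.random.default_rng(0))
    print("selected action:", action, "critic score:", score)
```

In this toy setting, raising `num_candidates` concentrates the selected actions on the high-value region of the prior, which is one informal way to picture a policy distribution being "contracted" toward successful behaviors.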

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning
cs.RO 2026-06 unverdicted novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
cs.LG 2026-06 unverdicted novelty 4.0

Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.