arxiv: 2603.10263 · v2 · pith:ZO3TO2T6new · submitted 2026-03-10 · 💻 cs.RO · cs.LG

From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning

classification 💻 cs.RO cs.LG
keywords: dice-rl, distribution, behavior, contractive, framework, learning, mastery, policy

We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: https://zhanyisun.github.io/dice.rl.2026/.
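The abstract names two ingredients, a residual on top of a pretrained prior and value-guided action selection, without giving details. The following is only a toy sketch of how such a combination could look in one dimension; it is not the authors' implementation. Every name in it (`TARGET`, `sample_prior_actions`, `residual`, `q_value`, `select_action`) is hypothetical, and the prior, residual, and critic are hand-written stand-ins rather than learned models.

```python
import random

# Hypothetical 1-D "successful" action, used only to build toy stand-ins.
TARGET = 2.0


def sample_prior_actions(state, n, rng):
    """Toy stand-in for a pretrained generative policy: n broad candidates."""
    return [state + rng.gauss(0.0, 1.0) for _ in range(n)]


def residual(state, action):
    """Toy stand-in for a learned residual: a small correction to the prior action."""
    return 0.1 * (TARGET - action)


def q_value(state, action):
    """Toy stand-in for a learned critic: larger when the action is near TARGET."""
    return -((action - TARGET) ** 2)


def select_action(state, n_candidates, rng):
    """Sample candidates from the prior, refine each, keep the highest-value one."""
    candidates = sample_prior_actions(state, n_candidates, rng)
    refined = [a + residual(state, a) for a in candidates]
    return max(refined, key=lambda a: q_value(state, a))


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [select_action(0.0, 16, rng) for _ in range(5)]
    print(picks)
```

In this toy setting, keeping the best of several prior samples concentrates the executed actions around the high-value region while each action stays close to something the prior proposed, which is one informal reading of "distribution contraction".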

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adapting Generalist Robot Policies with Semantic Reinforcement Learning

    cs.RO 2026-06 unverdicted novelty 7.0

    SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

  2. Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

  3. Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

    cs.LG 2026-06 unverdicted novelty 4.0

    Coherent IRL learns dense rewards from demos to enable sample-efficient off-policy improvement of large behavior-cloned policies on sparse robotic manipulation tasks.