{"total":13,"items":[{"citing_arxiv_id":"2605.19919","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-19T14:43:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12236","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning","primary_cat":"cs.RO","submitted_at":"2026-05-12T15:07:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A dominant paradigm for training real-world robotic policies is to first pre-train on large-scale demonstration datasets via imitation learning, and then fine-tune the resulting policy with reinforcement learning (RL) in deployment environments [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. This RL fine-tuning stage is critical for improving task precision, throughput and robustness [11, 6, 12], yet remains bottlenecked by sample efficiency due to the cost of real-world interaction. 
While prior work in robotics primarily focuses on improving the RL algorithms themselves, comparatively little attention has been paid to ensuring that pre-trained policies provide effective initializations for downstream RL. In this work, we propose a"},{"citing_arxiv_id":"2605.03065","ref_index":153,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OGPO: Sample Efficient Full-Finetuning of Generative Control Policies","primary_cat":"cs.LG","submitted_at":"2026-05-04T18:36:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00416","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-05-01T05:20:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"recoveries, partial progress, and task rewards. Reinforcement Learning (RL) in principle provides such a mechanism by optimizing policy behavior from task outcomes and policy experience [8, 9, 10, 11]. Yet existing RL approaches for robotics are often limited to small-scale, short-horizon, or task-specific settings, and frequently specialize a pretrained generalist policy to a narrow task [12, 13, 14]. A scalable method for post-training end-to-end VLA policies from fleet deployment experience while preserving their generality remains an open problem. 
Addressing this gap requires an RL algorithm for LWD that is compatible with pretrained VLA policies, can learn from large offline and off-policy datasets, and can adapt rapidly as new deployment data streams in."},{"citing_arxiv_id":"2604.23073","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RL Token: Bootstrapping Online RL with Vision-Language-Action Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T23:57:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"large demonstration datasets has recently emerged as the dominant paradigm for training generalist robot manipulation policies (see e.g. [6-11]). Two critical ingredients that have enabled this success are action chunking [12], which predicts multiple actions for sequential open-loop execution, and using expressive output distributions, such as diffusion [13] or autoregressive generation [6], that can capture the multimodality inherent in demonstration data. A further advancement came from using large pretrained vision-language models as a backbone for language-conditioned generalist policies, yielding vision-language-action (VLA) models [6, 7]. These models import large web-scale prior knowledge into closed-loop robot policies. 
Recent work has combined VLA backbones with"},{"citing_arxiv_id":"2604.22235","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning-augmented robotic automation for real-world manufacturing","primary_cat":"cs.RO","submitted_at":"2026-04-24T05:20:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A learning-augmented robotic system automated deformable cable insertion and soldering on a live electric-motor production line for 5 hours 10 minutes, producing 108 motors at 99.4% pass rate with under 20 minutes of real-world data per task and no physical fencing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10165","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks","primary_cat":"cs.RO","submitted_at":"2026-04-11T11:24:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compounding errors, which makes it struggle with the stringent requirements of high-precision manipulation [3], [4]. Conversely, Reinforcement Learning (RL) [5] optimizes policies through trial-and-error exploration and reward signals, allowing agents to perform beyond the scope of demonstration data. RL has shown significant robustness, especially in contact-rich tasks [6]-[9]. 
Recent progress in Human-in-the-Loop RL (HIL-RL) within real-world environments [9]-[12] has enabled the use of real-time human interventions to refine policies during on-robot deployment, leading to performance improvements in robotic tasks. Despite these gains, HIL-RL still faces inherent challenges such as slow convergence This work was supported by the National Natural Science Foundation of"},{"citing_arxiv_id":"2603.15759","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation","primary_cat":"cs.RO","submitted_at":"2026-03-16T18:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.12243","ref_index":58,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-03-12T17:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.16712","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One Hand to Rule Them All: Canonical Representations for Unified Dexterous 
Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-18T18:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified parameter space and canonical URDF enable cross-embodiment dexterous grasping policies with 81.9% zero-shot success on unseen hands like the 3-finger LEAP Hand.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11075","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RISE: Self-Improving Robot Policy with Compositional World Model","primary_cat":"cs.RO","submitted_at":"2026-02-11T17:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09023","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-09T18:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.14759","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\pi^{*}_{0.6}$: a VLA That Learns From 
Experience","primary_cat":"cs.LG","submitted_at":"2025-11-18T18:58:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In contrast to these works, our method uses both expert interventions and fully autonomous experience, resulting in an RL-based framework that integrates multiple data sources. There is a large body of work on using RL for autonomous improvement of robotic manipulation policies [13-21], including methods using diffusion-based policies [22-24], in multi-task settings [25, 26], and using pre-trained multi-task policies [27-29]. Unlike these works, we study how to scale real-world RL to large VLA policies for long-horizon, fine-grained manipulation tasks. Many recent works have studied how to improve a base VLA model through RL. Several works directly apply the proximal policy optimization (PPO) algorithm and variations"}],"limit":50,"offset":0}