SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
hub Canonical reference
Conrft: A reinforced fine-tuning method for vla models via con- sistency policy.arXiv preprint arXiv:2502.05450
Canonical reference. 89% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
AutoSERL achieves strong performance on six real-world robot manipulation tasks using RL guided by a single demonstration via sliding-window intervention, safety recovery, and automatic termination.
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.
SARM2 presents RM, a multi-task stage-aware reward model achieving 80% lower value-estimation MSE, which when used in SPIRAL boosts manipulation task success from ~50% to near-perfect on several benchmarks.
FiberTune is a new fine-tuning objective that preserves action-fiber visual residuals in VLA policies, yielding performance gains on simulation and physical robot tasks.
PACT calibrates credit assignment in HIL-RL by penalizing Bellman targets on suboptimal segments using counterfactual advantages from human-policy preference pairs, yielding 24.5% higher success rates and 1.3x faster convergence on five real-robot tasks.
HandITL enables seamless human intervention in VLA policies for bimanual dexterous manipulation, cutting jitter by 99.8% and improving refined policies by 19% over standard teleoperation.
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's
TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
WorldSample generates synthetic transitions from a post-trained world model grounded in real rollouts and uses Policy-Paced Learning to improve RL policies, reporting 28% higher success rates and 59% fewer training steps on contact-rich robot tasks.
AllDayNav encodes scene dynamics into a large model's parameters via RL and a multimodal memory, achieving near-100% success rates in lifelong navigation and outperforming map-based and VLM baselines.
citing papers explorer
-
FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
FiberTune is a new fine-tuning objective that preserves action-fiber visual residuals in VLA policies, yielding performance gains on simulation and physical robot tasks.