RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

Xiao Li; Yixin Han; Yize Huang; Yuxuan Chen

arxiv: 2506.17639 · v2 · pith:O6FWNGQXnew · submitted 2025-06-21 · 💻 cs.RO · cs.AI

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

Yuxuan Chen , Yixin Han , Yize Huang , Xiao Li This is my paper

classification 💻 cs.RO cs.AI

keywords rlrccompressionrecoverydeploymentinferencemodelstimesvision-language-action

0 comments

read the original abstract

Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present \textit{RLRC}, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8 times memory reduction and 2.3 times inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vesta: A Generalist Embodied Reasoning Model
cs.RO 2026-06 unverdicted novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
cs.CV 2026-05 unverdicted novelty 6.0

A joint architecture-token-bitwidth optimization of Vision Transformers delivers over 10x gains in throughput, parameters, FLOPs and energy on a semiconductor defect classification task while preserving required accuracy.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 unverdicted novelty 6.0

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 conditional novelty 6.0

FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
cs.LG 2025-10 unverdicted novelty 6.0

DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation
cs.CV 2026-05 unverdicted novelty 5.0

SAFE-Pruner forecasts deep-layer token saliency in VLA models via semantic attention consistency and adaptive subtask detection to achieve up to 1.89x speedup with under 1.7% success rate loss.
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.