Contrastive Representation Regularization for Vision-Language-Action Models

Changyeon Kim; Dongyoung Kim; Jimin Lee; Jinwoo Shin; Kyungmin Lee; Myungkyu Koo; Taeyoung Kim; Younggyo Seo

arxiv: 2510.01711 · v4 · pith:766G5U5Hnew · submitted 2025-10-02 · 💻 cs.RO · cs.LG

Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

keywords: models, representations, RS-CL, representation, robot, contrastive, manipulation, performance

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning while remaining lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models: it raises the prior art to 69.7% on the RoboCasa-Kitchen benchmark, achieving state-of-the-art performance, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
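
The abstract does not give the RS-CL formula, so the following is only a minimal sketch of one way "relative distances between the states as soft supervision" could be turned into a contrastive objective. Everything specific here is an assumption made for illustration and not taken from the paper: the function names, the Euclidean state distance, the softmax weighting with temperature `beta`, the cosine-similarity logits with temperature `tau`, and the 7-dimensional example state.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_targets(states, beta=1.0):
    """Per-row soft labels from pairwise state distances (assumed form).

    Closer states receive larger weight; each sample's pairing with
    itself is excluded and gets weight 0. Rows sum to 1.
    """
    dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    scores = -dist / beta
    np.fill_diagonal(scores, -np.inf)
    return np.exp(log_softmax(scores))


def state_aware_contrastive_loss(z, states, tau=0.1, beta=1.0):
    """Cross-entropy between state-derived soft labels and the
    similarity distribution over representations in the batch.

    z:      (B, D) batch of representations
    states: (B, S) batch of proprioceptive states
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)                 # drop self-pairs
    log_p = log_softmax(logits)
    np.fill_diagonal(log_p, 0.0)                      # avoid 0 * -inf below
    w = soft_targets(states, beta)
    return float(-(w * log_p).sum(axis=1).mean())


# Hypothetical batch: 8 samples, 7-dim states, 32-dim representations.
rng = np.random.default_rng(0)
states = rng.normal(size=(8, 7))
z = rng.normal(size=(8, 32))
print(state_aware_contrastive_loss(z, states))
```

In a training pipeline, a term like this would be added to the action prediction loss with some weighting coefficient; how the paper balances the two terms is not stated in the abstract.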

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contrastive Action-Image Pre-training for Visuomotor Control
cs.RO · 2026-06 · unverdicted · novelty 6.0

CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
cs.CV · 2026-06 · unverdicted · novelty 6.0

FiberTune is a new fine-tuning objective that preserves action-fiber visual residuals in VLA policies, yielding performance gains on simulation and physical robot tasks.

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning
cs.CV · 2026-05 · unverdicted · novelty 6.0

Inverse dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.