pith. sign in

arxiv: 2510.14836 · v3 · pith:ACXLWRPLnew · submitted 2025-10-16 · 💻 cs.CV · cs.RO

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

classification 💻 cs.CV cs.RO
keywords depthmodelsqdepth-vlatasksauxiliarymanipulationpredictionquantized
0
0 comments X
read the original abstract

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

    cs.RO 2026-05 conditional novelty 7.0

    EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines...

  2. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.