Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation

Bofan Lyu; Boyu Ma; Chuhao Zhou; Geng Li; Gen Li; Jianfei Yang; Jiaqi Bai; Kuangji Zuo; Shijia Han; Xichen Yuan

arxiv: 2605.30282 · v1 · pith:NSC7EKMEnew · submitted 2026-05-28 · 💻 cs.RO

Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, Geng Li, Jianfei Yang

classification: cs.RO

keywords: intent, gaze, gaze2act, object, robot, dynamic, human, interactive

Vision-Language-Action (VLA) models have recently shown strong potential for robot learning by following language instructions. However, in practice, language alone is often insufficient to precisely convey human intent. It is difficult to describe which exact object to interact with among similar candidates, where to act on the object, or how the target may change during execution. To address this limitation, we propose Gaze2Act, a novel VLA framework that leverages human gaze as a dynamic and intuitive intent signal for complex interactive manipulation. Gaze2Act first bridges the ego-exo view gap by mapping first-person gaze into the robot's perspective through cross-view semantic matching, producing both an object mask and a gaze point for coarse-to-fine target specification. These cues are then integrated into the policy through perception-level prompting and action-level conditioning, allowing the robot to attend to relevant regions and execute precise interactions under dynamic intent. In a systematic evaluation across seven task categories and 16 real-robot tasks on a Unitree G1 humanoid, Gaze2Act achieves state-of-the-art performance in both intent accuracy and task success rate. It notably outperforms baselines in object disambiguation, fine-grained interaction, and dynamic intent steering. These results demonstrate that human gaze provides a natural, low-burden, and highly expressive modality for human-in-the-loop VLA control.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining
cs.RO 2026-06 unverdicted novelty 6.0

LA4VLA pretrains on language-action pairs from decomposed demonstrations to create reusable action priors, yielding up to 45 percentage point gains in real-world VLA success rates when mixed with standard training.

GIVE: Grounding Human Gestures in Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 5.0

GIVE improves pre-trained VLA models for robotic tasks by incorporating gestures via visual skeleton overlays and semantic descriptions, yielding 40% higher object recognition accuracy and 80% higher task success in r...