PointVLA: Injecting the 3D World into Vision-Language-Action Models

2026 · arXiv 2026.365330

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Mapping point clouds to Fourier features improves high-precision imitation learning policies on RoboCasa, ManiSkill3, and real-robot tasks compared with Cartesian inputs.

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

cs.RO · 2026-06-10 · unverdicted · novelty 5.0

Sparse2Act pretrains sparse 3D encoders via masked action-alignment supervision, yielding reusable representations that reach 86.9% success on LIBERO-10 and enable cross-domain transfer.

citing papers explorer

Showing 3 of 3 citing papers.

Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization · cs.RO · 2026-06-26 · unverdicted · none · ref 29
Fourier Features Let Agents Learn High Precision Policies with Imitation Learning · cs.LG · 2026-06-10 · unverdicted · none · ref 26
Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation · cs.RO · 2026-06-10 · unverdicted · none · ref 45
