PointVLA: Injecting the 3D World into Vision-Language-Action Models. arXiv preprint arXiv:2503.07511, 2025a

2025 · arXiv 2503.07511

Canonical reference. 83% of citing Pith papers cite this work as background.

14 Pith papers citing it

Background 83% of classified citations

citation-role summary

background 5 · baseline 1

citation-polarity summary

background 5 · baseline 1

representative citing papers

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

Block-wise Adaptive Caching for Accelerating Diffusion Policy

cs.AI · 2025-06-16 · unverdicted · novelty 6.0

BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 5.0

GeoHAT reports a 79.3% mean success rate on the ManiSkill-HAB mobile manipulation benchmark (23.7% above the strongest baseline) by using gated geometric token injection and a hybrid whole-body action decoder.

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

cs.RO · 2026-06-10 · unverdicted · novelty 5.0

World Pilot augments VLA policies with world-action priors through latent and action steering pathways, reporting 84.7% success on LIBERO-Plus zero-shot OOD and top real-robot results across four tasks.

OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

cs.RO · 2026-05-25 · unverdicted · novelty 5.0

OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.

X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

cs.RO · 2026-04-22 · unverdicted · novelty 5.0

PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.

R3D: Revisiting 3D Policy Learning

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

cs.RO · 2025-08-18 · unverdicted · novelty 5.0

This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

cs.CV · 2026-05-21 · unverdicted · novelty 4.0 · 2 refs

BehaviorVLA learns long-horizon behavioral representations via a causal Mamba encoder and a phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, and 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% of the data in sim-to-real transfer.

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

The paper quantifies the geometric gap in current VLAs via linear probing and compares three architectures for injecting geometry from GFMs while analyzing impacts of data, cameras, and reconstruction quality.

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

cs.CV · 2026-04-06

Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

cs.RO · 2025-12-29

citing papers explorer

Showing 2 of 2 citing papers after filters.

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes · cs.CV · 2026-04-06 · unreviewed · ref 31
Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation · cs.RO · 2025-12-29 · unreviewed · ref 18
