hub Mixed citations

3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du · 2024 · cs.CV · arXiv 2403.09631

Mixed citation behavior. Most common role is background (60%).

26 Pith papers citing it

Background 60% of classified citations

abstract

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 3 · unclear: 1 · use method: 1

representative citing papers

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

Action Images: End-to-End Policy Learning via Multiview Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Affordance Agent Harness: Verification-Gated Skill Orchestration

cs.RO · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

cs.RO · 2026-03-26 · unverdicted · novelty 6.0

ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

cs.RO · 2025-03-18 · unverdicted · novelty 6.0

GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

cs.RO · 2025-02-27 · accept · novelty 6.0

OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

cs.RO · 2025-02-09 · unverdicted · novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

FAST: Efficient Action Tokenization for Vision-Language-Action Models

cs.RO · 2025-01-16 · unverdicted · novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.

OpenVLA: An Open-Source Vision-Language-Action Model

cs.RO · 2024-06-13 · unverdicted · novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

cs.RO · 2024-03-19 · accept · novelty 6.0

DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

Nautilus: From One Prompt to Plug-and-Play Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

cs.RO · 2026-04-20 · unverdicted · novelty 5.0

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

R3D: Revisiting 3D Policy Learning

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

cs.RO · 2026-04-07 · unverdicted · novelty 5.0

CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

cs.CV · 2025-07-03 · unverdicted · novelty 5.0

MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.

WorldVLA: Towards Autoregressive Action World Model

cs.RO · 2025-06-26 · unverdicted · novelty 5.0

WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

Towards Robotic Dexterous Hand Intelligence: A Survey

cs.RO · 2026-05-13 · unverdicted · novelty 4.0

A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

cs.RO · 2026-04-24 · unverdicted · novelty 3.0

A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.

citing papers explorer

Showing 26 of 26 citing papers.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 46 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 12 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 69 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 90 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 87 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning cs.RO · 2026-04-30 · unverdicted · none · ref 51 · 2 links · internal anchor
LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model cs.RO · 2026-04-24 · unverdicted · none · ref 50 · internal anchor
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
ST-π: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 42 · internal anchor
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making cs.RO · 2026-03-26 · unverdicted · none · ref 21 · internal anchor
ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots cs.RO · 2025-03-18 · unverdicted · none · ref 103 · internal anchor
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success cs.RO · 2025-02-27 · accept · none · ref 59 · internal anchor
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control cs.RO · 2025-02-09 · unverdicted · none · ref 19 · internal anchor
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 73 · internal anchor
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 75 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset cs.RO · 2024-03-19 · accept · none · ref 67 · internal anchor
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
Nautilus: From One Prompt to Plug-and-Play Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 55 · internal anchor
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning cs.RO · 2026-04-20 · unverdicted · none · ref 48 · internal anchor
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
R3D: Revisiting 3D Policy Learning cs.CV · 2026-04-16 · unverdicted · none · ref 52 · internal anchor
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment cs.RO · 2026-04-07 · unverdicted · none · ref 67 · internal anchor
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details cs.CV · 2025-07-03 · unverdicted · none · ref 80 · internal anchor
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
WorldVLA: Towards Autoregressive Action World Model cs.RO · 2025-06-26 · unverdicted · none · ref 32 · internal anchor
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
Towards Robotic Dexterous Hand Intelligence: A Survey cs.RO · 2026-05-13 · unverdicted · none · ref 112 · internal anchor
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 270 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines cs.RO · 2026-04-24 · unverdicted · none · ref 31 · internal anchor
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems eess.SY · 2026-04-03 · unverdicted · none · ref 183 · internal anchor
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · unreviewed · ref 50 · 2 links · internal anchor
