hub Mixed citations

Unified vision-language-action model.arXiv preprint arXiv:2506.19850

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang · 2025 · arXiv 2506.19850

Mixed citation behavior. Most common role is background (60%).

20 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 20 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 2 dataset 1

citation-polarity summary

background 6 baseline 2 unclear 1 use dataset 1

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

cs.RO · 2026-05-14 · conditional · novelty 6.0

CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-05-13 · conditional · novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

cs.RO · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

cs.RO · 2025-12-10 · unverdicted · novelty 6.0

HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

cs.RO · 2025-10-15 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

cs.RO · 2025-09-08 · unverdicted · novelty 6.0

F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

A Survey on Vision-Language-Action Models for Embodied AI

cs.RO · 2024-05-23 · unverdicted · novelty 6.0

This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

cs.RO · 2026-05-19 · unverdicted · novelty 5.0

RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.

EponaV2: Driving World Model with Comprehensive Future Reasoning

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

Causal World Modeling for Robot Control

cs.CV · 2026-01-29 · unverdicted · novelty 5.0

LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

cs.LG · 2025-11-24 · unverdicted · novelty 5.0

AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

cs.RO · 2025-08-18 · unverdicted · novelty 5.0

This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.

RLDX-1 Technical Report

cs.RO · 2026-05-05 · unverdicted · novelty 4.0 · 2 refs

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

Test-Time Training for Visual Foresight Vision-Language-Action Models

cs.CV · 2026-05-06

citing papers explorer

Showing 20 of 20 citing papers.

Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 21
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 79
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation cs.AI · 2026-05-01 · unverdicted · none · ref 24
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies cs.CV · 2026-04-27 · unverdicted · none · ref 43
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 56
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
ChainFlow-VLA: Causal Flow Planning with Vision-Language Models cs.CV · 2026-05-22 · unverdicted · none · ref 17
ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.
CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning cs.RO · 2026-05-14 · conditional · none · ref 33
CLOVER is a closed-loop generator-scorer framework that expands proposal coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 32
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising cs.RO · 2026-04-29 · unverdicted · none · ref 39 · 2 links
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models cs.RO · 2025-12-10 · unverdicted · none · ref 38
HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 39
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions cs.RO · 2025-09-08 · unverdicted · none · ref 26
F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
A Survey on Vision-Language-Action Models for Embodied AI cs.RO · 2024-05-23 · unverdicted · none · ref 129
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models cs.RO · 2026-05-19 · unverdicted · none · ref 26
RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 75
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
Causal World Modeling for Robot Control cs.CV · 2026-01-29 · unverdicted · none · ref 78
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention cs.LG · 2025-11-24 · unverdicted · none · ref 50
AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey cs.RO · 2025-08-18 · unverdicted · none · ref 107
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
RLDX-1 Technical Report cs.RO · 2026-05-05 · unverdicted · none · ref 105 · 2 links
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
Test-Time Training for Visual Foresight Vision-Language-Action Models cs.CV · 2026-05-06 · unreviewed · ref 3

Unified vision-language-action model.arXiv preprint arXiv:2506.19850

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer