hub

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li · 2023 · cs.RO · arXiv 2312.13139

25 Pith papers cite this work. Polarity classification is still indexing.

25 Pith papers citing it

open full Pith review browse 25 citing papers arXiv PDF

abstract

Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. In the setting of zero-shot unseen scene generalization, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline methods and shows strong potentials in generalization to unseen scenes and objects. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Project page: https://GR1-Manipulation.github.io

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

cs.RO · 2026-04-09 · unverdicted · novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

GazeVLA: Learning Human Intention for Robotic Manipulation

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

cs.RO · 2026-04-10 · unverdicted · novelty 6.0

VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

cs.RO · 2026-04-03 · conditional · novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

cs.AI · 2025-06-11 · unverdicted · novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

cs.RO · 2025-04-03 · unverdicted · novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generalizable policies than imitation learning alone.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

cs.RO · 2025-03-18 · unverdicted · novelty 6.0

GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

cs.CV · 2024-12-19 · unverdicted · novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

cs.RO · 2024-11-29 · unverdicted · novelty 6.0

CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

cs.RO · 2024-10-08 · unverdicted · novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

cs.RO · 2026-04-20 · unverdicted · novelty 5.0

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.

M100: An Orchestrated Dataflow Architecture Powering General AI Computing

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.

R3D: Revisiting 3D Policy Learning

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

cs.CV · 2026-04-15 · unverdicted · novelty 5.0 · 2 refs

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

cs.RO · 2026-04-13 · unverdicted · novelty 5.0

Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

citing papers explorer

Showing 25 of 25 citing papers.

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 69 · internal anchor
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models cs.AI · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 89 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 49 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 49 · internal anchor
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 51 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 50 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
GazeVLA: Learning Human Intention for Robotic Manipulation cs.RO · 2026-04-24 · unverdicted · none · ref 66 · internal anchor
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 6 · internal anchor
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Human Cognition in Machines: A Unified Perspective of World Models cs.RO · 2026-04-17 · unverdicted · none · ref 185 · internal anchor
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO · 2026-04-10 · unverdicted · none · ref 72 · internal anchor
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model cs.RO · 2026-04-03 · conditional · none · ref 9 · internal anchor
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 53 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets cs.RO · 2025-04-03 · unverdicted · none · ref 46 · internal anchor
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generalizable policies than imitation learning alone.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots cs.RO · 2025-03-18 · unverdicted · none · ref 92 · internal anchor
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations cs.CV · 2024-12-19 · unverdicted · none · ref 126 · internal anchor
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation cs.RO · 2024-11-29 · unverdicted · none · ref 68 · internal anchor
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation cs.RO · 2024-10-08 · unverdicted · none · ref 5 · internal anchor
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement cs.RO · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.
M100: An Orchestrated Dataflow Architecture Powering General AI Computing cs.LG · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
R3D: Revisiting 3D Policy Learning cs.CV · 2026-04-16 · unverdicted · none · ref 41 · internal anchor
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 39 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation cs.RO · 2026-04-13 · unverdicted · none · ref 50 · internal anchor
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
WorldVLA: Towards Autoregressive Action World Model cs.RO · 2025-06-26 · unverdicted · none · ref 29 · internal anchor
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer