
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang · 2025 · cs.RO · arXiv 2508.05635

Mixed citation behavior. Most common role is background (64%).

49 Pith papers citing it

Background: 64% of classified citations


abstract

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.


citation-role summary

background 7 · baseline 2 · method 2
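The role counts above appear to be the basis for the 64% background figure. A minimal sketch of that arithmetic, assuming the percentage is taken over the 11 classified citations rather than all 49 citing papers (the function name `role_shares` is hypothetical, not part of the page's tooling):

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 7, "baseline": 2, "method": 2}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percentage."""
    total = sum(counts.values())  # 11 classified citations (assumed denominator)
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(shares)
```

Under that assumption, 7 of 11 rounds to 64%, which matches the figure reported for the background role.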

citation-polarity summary

background 7 · baseline 2 · use method 2

representative citing papers

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

JailWAM: Jailbreaking World Action Models in Robot Control

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.

Unified Motion-Action Modeling for Heterogeneous Robot Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.

TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation

cs.RO · 2026-06-09 · unverdicted · novelty 6.0

TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

cs.RO · 2026-06-07 · unverdicted · novelty 6.0

Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.

FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

cs.RO · 2026-06-07 · unverdicted · novelty 6.0

FAWAM integrates force signals into perception, prediction, and closed-loop correction, raising success rates 36% over vision baselines in contact-rich manipulation tasks.

WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

WorldFly integrates a world model into a VLA framework via dual-branch coupled flow matching to jointly generate future videos and actions, outperforming baselines on an urban canyon traversal benchmark especially in unseen environments.

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

cs.RO · 2026-06-03 · unverdicted · novelty 6.0

OSCAR finetunes Cosmos-Predict2.5-2B on a deduplicated multi-embodiment robotics dataset with kinematic skeleton conditioning, claiming better action following and significant correlation between virtual and real robot policy evaluations.

MotuBrain: An Advanced World Action Model for Robot Control

cs.RO · 2026-04-30 · unverdicted · novelty 6.0

MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

cs.RO · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

Grounded World Model for Semantically Generalizable Planning

cs.RO · 2026-04-13 · conditional · novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models

cs.RO · 2026-04-13 · unverdicted · novelty 6.0

WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

cs.CV · 2026-03-17 · unverdicted · novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

cs.CV · 2026-03-13 · unverdicted · novelty 6.0

A dual-tower 4D embodied world model called RoboStereo reduces geometric hallucinations and delivers over 97% relative improvement on manipulation tasks via test-time augmentation, imitative learning, and open exploration.

World Action Models are Zero-shot Policies

cs.RO · 2026-02-17 · unverdicted · novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.

citing papers explorer

Showing 34 of 34 citing papers after filters.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation · cs.RO · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation · cs.RO · 2026-05-07 · unverdicted · none · ref 42 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Being-H0.7: A Latent World-Action Model from Egocentric Videos · cs.RO · 2026-04-30 · unverdicted · none · ref 55 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis · cs.RO · 2026-04-23 · unverdicted · none · ref 35 · internal anchor
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning · cs.RO · 2026-04-21 · unverdicted · none · ref 23 · internal anchor
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
JailWAM: Jailbreaking World Action Models in Robot Control · cs.RO · 2026-04-07 · unverdicted · none · ref 12 · internal anchor
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation · cs.RO · 2026-07-01 · unverdicted · none · ref 50 · internal anchor
RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.
Unified Motion-Action Modeling for Heterogeneous Robot Learning · cs.RO · 2026-06-15 · unverdicted · none · ref 34 · internal anchor
UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.
TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation · cs.RO · 2026-06-09 · unverdicted · none · ref 25 · internal anchor
TacForeSight trains a force-conditioned tactile world model to predict latent dynamics and uses those predictions as anticipatory priors inside a visuo-tactile policy for real-time contact-rich manipulation.
Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation · cs.RO · 2026-06-07 · unverdicted · none · ref 39 · internal anchor
Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.
FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation · cs.RO · 2026-06-07 · unverdicted · none · ref 13 · internal anchor
FAWAM integrates force signals into perception, prediction, and closed-loop correction, raising success rates 36% over vision baselines in contact-rich manipulation tasks.
OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics · cs.RO · 2026-06-03 · unverdicted · none · ref 5 · internal anchor
OSCAR finetunes Cosmos-Predict2.5-2B on a deduplicated multi-embodiment robotics dataset with kinematic skeleton conditioning, claiming better action following and significant correlation between virtual and real robot policy evaluations.
MotuBrain: An Advanced World Action Model for Robot Control · cs.RO · 2026-04-30 · unverdicted · none · ref 25 · internal anchor
MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising · cs.RO · 2026-04-29 · unverdicted · none · ref 22 · 2 links · internal anchor
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training · cs.RO · 2026-04-23 · unverdicted · none · ref 36 · internal anchor
Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
Grounded World Model for Semantically Generalizable Planning · cs.RO · 2026-04-13 · conditional · none · ref 42 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models · cs.RO · 2026-04-13 · unverdicted · none · ref 18 · internal anchor
WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.
World Action Models are Zero-shot Policies · cs.RO · 2026-02-17 · unverdicted · none · ref 65 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
RISE: Self-Improving Robot Policy with Compositional World Model · cs.RO · 2026-02-11 · unverdicted · none · ref 59 · internal anchor
RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy · cs.RO · 2025-10-15 · unverdicted · none · ref 22 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
Ctrl-World: A Controllable Generative World Model for Robot Manipulation · cs.RO · 2025-10-11 · unverdicted · none · ref 31 · internal anchor
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
A Survey on Vision-Language-Action Models for Embodied AI · cs.RO · 2024-05-23 · unverdicted · none · ref 139 · internal anchor
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation · cs.RO · 2026-06-30 · unverdicted · none · ref 25 · 2 links · internal anchor
DVG-WM disentangles dynamics learning from visual synthesis via flow matching and latent degradation to deliver faster, higher-quality video predictions for robotic manipulation.
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation · cs.RO · 2026-06-16 · unverdicted · none · ref 61 · internal anchor
PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
MemoryVAM: Integrating Memory into Video Action Model for Robot Manipulation · cs.RO · 2026-06-13 · unverdicted · none · ref 7 · internal anchor
MemoryVAM integrates a Perceiver-based Recap Compressor and Cue Gate into video action models, raising success rates on long-horizon manipulation from 5% to 42.5% on LIBERO-Mem and 75-80% on real-robot counting, spatial recall, and tracking tasks.
$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation · cs.RO · 2026-05-31 · unverdicted · none · ref 29 · internal anchor
A shared video diffusion backbone jointly predicts future latents and continuous actions while also rolling out candidate actions to predict dense task-progress scores, trained on 27,300 hours of mixed robot and human data.
Dynamic Resilient Spatio-Semantic Memory with Hybrid Localization for Mobile Manipulation · cs.RO · 2026-05-30 · unverdicted · none · ref 32 · internal anchor
DREAM is a mobile manipulation system that constructs online spatio-semantic voxel memory with redundancy-aware pruning and hybrid language-vision localization, reporting higher long-horizon success rates than DynaMem in dynamic lab scenes.
World Models for Robotic Manipulation: A Survey · cs.RO · 2026-05-27 · accept · none · ref 14 · internal anchor
Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.
Key-Gram: Extensible World Knowledge for Embodied Manipulation · cs.RO · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation benchmarks.
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform · cs.RO · 2026-05-18 · unverdicted · none · ref 32 · internal anchor
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems · cs.RO · 2026-04-16 · unverdicted · none · ref 20 · internal anchor
The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
WALL-WM: Carving World Action Modeling at the Event Joints · cs.RO · 2026-06-01 · unverdicted · none · ref 47 · internal anchor
WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation · cs.RO · 2026-05-26 · unverdicted · none · ref 15 · internal anchor
GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.
World Action Models: A Survey · cs.RO · 2026-06-18 · unverdicted · none · ref 102 · internal anchor
A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.
