hub

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control

Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, Shuo Yang · 2025 · arXiv 2603.10448

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

cs.RO · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

cs.RO · 2026-06-11 · unverdicted · novelty 5.0

GenHOI reconstructs robot-object scenes, generates task videos from language and first-frame images, extracts contact constraints, optimizes reference trajectories, and executes them via closed-loop control for zero-shot humanoid-object interaction.

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

cs.RO · 2026-06-10 · unverdicted · novelty 5.0

World Pilot augments VLA policies with world-action priors through latent and action steering pathways, reporting 84.7% success on LIBERO-Plus zero-shot OOD and top real-robot results across four tasks.

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

cs.CV · 2026-06-10 · unverdicted · novelty 5.0

AGRA is an Action-Grounded Representation Alignment objective that aligns intermediate video diffusion features with semantic representations to make world action model hidden states more useful for low-level robot control, improving localization, affordance, and robustness.

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

cs.RO · 2026-06-04 · unverdicted · novelty 5.0

WLA models use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states, reporting SOTA success rates on RoboTwin2.0 and RMBench.

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

cs.RO · 2026-06-04 · unverdicted · novelty 5.0

Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · novelty 5.0

SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

cs.CV · 2026-06-02 · unverdicted · novelty 4.0

Donk is a unified video-action denoising model that generates dexterous hand trajectories and videos under language, image, and state conditioning while also serving as a text-conditioned data engine.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

World Model for Robot Learning: A Comprehensive Survey

cs.RO · 2026-04-30 · unverdicted · novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.

citing papers explorer

Showing 11 of 11 citing papers after filters.

Colosseum V2: Benchmarking Generalization for Vision Language Action Models cs.RO · 2026-05-26 · unverdicted · none · ref 48
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 36
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 21
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising cs.RO · 2026-04-29 · unverdicted · none · ref 28 · 2 links
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training cs.RO · 2026-06-11 · unverdicted · none · ref 29
GenHOI reconstructs robot-object scenes, generates task videos from language and first-frame images, extracts contact constraints, optimizes reference trajectories, and executes them via closed-loop control for zero-shot humanoid-object interaction.
World Pilot: Steering Vision-Language-Action Models with World-Action Priors cs.RO · 2026-06-10 · unverdicted · none · ref 57
World Pilot augments VLA policies with world-action priors through latent and action steering pathways, reporting 84.7% success on LIBERO-Plus zero-shot OOD and top real-robot results across four tasks.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis cs.RO · 2026-06-04 · unverdicted · none · ref 38
WLA models use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states, reporting SOTA success rates on RoboTwin2.0 and RMBench.
Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning cs.RO · 2026-06-04 · unverdicted · none · ref 58
Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.
SANTS: A State-Adaptive Scheduler for World Action Models cs.RO · 2026-05-27 · unverdicted · none · ref 10
SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 113
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
World Model for Robot Learning: A Comprehensive Survey cs.RO · 2026-04-30 · unverdicted · none · ref 39
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer