pith. machine review for the scientific record.

arxiv: 2601.21998 · v2 · submitted 2026-01-29 · 💻 cs.CV · cs.RO


Causal World Modeling for Robot Control

Fei Han, Lin Li, Mingrui Yu, Nan Xue, Qihang Zhang, Ruilin Wang, Shuai Yang, Xing Zhu, Yiming Luo, Yinghao Xu, Yujun Shen, Zelin Gao

Pith reviewed 2026-05-12 13:49 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords: video world modeling · robot learning · autoregressive diffusion · causal control · long-horizon manipulation · shared latent space · mixture of transformers

The pith

Video world modeling establishes an independent foundation for robot learning by enabling causal prediction of visual outcomes from actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that video world models can stand alone as a basis for robot learning, separate from or alongside vision-language pre-training. By training the model to predict future frames while also learning to select actions, the approach aims to capture the cause-and-effect links between what a robot does and what it sees. A sympathetic reader would care if this leads to robots that handle extended tasks with less training data and better adaptation to new settings.

Core claim

This work claims that video world modeling provides a fresh foundation for robot learning through LingBot-VA, an autoregressive diffusion framework that jointly learns frame prediction and policy execution via a shared latent space driven by a Mixture-of-Transformers architecture, closed-loop rollouts with ground-truth observations, and asynchronous inference for efficient control.

What carries the argument

LingBot-VA is the autoregressive diffusion framework that integrates vision and action tokens in a shared latent space via a Mixture-of-Transformers architecture, enabling simultaneous future frame prediction and action selection.
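
The abstract only names these components, so as a rough sketch of how a shared vision-action latent space with modality-specific experts might look, here is a minimal Mixture-of-Transformers-style block in PyTorch; the dimensions, module layout, and two-expert split are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Mixture-of-Transformers (MoT) block over a shared
# vision-action token sequence. Sizes, names, and the two-expert split are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Self-attention is shared: vision and action tokens attend to each
        # other, which is how the two modalities mix in one latent space.
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Feed-forward "experts" are modality-specific (the MoT part):
        # separate parameters process vision tokens and action tokens.
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn_vision = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_action = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens, is_action):
        # tokens: (batch, seq, dim); is_action: (seq,) bool mask for action tokens.
        h = self.norm_attn(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + attn_out
        h = self.norm_ffn(tokens)
        # Route each token through its modality's expert (computed densely here
        # for brevity; a real implementation would gather/scatter by modality).
        out = torch.where(is_action[None, :, None],
                          self.ffn_action(h), self.ffn_vision(h))
        return tokens + out

# Toy usage: 16 vision tokens followed by 4 action tokens in one sequence.
block = MoTBlock()
tokens = torch.randn(2, 20, 512)
is_action = torch.tensor([False] * 16 + [True] * 4)
print(block(tokens, is_action).shape)  # torch.Size([2, 20, 512])
```

The shared attention is what lets imagined visual dynamics and candidate actions condition each other in one forward pass; the actual model would additionally apply an autoregressive (causal) mask and train the vision stream with the diffusion objective described in the abstract.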

If this is right

  • It supports long-horizon manipulation tasks effectively.
  • It achieves data efficiency in post-training.
  • It generalizes well to novel configurations.
  • It provides the ability to imagine the near future by capturing the causality between actions and visual dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Video-based world models might serve as a primary pre-training method for robots, reducing reliance on language data.
  • The asynchronous pipeline could enable faster real-world deployment by running prediction and execution in parallel (see the sketch after this list).
  • Testing this in more diverse physical environments might show how well the causal understanding transfers beyond the evaluated scenarios.
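
On the second point, the abstract gives no implementation detail for the asynchronous pipeline; one plausible way to get the overlap it describes, predicting the next action chunk while the current one is still executing, is a small producer-consumer loop like the sketch below, where the model call and the robot call are timed stubs rather than the paper's API.

```python
# Hypothetical sketch of asynchronous inference: prediction of the next action
# chunk overlaps with motor execution of the current one. The model and robot
# calls are timed stubs, not LingBot-VA's interface.
import queue
import threading
import time

def predict_action_chunk(observation):
    time.sleep(0.05)                                   # stand-in for model latency
    return [observation + i for i in range(8)]         # dummy chunk of 8 commands

def execute_on_robot(chunk):
    time.sleep(0.10)                                   # stand-in for motor execution

def control_loop(num_chunks=20):
    pending = queue.Queue(maxsize=1)                   # bounded look-ahead of one chunk

    def predictor():
        for step in range(num_chunks):
            observation = float(step)                  # a real system would read sensors here
            pending.put(predict_action_chunk(observation))

    threading.Thread(target=predictor, daemon=True).start()
    for _ in range(num_chunks):
        execute_on_robot(pending.get())                # executes while the next chunk is predicted

if __name__ == "__main__":
    start = time.time()
    control_loop()
    # Roughly 2.1 s instead of the ~3.0 s a strictly sequential loop would take.
    print(f"elapsed: {time.time() - start:.2f} s")
```

The toy timing only illustrates that the control rate becomes bounded by the slower of the two stages rather than their sum, which is the benefit the asynchronous design is credited with.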

Load-bearing premise

That the combination of shared latent space, closed-loop rollout with ground-truth observations, and asynchronous inference is sufficient to establish effective causal understanding and robot control.

What would settle it

Showing that the model performs no better than standard methods on long-horizon manipulation once one of the three designs is removed, for example the closed-loop ground-truth feedback, would challenge the central claim.
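
Concretely, the closed-loop rollout the abstract describes and the open-loop variant the referee asks for differ only in which observation is fed back at each step. A schematic sketch, with env, policy, and world_model as hypothetical callables standing in for the benchmark environment and LingBot-VA's action and frame-prediction heads:

```python
# Schematic contrast between closed-loop rollout (ground-truth observations fed
# back, as in the paper) and open-loop rollout (the model's own predictions fed
# back, as proposed below). env, policy, and world_model are hypothetical
# callables, not the paper's interface.
def rollout(env, policy, world_model, horizon, open_loop=False):
    obs = env.reset()
    model_obs = obs                                      # what the model conditions on
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(model_obs)
        next_obs, reward, done = env.step(action)
        predicted_obs = world_model(model_obs, action)   # the "imagined" next frame
        # Closed-loop: condition the next step on the real observation.
        # Open-loop: condition on the model's own prediction, so dynamics errors
        # compound instead of being corrected by the sensor.
        model_obs = predicted_obs if open_loop else next_obs
        total_reward += reward
        if done:
            break
    return total_reward
```

If long-horizon success collapses with open_loop=True (or under injected sensor noise), the gains reported under ground-truth feedback say little about learned causal dynamics; if it largely survives, the central claim stands on firmer ground.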

read the original abstract

This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that video world modeling, alongside vision-language pre-training, provides a fresh foundation for robot learning. It introduces LingBot-VA, an autoregressive diffusion framework that jointly learns frame prediction and policy execution via three designs: (1) a shared latent space integrating vision and action tokens using a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism that acquires ongoing environmental feedback via ground-truth observations, and (3) an asynchronous inference pipeline that parallelizes action prediction and motor execution. Evaluations on simulation benchmarks and real-world scenarios are said to demonstrate promise for long-horizon manipulation, data efficiency in post-training, and generalizability to novel configurations, with code and model released publicly.

Significance. If the results hold, the work could strengthen video-based world models as an independent route to robot control, particularly for data-efficient long-horizon tasks. The public release of code and model is a clear strength that supports reproducibility. The three-design integration is presented as enabling causal understanding from visual dynamics, but the abstract supplies no equations, training details, ablations, or quantitative metrics, so the magnitude of any advance over prior world-model or diffusion-policy approaches remains difficult to assess.

major comments (1)
  1. [Abstract] Abstract (description of design (2)): The closed-loop rollout feeds ground-truth observations back into the model at each step. This setup risks masking deficiencies in the learned dynamics, because the model never has to sustain an internal causal state across multiple steps without perfect external feedback. The central claim that the three designs together produce effective causal understanding for long-horizon control therefore requires additional experiments (e.g., open-loop rollouts using only predicted observations, or rollouts under sensor noise) to be load-bearing; without them the reported gains in manipulation and generalizability could be artifacts of the GT feedback assumption rather than emergent causal modeling.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'significant promise' is used without any accompanying metrics, baseline comparisons, or ablation results, which makes it impossible to judge the strength of the empirical claims from the provided text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. The comment on the closed-loop rollout raises a substantive point about validating causal modeling, which we address directly below.

read point-by-point responses
  1. Referee: [Abstract] Abstract (description of design (2)): The closed-loop rollout feeds ground-truth observations back into the model at each step. This setup risks masking deficiencies in the learned dynamics, because the model never has to sustain an internal causal state across multiple steps without perfect external feedback. The central claim that the three designs together produce effective causal understanding for long-horizon control therefore requires additional experiments (e.g., open-loop rollouts using only predicted observations, or rollouts under sensor noise) to be load-bearing; without them the reported gains in manipulation and generalizability could be artifacts of the GT feedback assumption rather than emergent causal modeling.

    Authors: We agree that closed-loop evaluation with ground-truth feedback alone does not fully isolate the model's learned causal dynamics, as the policy can rely on perfect external observations rather than its internal predictions. Our design choice reflects standard robot control practice, where sensory feedback is typically available, and the joint training of frame prediction and policy is intended to encourage the model to internalize action-visual causality. Nevertheless, to make the causal claims load-bearing, we will add open-loop rollout experiments (using only the model's own predicted observations over multiple steps) and evaluations under simulated sensor noise in the revised manuscript. These will be reported alongside the existing closed-loop results to quantify the contribution of the learned dynamics. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical model introduction with no derivations or self-referential reductions

full rationale

The paper presents LingBot-VA as an autoregressive diffusion framework incorporating three architectural choices (shared MoT latent space, closed-loop GT rollout, asynchronous inference) and reports empirical results on simulation and real-world tasks. No equations, first-principles derivations, or parameter-fitting steps are described that could reduce a claimed prediction back to the inputs by construction. The central claims rest on experimental evaluation rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The closed-loop design is an explicit modeling choice whose limitations are external to circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility; the central claim rests on the unverified premise that video frame prediction captures actionable causality and that the listed designs suffice for closed-loop control.

axioms (1)
  • domain assumption Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics.
    Stated directly as the intuitive foundation in the abstract.

pith-pipeline@v0.9.0 · 5502 in / 1212 out tokens · 49356 ms · 2026-05-12T13:49:47.969450+00:00 · methodology

discussion (0)



Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  4. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  5. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  6. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  7. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  8. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  9. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  10. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  11. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  12. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  13. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  14. ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

    cs.RO 2026-04 unverdicted novelty 6.0

    ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.

  15. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

  16. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  17. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  18. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  19. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  20. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  21. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  22. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  23. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  24. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  25. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  26. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  27. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  28. AttenA+: Rectifying Action Inequality in Robotic Foundation Models

    cs.RO 2026-05 unverdicted novelty 5.0

    AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.

  29. CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

  30. STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

  31. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  32. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  33. ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

    cs.CV 2026-04 unverdicted novelty 4.0

    ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...

  34. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 31 Pith papers · 11 internal anchors
