Causal World Modeling for Robot Control
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-12 13:49 UTC · model grok-4.3
The pith
Video world modeling establishes an independent foundation for robot learning by enabling causal prediction of visual outcomes from actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This work claims that video world modeling provides a fresh foundation for robot learning through an autoregressive diffusion framework, LingBot-VA, which jointly learns frame prediction and policy execution using a shared latent space driven by Mixture-of-Transformers, closed-loop rollouts with ground-truth observations, and asynchronous inference to support efficient control.
What carries the argument
LingBot-VA is the autoregressive diffusion framework that integrates vision and action tokens in a shared latent space via a Mixture-of-Transformers architecture, enabling simultaneous future frame prediction and action selection.
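The abstract gives no implementation detail, so the following numpy sketch shows only the general Mixture-of-Transformers pattern assumed here: attention is computed jointly over vision and action tokens in one shared latent space, while each token is post-processed by the feed-forward weights of its own modality. All shapes, names, and the tanh feed-forward are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mot_block(tokens, modality, Wq, Wk, Wv, ffn_by_modality):
    """One Mixture-of-Transformers-style block (sketch): self-attention is
    shared across the unified sequence, feed-forward weights are routed
    per modality, so vision and action interact in one latent space."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    mixed = attn @ v                      # shared latent interaction
    out = np.empty_like(mixed)
    for m, W in ffn_by_modality.items():  # modality-specific experts
        idx = modality == m
        out[idx] = np.tanh(mixed[idx] @ W)
    return out

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((6, d))          # 4 vision + 2 action tokens
modality = np.array(["v"] * 4 + ["a"] * 2)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
ffn = {"v": rng.standard_normal((d, d)), "a": rng.standard_normal((d, d))}
out = mot_block(tokens, modality, Wq, Wk, Wv, ffn)
```

The point of the routing loop is that both modalities see the same attention context, which is what lets the model predict frames and actions from one representation.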
If this is right
- It supports long-horizon manipulation tasks effectively.
- It achieves data efficiency in post-training.
- It generalizes well to novel configurations.
- It gives the policy the ability to imagine the near future by capturing the causal link between actions and visual dynamics.
Where Pith is reading between the lines
- Video-based world models might serve as a primary pre-training method for robots, reducing reliance on language data.
- The asynchronous pipeline could enable faster real-world deployment by running prediction and execution in parallel.
- Testing this in more diverse physical environments might show how well the causal understanding transfers beyond the evaluated scenarios.
Load-bearing premise
That the combination of shared latent space, closed-loop rollout with ground-truth observations, and asynchronous inference is sufficient to establish effective causal understanding and robot control.
What would settle it
An ablation that removes one of the three designs (for example, the closed-loop feedback) and shows the model performing no better than standard methods on long-horizon manipulation would challenge the central claim that the combination is what does the work.
Original abstract
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.
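The abstract's third design, asynchronous inference, can be illustrated with a minimal producer-consumer sketch: while the robot executes the current action chunk, the model is already predicting the next one, so inference latency is hidden behind motor time. Function names, chunk contents, and the single-slot queue are invented for illustration; the paper's actual pipeline is not specified in the abstract.

```python
import queue
import threading

def async_control(predict, execute, horizon):
    """Sketch of an asynchronous inference pipeline: prediction runs in a
    background thread and hands finished action chunks to the executor."""
    chunks = queue.Queue(maxsize=1)

    def predictor():
        for t in range(horizon):
            chunks.put(predict(t))   # may be slow (e.g. diffusion sampling)

    threading.Thread(target=predictor, daemon=True).start()
    executed = []
    for _ in range(horizon):
        executed.extend(execute(chunks.get()))  # motor execution overlaps
    return executed

trace = async_control(
    predict=lambda t: [f"a{t}.{i}" for i in range(3)],  # dummy 3-step chunk
    execute=lambda c: c,
    horizon=2)
```

The blocking `Queue` keeps prediction at most one chunk ahead, which is the usual way such pipelines bound staleness between imagined and executed actions.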
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that video world modeling, alongside vision-language pre-training, provides a fresh foundation for robot learning. It introduces LingBot-VA, an autoregressive diffusion framework that jointly learns frame prediction and policy execution via three designs: (1) a shared latent space integrating vision and action tokens using a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism that acquires ongoing environmental feedback via ground-truth observations, and (3) an asynchronous inference pipeline that parallelizes action prediction and motor execution. Evaluations on simulation benchmarks and real-world scenarios are said to demonstrate promise for long-horizon manipulation, data efficiency in post-training, and generalizability to novel configurations, with code and model released publicly.
Significance. If the results hold, the work could strengthen video-based world models as an independent route to robot control, particularly for data-efficient long-horizon tasks. The public release of code and model is a clear strength that supports reproducibility. The three-design integration is presented as enabling causal understanding from visual dynamics, but the abstract supplies no equations, training details, ablations, or quantitative metrics, so the magnitude of any advance over prior world-model or diffusion-policy approaches remains difficult to assess.
major comments (1)
- [Abstract] Abstract (description of design (2)): The closed-loop rollout feeds ground-truth observations back into the model at each step. This setup risks masking deficiencies in the learned dynamics, because the model never has to sustain an internal causal state across multiple steps without perfect external feedback. The central claim that the three designs together produce effective causal understanding for long-horizon control therefore requires additional experiments (e.g., open-loop rollouts using only predicted observations, or rollouts under sensor noise) to be load-bearing; without them the reported gains in manipulation and generalizability could be artifacts of the GT feedback assumption rather than emergent causal modeling.
minor comments (1)
- [Abstract] Abstract: The phrase 'significant promise' is used without any accompanying metrics, baseline comparisons, or ablation results, which makes it impossible to judge the strength of the empirical claims from the provided text.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review of our manuscript. The comment on the closed-loop rollout raises a substantive point about validating causal modeling, which we address directly below.
Point-by-point responses
Referee: [Abstract] Abstract (description of design (2)): The closed-loop rollout feeds ground-truth observations back into the model at each step. This setup risks masking deficiencies in the learned dynamics, because the model never has to sustain an internal causal state across multiple steps without perfect external feedback. The central claim that the three designs together produce effective causal understanding for long-horizon control therefore requires additional experiments (e.g., open-loop rollouts using only predicted observations, or rollouts under sensor noise) to be load-bearing; without them the reported gains in manipulation and generalizability could be artifacts of the GT feedback assumption rather than emergent causal modeling.
Authors: We agree that closed-loop evaluation with ground-truth feedback alone does not fully isolate the model's learned causal dynamics, as the policy can rely on perfect external observations rather than its internal predictions. Our design choice reflects standard robot control practice, where sensory feedback is typically available, and the joint training of frame prediction and policy is intended to encourage the model to internalize action-visual causality. Nevertheless, to make the causal claims load-bearing, we will add open-loop rollout experiments (using only the model's own predicted observations over multiple steps) and evaluations under simulated sensor noise in the revised manuscript. These will be reported alongside the existing closed-loop results to quantify the contribution of the learned dynamics.
Revision: yes
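The disputed open-loop versus closed-loop distinction can be made concrete with a toy scalar rollout. Everything here (the biased dynamics model, the drift constant) is invented to illustrate why ground-truth feedback can mask compounding prediction error: with closed-loop feedback the one-step bias never accumulates, while in open loop it compounds.

```python
def rollout(dynamics, policy, env_step, o0, steps, closed_loop):
    """Toy contrast between closed-loop rollout (ground-truth observation
    fed back each step) and open-loop rollout (model runs on its own
    predictions). Names are illustrative, not from the paper."""
    o, traj = o0, []
    for _ in range(steps):
        a = policy(o)
        o_pred = dynamics(o, a)        # model's imagined next observation
        o_true = env_step(o, a)        # what the environment actually does
        traj.append(o_pred)
        o = o_true if closed_loop else o_pred  # GT feedback vs self-feedback
    return traj

# A biased dynamics model: drift compounds only in the open-loop setting.
dyn = lambda o, a: o + a + 0.1
env = lambda o, a: o + a
pol = lambda o: 1.0
closed = rollout(dyn, pol, env, 0.0, 5, closed_loop=True)
open_ = rollout(dyn, pol, env, 0.0, 5, closed_loop=False)
```

After five steps the true state is 5.0; the closed-loop trajectory stays within the one-step bias, while the open-loop error grows linearly, which is the effect the referee asks the added experiments to measure.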
Circularity Check
No significant circularity: empirical model introduction with no derivations or self-referential reductions
Full rationale
The paper presents LingBot-VA as an autoregressive diffusion framework incorporating three architectural choices (shared MoT latent space, closed-loop GT rollout, asynchronous inference) and reports empirical results on simulation and real-world tasks. No equations, first-principles derivations, or parameter-fitting steps are described that could reduce a claimed prediction back to the inputs by construction. The central claims rest on experimental evaluation rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The closed-loop design is an explicit modeling choice whose limitations are external to circularity analysis.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery and embed_strictMono_of_one_lt; match unclear): "autoregressive diffusion framework that learns frame prediction and policy execution simultaneously... shared latent space... Mixture-of-Transformers (MoT)... closed-loop rollout mechanism... asynchronous inference pipeline"
- IndisputableMonolith/Foundation/ArrowOfTime.lean (before_transitive / arrow_from_z; echoes): "causal attention masking over the unified sequence ensures that both predicted visual states and action commands are governed by preceding states"
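The causal-masking property quoted above can be sketched as a lower-triangular mask over an interleaved vision/action token sequence. The interleaving order shown is an illustrative assumption, not the paper's documented layout.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular attention mask: token i may attend only to
    positions j <= i, so predicted frames and action commands are
    conditioned exclusively on preceding states."""
    return np.tril(np.ones((n, n), dtype=bool))

# Hypothetical unified sequence: [v0, a0, v1, a1]. Action a0 (index 1)
# sees v0 and itself but never the future frame v1.
m = causal_mask(4)
```

Row i of the mask is exactly the set of tokens visible to token i, which is the "governed by preceding states" property the theorem link gestures at.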
Forward citations
Cited by 34 Pith papers
- Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics. Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation. MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models. ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation. OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Being-H0.7: A Latent World-Action Model from Egocentric Videos. Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models. Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
- JailWAM: Jailbreaking World Action Models in Robot Control. JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data. A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models. HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models. ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- When to Trust Imagination: Adaptive Action Execution for World Action Models. Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
- When to Trust Imagination: Adaptive Action Execution for World Action Models. A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
- MotuBrain: An Advanced World Action Model for Robot Control. MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
- ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control. ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
- SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation. SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising. X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising. X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation. MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
- Human Cognition in Machines: A Unified Perspective of World Models. The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
- Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models. Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
- AIM: Intent-Aware Unified World Action Modeling with Spatial Value Maps. AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
- DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks. CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
- VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis. VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
- Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model. MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
- Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms. Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
- Fast-WAM: Do World Action Models Need Test-time Future Imagination? Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
- World Action Models are Zero-shot Policies. DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
- AttenA+: Rectifying Action Inequality in Robotic Foundation Models. AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
- CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models. CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
- STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation. STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...
- Gated Memory Policy. GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
- World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems. The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
- ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents. ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models. OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
Reference graph
Works this paper leans on
-
[1]
1x world model: From video to action
1X Technologies. 1x world model: From video to action. https://www.1x.tech/discover/world-model-self-learning,
-
[2]
AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025
-
[4]
Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation
Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei 18 Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[5]
Motus: A Unified Latent Action World Model
Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025
work page internal anchor Pith review arXiv 2025
-
[6]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Jim Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
π0: A vision- language-action flow model for general robot control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, et al. π0: A vision- language-action flow model for general robot control. InRobotics: Science and Systems, 2025
work page 2025
-
[8]
Real-time execution of action chunking flow policies.arXiv preprint arXiv:2506.07339, 2025
Kevin Black, Manuel Y . Galliker, and Sergey Levine. Real-time execution of action chunking flow policies.arXiv preprint arXiv:2506.07339, 2025
-
[9]
Zero-shot robotic manipulation with pretrained image-editing diffusion models
Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. InInt. Conf. Learn. Represent., 2024
work page 2024
-
[10]
Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking. arXiv preprint arXiv:2512.05964, 2025
-
[11]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning ...
work page 2023
-
[12]
Rt-1: Robotics transformer for real-world control at scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, et al. Rt-1: Robotics transformer for real-world control at scale. InRobotics: Science and Systems, 2023
work page 2023
-
[13]
Genie: Generative interactive environments
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, et al. Genie: Generative interactive environments. InInt. Conf. Mach. Learn., 2024
work page 2024
-
[14]
Chatvla: Unified multimodal understanding and robot control with vision-language-action model, 2025
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2502.14420, 2025
-
[15]
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Moto: Latent motion token as the bridging language for robot manipulation
Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation. InInt. Conf. Comput. Vis., 2025
work page 2025
-
[17]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems, 2023
work page 2023
-
[18]
Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots
Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. InRobotics: Science and Systems, 2024
work page 2024
-
[19]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Tenenbaum, Dale Schuurmans, and Pieter Abbeel
Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdv. Neural Inform. Process. Syst., 2023
work page 2023
-
[21]
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023
work page 2023
-
[22]
Vidar: Embodied video diffusion model for generalist manipulation, 2025
Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist bimanual manipulation.arXiv preprint arXiv:2507.12898, 2025. 19
-
[23]
Veo: A text-to-video generation system.Google DeepMind Technical Report, 2025
Google DeepMind. Veo: A text-to-video generation system.Google DeepMind Technical Report, 2025
work page 2025
-
[24]
Prediction with action: Visual policy learning via joint denoising process
Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. InAdv. Neural Inform. Process. Syst., 2024
work page 2024
-
[25]
Mastering diverse control tasks through world models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 2025
work page 2025
-
[26]
Td-mpc2: Scalable, robust world models for continuous control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. InInt. Conf. Learn. Represent., 2024
work page 2024
-
[27]
Video prediction policy: A generalist robot policy with predictive visual representations
Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InInt. Conf. Mach. Learn., 2025
work page 2025
-
[28]
Thinkact: Vision-language-action reasoning via reinforced visual latent planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning. InAdv. Neural Inform. Process. Syst., 2025
work page 2025
-
[29]
π0.5: A generalist robot policy with flow matching and world models
Physical Intelligence et al. π0.5: A generalist robot policy with flow matching and world models. InConference on Robot Learning (CoRL), 2025
work page 2025
-
[30]
Mixture of horizons in action chunking.arXiv preprint arXiv:2511.19433, 2025
Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, and Mingyu Ding. Mixture of horizons in action chunking.arXiv preprint arXiv:2511.19433, 2025
-
[31]
Droid: A large-scale in-the-wild robot manipulation dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InRobotics: Science and Systems, 2024
work page 2024
-
[32]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026
work page internal anchor Pith review arXiv 2026
-
[34]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[35]
Kling ai.https://klingai.kuaishou.com/, 2024
Kuaishou. Kling ai.https://klingai.kuaishou.com/, 2024
work page 2024
-
[36]
Deformnet: Latent space modeling and dynamics prediction for deformable object manipulation
Chenchang Li, Zihao Ai, Tong Wu, Xiaosa Li, Wenbo Ding, and Huazhe Xu. Deformnet: Latent space modeling and dynamics prediction for deformable object manipulation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14770–14776. IEEE, 2024
work page 2024
-
[37]
Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025
-
[38]
arXiv preprint arXiv:2509.09674 , year=
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025
- [39]
-
[40]
Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. InRobotics: Science and Systems, 2025
work page 2025
-
[41]
Propagation networks for model-based control under partial observation
Yunzhu Li, Jiajun Wu, Jun-Yan Zhu, Joshua B Tenenbaum, Antonio Torralba, and Russ Tedrake. Propagation networks for model-based control under partial observation. In2019 International Conference on Robotics and Automation (ICRA), pages 1205–1211. IEEE, 2019
work page 2019
-
[42]
Dreamitate: Real-world visuomotor policy learning via video generation
Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[43] Weixin Liang, Lili Yu, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. Transactions on Machine Learning Research, 2025.
[44] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025.
[45] Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. In Int. Conf. Learn. Represent., 2025.
[46] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In Int. Conf. Learn. Represent., 2023.
[47] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. In Adv. Neural Inform. Process. Syst., 2023.
[48] Fangchen Liu, Chuanyu Li, Yihua Qin, Austin Shaw, Jing Xu, Pieter Abbeel, and Rui Chen. Vitamin: Learning contact-rich tasks through robot-free visuo-tactile manipulation interface. arXiv preprint arXiv:2504.06156, 2025.
[49] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: A diffusion foundation model for bimanual manipulation. In Int. Conf. Learn. Represent., 2025.
[50] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In Int. Conf. Learn. Represent., 2023.
[51] Zeyi Liu, Cheng Chi, Eric Cousineau, Naveen Kuppuswamy, Benjamin Burchfiel, and Shuran Song. Maniwav: Learning robot manipulation from in-the-wild audio-visual data. arXiv preprint arXiv:2406.19464, 2024.
[52] Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):4950, 2018.
[53] Open X-Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
[54] OpenAI. Video generation models as world simulators. OpenAI Technical Report, 2024.
[55] Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas. arXiv preprint arXiv:2512.15692, 2025.
[56] Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, et al. Genie 2: A large-scale foundation world model. https://deepmind.google/discover/blog/genie-2-a-large-scale-fou..., 2024.
[57] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. In Robotics: Science and Systems, 2025.
[58] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model. In Robotics: Science and Systems, 2025.
[59] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[60] Omar Rayyan, John Abanes, Mahmoud Hafez, Anthony Tzes, and Fares Abu-Dakka. Mv-umi: A scalable multi-view interface for cross-embodiment learning. arXiv preprint arXiv:2509.18757, 2025.
[61] Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. In Int. Conf. Learn. Represent., 2025.
[62] Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. In Conference on Robot Learning (CoRL), 2025.
[63] Bokui Shen, Zhenyu Jiang, Christopher Choy, Silvio Savarese, Leonidas J. Guibas, Anima Anandkumar, and Yuke Zhu. Action-conditional implicit visual dynamics for deformable object manipulation. The International Journal of Robotics Research, 43(4):437–455, 2024.
[64] Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963, 2025.
[65] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025.
[66] Haochen Shi, Huazhe Xu, Zhiao Huang, Yunzhu Li, and Jiajun Wu. Robocraft: Learning to see, simulate, and shape elasto-plastic objects in 3d with graph networks. The International Journal of Robotics Research, 43(4):533–549, 2024.
[67] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.
[68] Ajay Sridhar, Jennifer Pan, Satvik Sharma, and Chelsea Finn. Memer: Scaling up memory for robot control via experience retrieval. arXiv preprint arXiv:2510.20328, 2025.
[69] Deborah Sulsky, Shi-Jian Zhou, and Howard L. Schreyer. Application of a particle-in-cell method to solid mechanics. Computer Physics Communications, 87(1-2):236–252, 1995.
[70] Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, and Song Han. Vlash: Real-time vlas via future-state-aware asynchronous inference. arXiv preprint arXiv:2512.01031, 2025.
[71] Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, et al. Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675, 2025.
[72] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, et al. Octo: An open-source generalist robot policy. In Robotics: Science and Systems, 2024.
[73] Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Seer: Predictive inverse dynamics models are scalable learners for robotic manipulation. In Int. Conf. Learn. Represent., 2025.
[74] Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651, 2025.
[75] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024.
[76] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Adv. Neural Inform. Process. Syst., 2017.
[77] Yixuan Wang, Yunzhu Li, Katherine Driggs-Campbell, Li Fei-Fei, and Jiajun Wu. Dynamic-resolution model learning for object pile manipulation. arXiv preprint arXiv:2306.16700, 2023.
[78] Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model. arXiv preprint arXiv:2506.19850, 2025.
[79] WanTeam. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[80] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. Advances in Neural Information Processing Systems, 28, 2015.