pith. machine review for the scientific record.

arxiv: 2604.19734 · v1 · submitted 2026-04-21 · 💻 cs.RO · cs.AI

Recognition: unknown

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:59 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: cross-embodiment transfer · humanoid robotics · latent action tokenizer · visual anchoring · policy learning · world modeling · human data distillation · unified representation

The pith

UniT creates a shared latent space of physical intents by anchoring human and humanoid actions to their common visual outcomes, allowing direct transfer of human data into robot policies and simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the shortage of robot training data by drawing on abundant human videos instead. It rests on the idea that different body structures still produce matching visual results for the same underlying physical actions. A tri-branch reconstruction process lets actions predict visuals and visuals recover actions, stripping away body-specific noise to isolate core intents. These intents are encoded as discrete tokens that both humans and humanoids can use, which then power two applications: one that learns control policies from human examples and another that models the world for generating humanoid videos. The approach is shown to improve data efficiency, out-of-distribution performance, and zero-shot task transfer on both simulated and real humanoid setups.

Core claim

UniT employs a tri-branch cross-reconstruction mechanism in which actions predict vision to ground kinematics in observable outcomes and vision reconstructs actions to remove visual confounders, while a fusion branch merges the purified signals into a single discrete latent space of embodiment-agnostic physical intents; this space then serves as the common language for both policy learning from human data and direct human-to-humanoid transfer in world modeling.

What carries the argument

The tri-branch cross-reconstruction mechanism that forms the Unified Latent Action Tokenizer via Visual Anchoring, which links actions to their visual consequences and filters confounders to yield a shared discrete space of physical intents.
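To make that mechanism concrete, the sketch below wires up a tri-branch tokenizer with one shared codebook along the lines of Figure 3: an action branch, a vision branch over the observation pair, and a fusion branch, all quantized into the same token space and trained with cross-reconstruction losses. The layer sizes, the flattened action chunk, the use of precomputed visual features, and the single-level vector quantizer standing in for the paper's RQ-VAE are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a tri-branch tokenizer with a shared codebook (assumed
# shapes; single-level VQ in place of the paper's RQ-VAE).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCodebook(nn.Module):
    """One codebook quantizing all three branches into the same token space."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z):                        # z: (B, dim)
        d = torch.cdist(z, self.codes.weight)    # distance to every code
        idx = d.argmin(dim=-1)                   # nearest-code token ids
        z_q = self.codes(idx)
        z_q = z + (z_q - z).detach()             # straight-through estimator
        commit = F.mse_loss(z, z_q.detach())
        return z_q, idx, commit

class TriBranchTokenizer(nn.Module):
    def __init__(self, act_dim=26, vis_dim=768, latent=256):
        super().__init__()
        self.enc_action = nn.Sequential(nn.Linear(act_dim, latent), nn.GELU(),
                                        nn.Linear(latent, latent))
        self.enc_vision = nn.Sequential(nn.Linear(2 * vis_dim, latent), nn.GELU(),
                                        nn.Linear(latent, latent))
        self.enc_fused  = nn.Linear(2 * latent, latent)
        self.codebook   = SharedCodebook(dim=latent)
        self.dec_vision = nn.Linear(latent, vis_dim)   # predict features of o_{t+k}
        self.dec_action = nn.Linear(latent, act_dim)   # recover the action chunk

    def forward(self, feat_t, feat_tk, action):
        z_a = self.enc_action(action)                            # action branch
        z_v = self.enc_vision(torch.cat([feat_t, feat_tk], -1))  # vision branch
        z_f = self.enc_fused(torch.cat([z_a, z_v], -1))          # fusion branch
        qa, _, ca = self.codebook(z_a)
        qv, _, cv = self.codebook(z_v)
        qf, ids, cf = self.codebook(z_f)                         # unified tokens
        losses = {
            # cross-reconstruction: actions predict vision, vision recovers actions
            "a2v": F.mse_loss(self.dec_vision(qa), feat_tk),
            "v2a": F.mse_loss(self.dec_action(qv), action),
            "f2v": F.mse_loss(self.dec_vision(qf), feat_tk),
            "f2a": F.mse_loss(self.dec_action(qf), action),
            "commit": ca + cv + cf,
        }
        return ids, losses
```

Training such a module on mixed human and humanoid (observation pair, action chunk) examples is what would push both embodiments toward the same token vocabulary; whether the paper's RQ-VAE and loss weighting differ materially from this shape is exactly what the design ablations have to show.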

If this is right

  • Predicting the unified tokens lets VLA-UniT leverage diverse human data to reach state-of-the-art data efficiency and robust out-of-distribution generalization, including zero-shot task transfer on humanoid simulation benchmarks and real-world deployments.
  • Conditioning on the unified tokens lets WM-UniT align cross-embodiment dynamics, enabling direct human-to-humanoid action transfer and higher action controllability in generated humanoid videos (both downstream uses are sketched after this list).
  • The resulting representation is highly aligned, as confirmed by t-SNE plots where human and humanoid features merge into a single manifold.
  • The framework supplies a scalable route for converting large volumes of human knowledge into general-purpose humanoid capabilities.
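As referenced above, here is a minimal, non-authoritative sketch of how the unified tokens could plug into the two downstream applications described in Figure 4: a VLA head that predicts UniT tokens and feeds them to a lightweight action expert, and a world-model conditioner that uses the tokens in place of raw actions. All class names, dimensions, and the FiLM-style conditioning are assumptions for illustration, not the paper's architecture.

```python
# Sketch of the two downstream uses of unified tokens (VLA-UniT, WM-UniT).
import torch
import torch.nn as nn

NUM_TOKENS, DIM = 512, 256   # assumed codebook size and token embedding dim

class VLAUniTHead(nn.Module):
    """Predict unified tokens, then decode embodiment-specific controls."""
    def __init__(self, ctx_dim=1024, act_dim=26):
        super().__init__()
        self.token_head = nn.Linear(ctx_dim, NUM_TOKENS)      # next-UniT-token logits
        self.token_embed = nn.Embedding(NUM_TOKENS, DIM)
        self.action_expert = nn.Sequential(                   # lightweight action expert
            nn.Linear(ctx_dim + DIM, 256), nn.GELU(), nn.Linear(256, act_dim))

    def forward(self, vl_context):                            # vl_context: (B, ctx_dim)
        logits = self.token_head(vl_context)                  # supervised by tokenizer ids
        tok = self.token_embed(logits.argmax(-1))
        controls = self.action_expert(torch.cat([vl_context, tok], -1))
        return logits, controls

class WMUniTConditioner(nn.Module):
    """Condition a video world model on unified tokens instead of raw actions."""
    def __init__(self, frame_dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(NUM_TOKENS, DIM)
        self.film = nn.Linear(DIM, 2 * frame_dim)             # scale/shift conditioning

    def forward(self, frame_feats, token_ids):                # (B, T, frame_dim), (B, T)
        scale, shift = self.film(self.token_embed(token_ids)).chunk(2, dim=-1)
        return frame_feats * (1 + scale) + shift              # conditioned frame features
```

The point of the sketch is the interface, not the internals: because both the policy head and the world model consume the same discrete vocabulary, human-derived tokens can supervise or condition either one without an embodiment-specific adapter.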

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same visual-anchoring principle could be applied to transfer knowledge between other robot body types that differ in kinematics but share observable physical effects.
  • Pretraining world models primarily on human video using these tokens might reduce or eliminate the need to collect large-scale robot-specific interaction data.
  • The discrete tokens could allow seamless mixing of human and robot demonstrations in a single training run without additional alignment modules.

Load-bearing premise

Different body structures and movement styles still produce the same visual results for the same physical intent, so that vision can reliably link and purify actions across embodiments.

What would settle it

t-SNE visualizations that fail to show human and humanoid features converging into one shared manifold, or policy-learning experiments in which UniT shows no gains in data efficiency or zero-shot transfer relative to baselines that do not use the unified tokens.
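One half of that test is cheap to run. Below is a minimal sketch of the t-SNE check, assuming access to tokenizer features for matched human and humanoid clips; the feature arrays and perplexity are placeholders, and the paper's exact settings are not given here.

```python
# Embed human and humanoid UniT features with t-SNE and color by embodiment.
# Visibly separated clusters would undercut the shared-manifold claim.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embodiment_alignment(human_feats, robot_feats, perplexity=30):
    """human_feats, robot_feats: (N, D) latent features from the tokenizer."""
    feats = np.concatenate([human_feats, robot_feats], axis=0)
    labels = np.array([0] * len(human_feats) + [1] * len(robot_feats))
    xy = TSNE(n_components=2, perplexity=perplexity, init="pca",
              random_state=0).fit_transform(feats)
    for lab, name in [(0, "human"), (1, "humanoid")]:
        pts = xy[labels == lab]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, alpha=0.5, label=name)
    plt.legend()
    plt.title("t-SNE of UniT latent features by embodiment")
    plt.show()
```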

Figures

Figures reproduced from arXiv: 2604.19734 by Boyu Chen, Jerry Bai, Lu Qiu, Yi Chen, Yixiao Ge, Yuying Ge.

Figure 1
Figure 1: Overview of the UniT Framework, which establishes a… view at source ↗
Figure 2
Figure 2: Comparison of Latent Action Architectures. (a) Action-Only: Relies on action reconstruction (Za), suffering from distribution misalignment without visual grounding. (b) Vision-Only: Infers representations (Zv) from pixels, which entangles with low-level appearances and misses fine-grained physical details. (c) Decoupled Vision-Action: Encodes modalities into disconnected spaces (Zv, Za), lacking the explic… view at source ↗
Figure 3
Figure 3: Architecture of UniT. Heterogeneous cross-embodiment vision pairs (o_t, o_{t+k}) and action chunks a_{t:t+k} are encoded into vision, action, and fused features via tri-branch encoders. A shared RQ-VAE codebook quantizes all three branches into a unified discrete space, yielding embodiment-agnostic Unified Latent Action tokens. Both vision and action decoders reconstruct from these shared tokens, enforcing cross-… view at source ↗
Figure 4
Figure 4: Downstream Applications of UniT Tokens. (a) VLA-UniT: Integrates UniT token prediction into a VLA architecture. Rather than direct action regression, the model predicts Unified Latent Action tokens in the shared space, while a lightweight action expert generates embodiment-specific controls from the same vision-language context, covering head pose, waist pose, wrist pose, and hand pose. (b) WM-UniT: Employ… view at source ↗
Figure 5
Figure 5: Real-world in-domain tasks. We design two tasks analogous to EgoDex subsets: Pick & Place (pick an object and place it into a box) and Pouring (grasp a bottle and a cup, then pour). [Panel labels: (a) Geometry Generalization, (b) Distractor Generalization, (c) Target Level Generalization, (e) Combinational Generalization.] view at source ↗
Figure 6
Figure 6: Real-world OOD generalization scenarios. (a)–(d): human demonstrations introduce complementary variations absent from robot data; we co-train and evaluate on human-introduced conditions. (a) Geometry (new shapes, same affordance). (b) Target (new placement destinations). (c) Distractor (unseen objects introduced as distractors). (d) Visual (unseen surfaces). (e) Combinational (instruction-based disambiguat… view at source ↗
Figure 7
Figure 7. view at source ↗
Figure 8
Figure 8: Robustness to action noise. Gaussian noise of varying intensity σ is injected into EgoDex action trajectories. Here, σ denotes the relative noise level normalized by the global action standard deviation of the dataset, so the injected noise magnitude is σ × global std. Reconstruction quality is evaluated using mean squared error (MSE) between reconstructed actions and the original clean actions. (Left) Abs… view at source ↗
Figure 9
Figure 9: Overall performance on RoboCasa GR1 Tabletop Simulation with full training data. VLA-UniT achieves the strongest overall policy performance among all compared methods. [Plot: success rate of GR00T-Qwen2.5 vs. VLA-UniT at 24×100 and 24×1000 trajectories, across Pick and Place, Articulated, In-Domain Average, Unseen Appearance, Unseen Combinations, Unseen Object Types, and OOD Average.] view at source ↗
Figure 10
Figure 10: (Left) Data efficiency results on RoboCasa GR1 under full-data and few-shot settings. (Right) Impact of incorporating EgoDex basic_pick_place human demonstrations on few-shot performance in RoboCasa GR1. UniT improves both sample efficiency and the utility of human co-training. view at source ↗
Figure 11
Figure 11: (Left) In-domain results on the IRON-R01-1.11 robot. (Right) OOD generalization results on the IRON-R01-1.11 robot. Human demonstration improves VLA-UniT's real-world execution and OOD robustness. view at source ↗
Figure 12
Figure 12: Zero-shot task transfer on the IRON-R01-1.11 robot. (Left) Task illustration. (Right) Success rates on the unseen stacking task. UniT with human co-training shows the clearest transfer to the unseen task. view at source ↗
Figure 13
Figure 13: Human-to-robot conditioning. Top: human reference video. Bottom: human reference actions condition robot video generation via UniT vs. raw action. UniT yields more faithful cross-embodiment conditioning than raw actions. view at source ↗
Figure 14
Figure 14. view at source ↗
Figure 15
Figure 15: Tokenizer paradigm ablation on RoboCasa GR1 (full data with human co-training). (Left) OOD generalization. (Right) In-domain Pick & Place performance. view at source ↗
read the original abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergies these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces UniT (Unified Latent Action Tokenizer via Visual Anchoring), a tri-branch cross-reconstruction framework that anchors actions to visual outcomes and reconstructs actions from vision to produce a shared discrete latent space of embodiment-agnostic physical intents. It applies this to policy learning (VLA-UniT) for claimed SOTA data efficiency, OOD generalization, and zero-shot humanoid task transfer from human data, and to world modeling (WM-UniT) for direct human-to-humanoid action transfer in video generation, with cross-embodiment alignment verified via t-SNE visualizations of converging human and humanoid features.

Significance. If the tri-branch mechanism and resulting unified tokens genuinely enable embodiment-agnostic transfer without kinematic confounders, the approach could meaningfully address humanoid data scarcity by distilling human egocentric data into scalable policies and world models.

major comments (2)
  1. [Abstract] Claims of state-of-the-art performance, zero-shot task transfer, robust OOD generalization, and enhanced action controllability are asserted without quantitative metrics, baselines, error bars, dataset sizes, or ablation results, preventing assessment of whether the tri-branch mechanism delivers the stated improvements.
  2. [Abstract] The central claim of a highly aligned cross-embodiment representation rests on t-SNE visualizations showing convergence into a shared manifold; t-SNE is sensitive to perplexity, initialization, and distance metric, and can suggest alignment even when representations remain distinguishable by embodiment. No quantitative corroboration (e.g., embodiment classification accuracy or distribution distances) is referenced.
minor comments (1)
  1. The manuscript would benefit from explicit definitions of the tri-branch loss terms, the discrete latent tokenizer, and the fusion branch architecture, including any equations governing the action-to-vision and vision-to-action reconstruction.
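For orientation only, one conventional way such terms could be written, consistent with the abstract's description of the tri-branch mechanism but not taken from the paper's own equations. Here E_a, E_v, E_f denote the action, vision, and fusion encoders, D_a, D_v the decoders, q(·) the shared-codebook quantizer, and φ a visual feature extractor; the whole block is a guess at the general shape.

```latex
% Hypothetical tri-branch objective, assuming a standard VQ-style recipe;
% not the authors' definitions.
\[
\begin{aligned}
\mathcal{L}_{a \to v} &= \bigl\| D_v\bigl(q(E_a(a_{t:t+k}))\bigr) - \phi(o_{t+k}) \bigr\|_2^2
  && \text{(actions predict the visual outcome)} \\
\mathcal{L}_{v \to a} &= \bigl\| D_a\bigl(q(E_v(o_t, o_{t+k}))\bigr) - a_{t:t+k} \bigr\|_2^2
  && \text{(vision recovers the action chunk)} \\
\mathcal{L}_{\mathrm{fuse}} &= \bigl\| D_v(q(z_f)) - \phi(o_{t+k}) \bigr\|_2^2
  + \bigl\| D_a(q(z_f)) - a_{t:t+k} \bigr\|_2^2,
  \qquad z_f = E_f\bigl(E_a(a_{t:t+k}),\, E_v(o_t, o_{t+k})\bigr) \\
\mathcal{L} &= \mathcal{L}_{a \to v} + \mathcal{L}_{v \to a}
  + \mathcal{L}_{\mathrm{fuse}} + \beta\, \mathcal{L}_{\mathrm{commit}}
\end{aligned}
\]
```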

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We address the major comments point by point below, clarifying the role of the abstract and the supporting evidence in the full paper while committing to revisions that improve the presentation of quantitative results and alignment metrics.

read point-by-point responses
  1. Referee: [Abstract] Claims of state-of-the-art performance, zero-shot task transfer, robust OOD generalization, and enhanced action controllability are asserted without quantitative metrics, baselines, error bars, dataset sizes, or ablation results, preventing assessment of whether the tri-branch mechanism delivers the stated improvements.

    Authors: The abstract is written as a concise high-level summary of the contributions and key outcomes. Comprehensive quantitative support—including SOTA comparisons, baselines, error bars, dataset sizes, and ablations demonstrating the tri-branch mechanism's contributions—is provided in the main text (Sections 4.2–4.4, Tables 1–5, and Figures 4–8). To directly address the concern and make the abstract more self-contained, we will revise it to incorporate representative quantitative highlights (e.g., data-efficiency gains and zero-shot success rates) with pointers to the detailed experiments. revision: yes

  2. Referee: [Abstract] The central claim of a highly aligned cross-embodiment representation rests on t-SNE visualizations showing convergence into a shared manifold; t-SNE is sensitive to perplexity, initialization, and distance metric, and can suggest alignment even when representations remain distinguishable by embodiment. No quantitative corroboration (e.g., embodiment classification accuracy or distribution distances) is referenced.

    Authors: We agree that t-SNE visualizations are inherently qualitative and sensitive to parameter choices, and that they should not stand alone as evidence of alignment. The manuscript presents the t-SNE plots as illustrative of observed feature convergence, which is corroborated by the downstream empirical results in cross-embodiment policy learning and world modeling. To strengthen the claim, we will add quantitative corroboration in the revision, including linear-probe embodiment classification accuracy and distribution-distance metrics (e.g., MMD or Wasserstein distance) between human and humanoid latent features, to be reported in Section 4.4. revision: yes
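For reference, the corroboration proposed above is straightforward to compute once human and humanoid latent features are in hand. The sketch below shows an RBF-kernel MMD estimate plus a linear-probe embodiment classifier; feature arrays, the kernel bandwidth, and the probe choice are placeholders, not the authors' protocol.

```python
# Quantitative alignment checks: squared MMD between feature sets and a
# linear probe separating embodiments (near-chance accuracy supports alignment).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel; x, y: (N, D), (M, D)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def embodiment_probe_accuracy(human_feats, robot_feats):
    """Cross-validated accuracy of a linear embodiment classifier (0.5 ~ aligned)."""
    X = np.concatenate([human_feats, robot_feats])
    y = np.concatenate([np.zeros(len(human_feats)), np.ones(len(robot_feats))])
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
```

A small MMD together with near-chance probe accuracy would support the shared-manifold reading; a large MMD or an easily trained probe would undercut it regardless of how the t-SNE plots look.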

Circularity Check

0 steps flagged

No significant circularity; empirical framework without self-referential derivations or fitted predictions

full rationale

The paper introduces UniT as an architectural framework (tri-branch cross-reconstruction plus fusion branch) grounded in an explicit philosophical premise about universal visual consequences of kinematics. No equations, parameter-fitting procedures, or predictive claims that reduce to fitted inputs appear in the abstract or described content. Alignment is asserted via t-SNE visualizations as external empirical verification rather than a quantity defined in terms of the model's own outputs. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced as load-bearing. The central claims rest on design choices and reported experimental outcomes, not on any step that equates to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; no explicit free parameters, mathematical axioms, or invented physical entities beyond the framework itself are detailed.

axioms (1)
  • domain assumption Heterogeneous kinematics share universal visual consequences
    Grounding philosophy stated in the abstract that enables the cross-reconstruction mechanism.
invented entities (1)
  • Unified Latent Action Tokenizer (UniT) no independent evidence
    purpose: Produces embodiment-agnostic discrete tokens of physical intent
    New framework component introduced to serve as the shared language.

pith-pipeline@v0.9.0 · 5602 in / 1248 out tokens · 31310 ms · 2026-05-10T01:59:30.854946+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 25 canonical work pages · 10 internal anchors

  1. Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-N-On: Scaling egocentric manipulation with in-the-wild and on-task data. arXiv preprint arXiv:2511.15704, 2025.
  2. Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, and Deepak Pathak. DexWild: Dexterous human interactions for in-the-wild robot policies. arXiv preprint arXiv:2505.07813, 2025.
  3. Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.
  4. Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024.
  5. Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
  6. An Dinh Vuong, Minh Nhat Vu, Dong An, and Ian Reid. Action tokenizer matters in in-context imitation learning. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496. IEEE, 2025.
  7. Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. arXiv preprint arXiv:2407.15840, 2024.
  8. Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. VQ-VLA: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11089–11099, 2025.
  9. Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025.
  10. Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024.
  11. Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
  12. Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Xindong He, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025.
  13. Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Metis: Multi-source egocentric training for integrated dexterous vision-language-action model. arXiv preprint arXiv:2511.17366, 2025.
  14. Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
  15. Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022.
  16. Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
  17. Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. Advances in Neural Information Processing Systems, 37:124420–124450, 2024.
  18. Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025.
  19. Chengbo Yuan, Rui Zhou, Mengzhen Liu, Yingdong Hu, Shengjie Wang, Li Yi, Chuan Wen, Shanghang Zhang, and Yang Gao. MotionTrans: Human VR data enable motion-level learning for robotic manipulation policies. arXiv preprint arXiv:2509.17759, 2025.
  20. Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025.
  21. Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human manipulation enhanced bimanual robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026.
  22. Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of human to robot transfer in vision-language-action models. arXiv preprint arXiv:2512.22414, 2025.
  23. Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-X: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.
  24. Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, et al. XR-1: Towards versatile vision-language-action models via learning unified vision-motion representations. arXiv preprint arXiv:2511.02776, 2025.
  25. Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
  26. NVIDIA, Johan Bjorck, Nikita Cherniadev, Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, et al. GR00T N1: An open foundation model for generalist humanoid robots.
  27. Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
  28. Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
  29. Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
  30. Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
  31. Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025.
  32. Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025.
  33. Julian Quevedo, Percy Liang, and Sherry Yang. Evaluating robot policies in a world model. arXiv e-prints, 2025.
  34. NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, et al.
  35. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  36. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv preprint.
  37. NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, et al. GR00T N1: An open foundation model for generalist humanoid robots, 2025.
  38. Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025.
  39. Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, et al.
  40. Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025.
  41. Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. FLARE: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659, 2025.
  42. NVIDIA GEAR Team, Allison Azzolini, Johan Bjorck, Valts Blukis, et al. GR00T N1.6: An improved open foundation model for generalist humanoid robots. https://research.nvidia.com/labs/gear/gr00t-n1_6/, December 2025.
  43. Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  44. Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. 2019.
  45. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.