pith. machine review for the scientific record.

arxiv: 2512.16811 · v2 · submitted 2025-12-18 · 💻 cs.CV · cs.RO

Recognition: 2 theorem links

· Lean Theorem

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Pith reviewed 2026-05-16 21:24 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords Vision-Language-Action · Robotic Manipulation · 3D Geometry · Predictive Kinematics · Gaussian Splatting · Depth Rendering · Training-Time Supervision

The pith

GeoPredict augments VLA policies with predictive 3D kinematic trajectories and Gaussian geometry that supervise training but add no decoding cost at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language-action models for robots are mostly reactive and limited to 2D image features, which makes them unreliable when tasks demand accurate spatial reasoning in three dimensions. GeoPredict adds two predictive modules during training: one that forecasts multi-step 3D keypoint paths of the robot arm from motion history, and another that predicts future workspace geometry using 3D Gaussians refined along those paths. Both modules supply supervision only through rendered depth images, and their 3D decoding is never invoked at test time. The resulting policy therefore receives richer geometric signals during training while adding only lightweight query tokens at inference. The paper reports its largest gains on manipulation tasks that stress precise 3D positioning and geometry.

Core claim

GeoPredict augments a continuous-action VLA policy with a trajectory-level predictive kinematic module that encodes motion history and outputs multi-step 3D keypoint trajectories, together with a predictive 3D Gaussian geometry module that forecasts workspace structure and refines it along the predicted tracks. These modules supply training-time supervision exclusively via depth-based rendering; at inference the policy uses only lightweight additional query tokens and performs no 3D decoding or reconstruction.
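The split between training-time supervision and inference described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names `run_model` and `training_objective`, the placeholder heads, and the weights `w_kinematic` and `w_geometry` do not come from the paper, which (per the referee report) does not state how the auxiliary losses are weighted.

```python
def run_model(features, training):
    """Toy stand-in for the policy: the action head always runs,
    while the two predictive 3D heads run only during training."""
    outputs = {"action": [0.5 * f for f in features]}  # placeholder action head
    if training:
        # Placeholder decoders standing in for keypoint-track prediction
        # and depth rendering from predicted 3D Gaussians.
        outputs["future_tracks"] = [f + 1.0 for f in features]
        outputs["rendered_depth"] = [2.0 * f for f in features]
    return outputs


def training_objective(policy_loss, kinematic_loss, geometry_loss,
                       w_kinematic=0.1, w_geometry=0.1):
    """Weighted sum of the policy loss and the two auxiliary losses.
    The weights are hypothetical; the source does not report them."""
    return policy_loss + w_kinematic * kinematic_loss + w_geometry * geometry_loss


train_out = run_model([1.0, 2.0], training=True)
infer_out = run_model([1.0, 2.0], training=False)
print(sorted(train_out))  # action plus the two auxiliary outputs
print(sorted(infer_out))  # action only
```

The point of the sketch is only that the auxiliary branches contribute gradient signal through the summed objective and are skipped entirely when `training` is false.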

What carries the argument

The trajectory-level predictive kinematic module combined with the track-guided 3D Gaussian geometry module, which together supply depth-rendered supervision signals only during training.

If this is right

  • The policy outperforms strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation benchmarks.
  • Gains are largest in geometry-intensive and spatially demanding scenarios.
  • Inference cost stays low because no 3D decoding or reconstruction occurs at runtime.
  • Only extra query tokens are needed at test time to carry the learned priors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training-time rendering supervision could be applied to other action-prediction models that currently lack explicit 3D structure.
  • Future work could test whether the predicted trajectories themselves become usable as open-loop plans when closed-loop feedback is unavailable.
  • The approach may reduce reliance on dense 3D ground-truth labels by turning predicted geometry into an auxiliary training signal.

Load-bearing premise

The geometric and kinematic predictions learned through depth rendering transfer useful information to the final policy without creating a distribution shift between training and deployment.

What would settle it

Performance on geometry-heavy tasks falls back to baseline levels when the kinematic or Gaussian modules are removed or when their predictions are deliberately corrupted during training.

Figures

Figures reproduced from arXiv: 2512.16811 by Boyao Han, Chen Shi, Jingjing Qian, Lei Xiao, Li Jiang, Long Yang, Shaoshuai Shi.

Figure 1: Overview of GeoPredict. Given an instruction, multi-view images and motion history encoded by the Track Encoder, a central LLM Transformer learns two main tasks. First, it predicts multi-timestep 3D keypoint trajectories using learnable Future Track Query. Second, it forecasts future workspace geometry as a predictive 3D Gaussian by processing a 3D Spatial Query through a Voxel Decoder. A track-guided refi…
Figure 2: Block-wise Causal Attention Mechanism. For simplicity, the detailed attention pathways from the 3D Token and State Token blocks to other blocks are not fully drawn. … tokens), (3) 3D Query tokens (future track queries and spatial queries), (4) State Token (proprioceptive token) and (5) Action Noise tokens (used by flow matching). Attention is fully bidirectional within each block, enabling rich intra-block…
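A block-wise causal mask of the kind this caption names can be sketched in a few lines. The sketch assumes the simplest reading: tokens attend bidirectionally inside their own block and to every earlier block, never to later ones. The caption notes that some pathways are not drawn, so the paper's actual pattern may differ. The function name and the block sizes are hypothetical.

```python
def blockwise_causal_mask(block_sizes):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed rule: attention is allowed within a block and toward earlier
    blocks, and blocked toward later blocks.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Hypothetical token counts for five consecutive blocks.
sizes = [2, 1, 2, 1, 2]
mask = blockwise_causal_mask(sizes)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Printed row by row, the mask shows square all-`x` blocks on the diagonal (bidirectional attention inside a block) with everything to their left also allowed.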
Figure 3: Real-world Evaluation Suite. These settings aim to evaluate the model’s spatial generalization, geometry generalization and robustness to distractors. Each column represents different trials of the same task. Baselines. Our primary baseline is our VLA backbone, π0 [3], trained without our proposed predictive 3D modules. This comparison directly isolates the contribution of our geometry-aware predictive …
Figure 4 provides a qualitative visualization of our predictive 3DGS geometry module, which compares the predicted future depths at various timesteps (t + 1, t + 10, t + 20). While the initial Gaussians (G_init) capture only the coarse scene layout, the refined Gaussians (G_total) exhibit significantly sharper geometric details, particularly surrounding the robotic arm. This visually confirms that our refinement me…
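For readers unfamiliar with how a depth value is obtained from a set of Gaussians, the following sketch shows front-to-back alpha compositing along a single ray, the usual rule in Gaussian-splatting renderers, followed by an L1 comparison against a target depth. Both the compositing rule and the L1 loss are assumptions about what depth-based rendering supervision might look like; the text reproduced here does not give the paper's formula. The names `render_depth` and `l1_depth_loss` are hypothetical.

```python
def render_depth(depths, alphas):
    """Composite depth along one ray from Gaussians sorted front to back.

    Each Gaussian contributes its depth weighted by its opacity times the
    transmittance left over from the Gaussians in front of it.
    """
    transmittance = 1.0
    depth = 0.0
    for d, a in zip(depths, alphas):
        weight = transmittance * a
        depth += weight * d
        transmittance *= 1.0 - a
    return depth


def l1_depth_loss(predicted, target):
    """Mean absolute difference between rendered and target depths."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Two Gaussians on one ray: a half-transparent one at depth 1.0
# in front of an opaque one at depth 2.0.
pixel_depth = render_depth([1.0, 2.0], [0.5, 1.0])
print(pixel_depth)  # 1.5
```

Some renderers additionally divide by the accumulated opacity; that normalization is omitted here to keep the sketch short.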
original abstract

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The manuscript proposes GeoPredict, a geometry-aware augmentation to Vision-Language-Action (VLA) models. It adds a trajectory-level module that predicts multi-step 3D keypoint trajectories from motion history and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement. Both modules supply training-time supervision exclusively through depth-based rendering losses; at inference only lightweight query tokens are added to the base policy, with no 3D decoding performed. The central claim is that this yields consistent outperformance over strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation tasks, particularly in geometry-intensive scenarios.

Significance. If the experimental claims are substantiated, the approach would provide a practical route to embedding 3D geometric priors into continuous-action VLA policies without raising inference cost, addressing a recognized limitation of current 2D-centric models in spatially precise manipulation.

major comments (4)
  1. [Abstract] Abstract: the claim of 'consistent outperformance' is presented without any quantitative metrics, standard deviations, or error bars, preventing assessment of effect size or statistical reliability.
  2. [Method] Method (training objective): no description is given of how the depth-rendering losses from the kinematic and Gaussian modules are weighted or balanced against the primary policy loss, leaving the training dynamics and potential for auxiliary-signal dominance unexamined.
  3. [Experiments] Experiments: no ablation studies isolate the contribution of the predictive kinematic module, the 3D Gaussian module, or the depth-rendering supervision itself versus simple capacity increases, so the attribution of gains to geometric priors remains unverified.
  4. [Experiments] Experiments: no quantitative comparison of train versus test distribution shift is reported for the policy when trained with 3D rendering supervision yet evaluated without 3D decoding, which directly tests the central assumption that the supervision transfers without mismatch penalty.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'track-guided refinement' is used without a one-sentence definition or pointer to its implementation.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript, including quantitative details, clarifications, and additional experiments.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent outperformance' is presented without any quantitative metrics, standard deviations, or error bars, preventing assessment of effect size or statistical reliability.

    Authors: We agree that the abstract should provide quantitative support for the claim. In the revised version, we will include specific success rates (with standard deviations) from the RoboCasa Human-50, LIBERO, and real-world experiments to substantiate the consistent outperformance and allow assessment of effect sizes. revision: yes

  2. Referee: [Method] Method (training objective): no description is given of how the depth-rendering losses from the kinematic and Gaussian modules are weighted or balanced against the primary policy loss, leaving the training dynamics and potential for auxiliary-signal dominance unexamined.

    Authors: We will expand the method section to explicitly describe the loss weighting. The revised text will specify the balancing coefficients between the kinematic trajectory depth-rendering loss, the 3D Gaussian geometry depth-rendering loss, and the primary policy loss, along with the hyperparameter search procedure used to avoid auxiliary-signal dominance. revision: yes

  3. Referee: [Experiments] Experiments: no ablation studies isolate the contribution of the predictive kinematic module, the 3D Gaussian module, or the depth-rendering supervision itself versus simple capacity increases, so the attribution of gains to geometric priors remains unverified.

    Authors: We will add a dedicated ablation section in the revision. These studies will systematically disable the kinematic predictor, the 3D Gaussian module, and the depth-rendering supervision, while also comparing against capacity-matched baselines (identical parameter count) to isolate the contribution of the geometric priors. revision: yes

  4. Referee: [Experiments] Experiments: no quantitative comparison of train versus test distribution shift is reported for the policy when trained with 3D rendering supervision yet evaluated without 3D decoding, which directly tests the central assumption that the supervision transfers without mismatch penalty.

    Authors: We will include new quantitative experiments in the revised manuscript that directly measure the train-test distribution shift. These will compare policy performance when trained with full 3D rendering supervision versus inference using only the lightweight query tokens, providing empirical evidence on the mismatch penalty and validating the transfer assumption. revision: yes
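The statistical reporting requested in comment 1 can be sketched as follows. This is an illustrative helper, not anything from the paper: it computes a Wilson score interval for a single success rate and the mean and sample standard deviation across repeated runs. The trial counts and per-seed rates in the usage lines are made-up placeholders, not results.

```python
import math
import statistics


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 ~ 95%)."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half


def mean_and_std(rates):
    """Mean and sample standard deviation of per-seed success rates."""
    return statistics.mean(rates), statistics.stdev(rates)


# Placeholder numbers for illustration only.
low, high = wilson_interval(successes=40, trials=50)
mean_rate, std_rate = mean_and_std([0.8, 0.9, 1.0])
print(f"95% interval: [{low:.3f}, {high:.3f}]")
print(f"mean={mean_rate:.2f} std={std_rate:.2f}")
```

With small per-task trial counts, as is common in real-robot evaluation, the interval is wide, which is why a bare "consistently outperforms" is hard to assess without it.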

Circularity Check

0 steps flagged

No circularity: method is architectural augmentation with external supervision, no derivations or self-referential equations

full rationale

The paper describes GeoPredict as an architectural addition of trajectory prediction and 3D Gaussian modules that supply training-time depth-rendering losses to a VLA policy. Inference uses only added query tokens. No equations, closed-form derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central claim rests on empirical outperformance on RoboCasa, LIBERO, and real-world tasks rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for any derivation because none exists. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes standard VLA training objectives can be augmented with auxiliary depth-rendering losses without destabilizing policy learning; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Depth-based rendering of predicted 3D keypoints and Gaussians provides a useful supervisory signal for 2D-centric VLA policies
    Abstract states these modules serve exclusively as training-time supervision through depth-based rendering.

pith-pipeline@v0.9.0 · 5469 in / 1199 out tokens · 20449 ms · 2026-05-16T21:24:10.006952+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  2. STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 2 Pith papers · 12 internal anchors

  1. [1]

    Paligemma: A versatile 3b vlm for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. CoRR, 2024.

  2. [2]

    Zero-shot robotic manipulation with pre-trained image-editing diffusion models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. In NeurIPS 2023 Workshop on Goal-Conditioned Reinforcement Learning, 2023.

  3. [3]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visio...

  4. [4]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

  5. [5]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.

  6. [6]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.

  7. [7]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.

  8. [8]

    Moto: Latent motion token as the bridging language for learning robot manipulation from videos

    Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025.

  9. [9]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

  10. [10]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023.

  11. [11]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022.

  12. [12]

    Pre-trained text-to-image diffusion models are versatile representation learners for control

    Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, and Tim GJ Rudner. Pre-trained text-to-image diffusion models are versatile representation learners for control. Advances in Neural Information Processing Systems, 37:74182–74210, 2024.

  13. [13]

    Mastering atari with discrete world models

    Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2020.

  14. [14]

    Deep hierarchical planning from pixels

    Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. Advances in Neural Information Processing Systems, 35:26091–26104, 2022.

  15. [15]

    Td-mpc2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.

  16. [16]

    Efficientnerf efficient neural radiance fields

    Tao Hu, Shu Liu, Yilun Chen, Tiancheng Shen, and Jiaya Jia. Efficientnerf efficient neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12902–12911, 2022.

  17. [17]

    Video prediction policy: A generalist robot policy with predictive visual representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. In Forty-second International Conference on Machine Learning, 2025.

  18. [18]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023.

  19. [19]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.

  20. [20]

    BridgeVLA: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961, 2025.

  21. [21]

    Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. CoRR, 2024.

  22. [22]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025.

  23. [23]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In ICLR, 2024.

  24. [24]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  25. [25]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

  26. [26]

    Geometry-aware 4d video generation for robot manipulation

    Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, and Shuran Song. Geometry-aware 4d video generation for robot manipulation. arXiv preprint arXiv:2507.01099, 2025.

  27. [27]

    Marching cubes: A high resolution 3d surface construction algorithm

    William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998.

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.

  29. [29]

    Gwm: Towards scalable gaussian world models for robotic manipulation

    Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9263–9274, 2025.

  30. [30]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022.

  31. [31]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  32. [32]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 41(4):1–15, 2022.

  33. [33]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.

  34. [34]

    Pointnet: Deep learning on point sets for 3d classification and segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

  35. [35]

    Wristworld: Generating wrist-views via 4d world models for robotic manipulation

    Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, and Shanghang Zhang. Wristworld: Generating wrist-views via 4d world models for robotic manipulation. arXiv preprint arXiv:2510.07313,

  36. [36]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.

  37. [37]

    Scube: Instant large-scale scene reconstruction using voxsplats

    Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats. Advances in Neural Information Processing Systems, 37:97670–97698, 2024.

  38. [38]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.

  39. [39]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20310–20320, 2024.

  40. [40]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In ICLR, 2024.

  41. [41]

    3d shapenets: A deep representation for volumetric shapes

    Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.

  42. [42]

    Point-nerf: Point-based neural radiance fields

    Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022.

  43. [43]

    Street gaussians: Modeling dynamic urban scenes with gaussian splatting

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024.

  44. [44]

    4d-vla: spatiotemporal vision-language-action pretraining with cross-scene calibration

    Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration. arXiv preprint arXiv:2506.22242, 2025.

  45. [45]

    Storm: Efficient stochastic transformer based world models for reinforcement learning

    Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36:27147–27166, 2023.

  46. [46]

    Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xin-Qiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  47. [47]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025.

  48. [48]

    Universal actions for enhanced embodied foundation models

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025.

  49. [49]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.

  50. [50]

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025.

  51. [51]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.