arxiv: 2512.16811 · v2 · submitted 2025-12-18 · 💻 cs.CV · cs.RO

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang

Pith reviewed 2026-05-16 21:24 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords Vision-Language-Action · Robotic Manipulation · 3D Geometry · Predictive Kinematics · Gaussian Splatting · Depth Rendering · Training-Time Supervision

The pith

GeoPredict augments VLA policies with predictive 3D kinematic trajectories and Gaussian geometry that supervise training but add no decoding cost at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language-action models for robots are mostly reactive and limited to 2D image features, which makes them unreliable when tasks demand accurate spatial reasoning in three dimensions. GeoPredict adds two predictive modules during training: one that forecasts multi-step 3D keypoint paths of the robot arm from motion history, and another that predicts future workspace geometry using 3D Gaussians refined along those paths. Both modules supply supervision only through rendered depth images and are never invoked at test time. The resulting policy therefore receives richer geometric signals while adding only a few query tokens to a standard VLA at inference. Experiments show the largest gains on manipulation benchmarks that stress precise 3D positioning and geometry.

Core claim

GeoPredict augments a continuous-action VLA policy with a trajectory-level predictive kinematic module that encodes motion history and outputs multi-step 3D keypoint trajectories, together with a predictive 3D Gaussian geometry module that forecasts workspace structure and refines it along the predicted tracks. These modules supply training-time supervision exclusively via depth-based rendering; at inference the policy uses only lightweight additional query tokens and performs no 3D decoding or reconstruction.
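One way to picture the training-only role of the two modules is as auxiliary terms added to the policy loss, terms that simply do not exist at inference. The sketch below is a minimal illustration under stated assumptions: the function names, the dictionary keys `kinematic` and `geometry`, and every numeric value are hypothetical, since the text reviewed here does not report how the losses are balanced.

```python
def training_objective(policy_loss, aux_losses, aux_weights):
    """Return the policy loss plus weighted auxiliary losses (training only)."""
    total = policy_loss
    for name, value in aux_losses.items():
        # Each auxiliary term is scaled by its own coefficient.
        total += aux_weights[name] * value
    return total


def auxiliary_share(policy_loss, aux_losses, aux_weights):
    """Fraction of the total objective contributed by the auxiliary terms."""
    total = training_objective(policy_loss, aux_losses, aux_weights)
    return (total - policy_loss) / total if total else 0.0


# Hypothetical values for illustration; not taken from the paper.
aux_losses = {"kinematic": 0.5, "geometry": 2.0}
aux_weights = {"kinematic": 0.2, "geometry": 0.1}

total = training_objective(1.0, aux_losses, aux_weights)
share = auxiliary_share(1.0, aux_losses, aux_weights)
print(f"total={total:.3f}, auxiliary share={share:.1%}")
```

Tracking a quantity like `auxiliary_share` during training is one simple way to notice when the auxiliary signals start to dominate the policy objective.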

What carries the argument

The trajectory-level predictive kinematic module combined with the track-guided 3D Gaussian geometry module, which together supply depth-rendered supervision signals only during training.

If this is right

The policy outperforms strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation benchmarks.
Gains are largest in geometry-intensive and spatially demanding scenarios.
Inference cost stays low because no 3D decoding or reconstruction occurs at runtime.
Only extra query tokens are needed at test time to carry the learned priors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training-time rendering supervision could be applied to other action-prediction models that currently lack explicit 3D structure.
Future work could test whether the predicted trajectories themselves become usable as open-loop plans when closed-loop feedback is unavailable.
The approach may reduce reliance on dense 3D ground-truth labels by turning predicted geometry into an auxiliary training signal.

Load-bearing premise

The geometric and kinematic predictions learned through depth rendering transfer useful information to the final policy without creating a distribution shift between training and deployment.

What would settle it

Performance on geometry-heavy tasks falls back to baseline levels when the kinematic or Gaussian modules are removed or when their predictions are deliberately corrupted during training.

Figures

Figures reproduced from arXiv: 2512.16811 by Boyao Han, Chen Shi, Jingjing Qian, Lei Xiao, Li Jiang, Long Yang, Shaoshuai Shi.

**Figure 1.** Overview of GeoPredict. Given an instruction, multi-view images and motion history encoded by the Track Encoder, a central LLM Transformer learns two main tasks. First, it predicts multi-timestep 3D keypoint trajectories using learnable Future Track Query. Second, it forecasts future workspace geometry as a predictive 3D Gaussian by processing a 3D Spatial Query through a Voxel Decoder. A track-guided refi…

**Figure 2.** Block-wise Causal Attention Mechanism. For simplicity, the detailed attention pathways from the 3D Token and State Token blocks to other blocks are not fully drawn. … tokens), (3) 3D Query tokens (future track queries and spatial queries), (4) State Token (proprioceptive token) and (5) Action Noise tokens (used by flow matching). Attention is fully bidirectional within each block, enabling rich intrablock…

**Figure 3.** Real-world Evaluation Suite. These settings aim to evaluate the model’s spatial generalization, geometry generalization and robustness to distractors. Each column represents different trials of the same task. Baselines. Our primary baseline is our VLA backbone, π0 [3], trained without our proposed predictive 3D modules. This comparison directly isolates the contribution of our geometry-aware predictive …

**Figure 4.** Qualitative visualization of our predictive 3DGS geometry module, comparing the predicted future depths at various timesteps (t + 1, t + 10, t + 20). While the initial Gaussians (G_init) capture only the coarse scene layout, the refined Gaussians (G_total) exhibit significantly sharper geometric details, particularly surrounding the robotic arm. This visually confirms that our refinement me…
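The block-wise attention pattern described in the Figure 2 caption can be sketched as a boolean mask. This is a hedged illustration, not the paper's implementation: the caption is truncated, so the block sizes and the cross-block rule used here (each block attends to itself and to every earlier block) are assumptions. Only the fully bidirectional attention within a block comes from the caption.

```python
def blockwise_causal_mask(block_sizes):
    """Build a mask where mask[i][j] is True if token i may attend to token j."""
    # Map every token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # Assumed rule: a token sees its own block (bidirectional) and earlier blocks.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical sizes for five blocks, in the order the caption numbers them.
sizes = [2, 2, 3, 1, 2]
mask = blockwise_causal_mask(sizes)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Printing the mask shows square all-`x` blocks along the diagonal (bidirectional attention inside a block) and nothing above them, which is the causal part of the assumed rule.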

Original abstract

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GeoPredict uses training-only 3D predictive modules for VLA supervision, but the abstract lacks the numbers and ablations to back up the performance claims.

The one thing to know about this paper is that it proposes adding predictive 3D kinematics and Gaussian geometry modules to a VLA model, but only uses them to generate supervision signals during training through depth rendering. Inference stays lightweight with just added query tokens. This design is new in how it pairs trajectory-level prediction with track-guided refinement exclusively for supervision, and it does a decent job of addressing the 2D-centric limitation in existing VLAs without increasing inference cost. The focus on geometry-intensive scenarios is well-motivated for manipulation tasks.

The soft spots are in the lack of concrete evidence. The abstract claims consistent outperformance on RoboCasa Human-50, LIBERO, and real-world tasks but provides no quantitative results, ablations, or details on loss balancing. This makes it hard to verify if the geometric priors are truly internalized by the policy or if gains come from elsewhere. The stress-test concern about distribution shift holds up based on what's shown, since no analysis of policy changes or train-test mismatch is described.

This paper is for VLA researchers looking for incremental ways to inject 3D reasoning into policies. A reader interested in practical robotics improvements might find the architecture useful as a starting point, though the current presentation leaves the central claims unverified. I would send this to peer review because the core idea is sound and the problem it targets is relevant, even though revisions would be needed to add the missing experimental rigor.

Referee Report

4 major / 1 minor

Summary. The manuscript proposes GeoPredict, a geometry-aware augmentation to Vision-Language-Action (VLA) models. It adds a trajectory-level module that predicts multi-step 3D keypoint trajectories from motion history and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement. Both modules supply training-time supervision exclusively through depth-based rendering losses; at inference only lightweight query tokens are added to the base policy, with no 3D decoding performed. The central claim is that this yields consistent outperformance over strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation tasks, particularly in geometry-intensive scenarios.

Significance. If the experimental claims are substantiated, the approach would provide a practical route to embedding 3D geometric priors into continuous-action VLA policies without raising inference cost, addressing a recognized limitation of current 2D-centric models in spatially precise manipulation.

major comments (4)

[Abstract] Abstract: the claim of 'consistent outperformance' is presented without any quantitative metrics, standard deviations, or error bars, preventing assessment of effect size or statistical reliability.
[Method] Method (training objective): no description is given of how the depth-rendering losses from the kinematic and Gaussian modules are weighted or balanced against the primary policy loss, leaving the training dynamics and potential for auxiliary-signal dominance unexamined.
[Experiments] Experiments: no ablation studies isolate the contribution of the predictive kinematic module, the 3D Gaussian module, or the depth-rendering supervision itself versus simple capacity increases, so the attribution of gains to geometric priors remains unverified.
[Experiments] Experiments: no quantitative comparison of train versus test distribution shift is reported for the policy when trained with 3D rendering supervision yet evaluated without 3D decoding, which directly tests the central assumption that the supervision transfers without mismatch penalty.

minor comments (1)

[Abstract] Abstract: the phrase 'track-guided refinement' is used without a one-sentence definition or pointer to its implementation.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript, including quantitative details, clarifications, and additional experiments.

Referee: [Abstract] Abstract: the claim of 'consistent outperformance' is presented without any quantitative metrics, standard deviations, or error bars, preventing assessment of effect size or statistical reliability.

Authors: We agree that the abstract should provide quantitative support for the claim. In the revised version, we will include specific success rates (with standard deviations) from the RoboCasa Human-50, LIBERO, and real-world experiments to substantiate the consistent outperformance and allow assessment of effect sizes.
revision: yes

Referee: [Method] Method (training objective): no description is given of how the depth-rendering losses from the kinematic and Gaussian modules are weighted or balanced against the primary policy loss, leaving the training dynamics and potential for auxiliary-signal dominance unexamined.

Authors: We will expand the method section to explicitly describe the loss weighting. The revised text will specify the balancing coefficients between the kinematic trajectory depth-rendering loss, the 3D Gaussian geometry depth-rendering loss, and the primary policy loss, along with the hyperparameter search procedure used to avoid auxiliary-signal dominance.
revision: yes

Referee: [Experiments] Experiments: no ablation studies isolate the contribution of the predictive kinematic module, the 3D Gaussian module, or the depth-rendering supervision itself versus simple capacity increases, so the attribution of gains to geometric priors remains unverified.

Authors: We will add a dedicated ablation section in the revision. These studies will systematically disable the kinematic predictor, the 3D Gaussian module, and the depth-rendering supervision, while also comparing against capacity-matched baselines (identical parameter count) to isolate the contribution of the geometric priors.
revision: yes

Referee: [Experiments] Experiments: no quantitative comparison of train versus test distribution shift is reported for the policy when trained with 3D rendering supervision yet evaluated without 3D decoding, which directly tests the central assumption that the supervision transfers without mismatch penalty.

Authors: We will include new quantitative experiments in the revised manuscript that directly measure the train-test distribution shift. These will compare policy performance when trained with full 3D rendering supervision versus inference using only the lightweight query tokens, providing empirical evidence on the mismatch penalty and validating the transfer assumption.
revision: yes
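The reporting promised in the first response, success rates with standard deviations, comes down to a small computation over repeated evaluation runs. A minimal sketch, assuming one success rate per random seed; the function name and the numbers are hypothetical and are not results from the paper.

```python
import statistics


def summarize_success(rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    mean = statistics.mean(rates)
    # The sample standard deviation is only defined for two or more runs.
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, std


# Hypothetical per-seed success rates for a single task.
mean, std = summarize_success([0.5, 0.75, 1.0])
print(f"success rate: {mean:.2f} +/- {std:.2f}")
```

Reporting the spread alongside the mean is what lets a reader judge whether a gap between two policies exceeds seed-to-seed variation.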

Circularity Check

0 steps flagged

No circularity: method is architectural augmentation with external supervision, no derivations or self-referential equations

The paper describes GeoPredict as an architectural addition of trajectory prediction and 3D Gaussian modules that supply training-time depth-rendering losses to a VLA policy. Inference uses only added query tokens. No equations, closed-form derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central claim rests on empirical outperformance on RoboCasa, LIBERO, and real-world tasks rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for any derivation because none exists. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes standard VLA training objectives can be augmented with auxiliary depth-rendering losses without destabilizing policy learning; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Depth-based rendering of predicted 3D keypoints and Gaussians provides a useful supervisory signal for 2D-centric VLA policies.
The abstract states that these modules serve exclusively as training-time supervision through depth-based rendering.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories ... predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement ... supervised through future depth-map rendering

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: block-wise causal attention ... predictive modules serve exclusively as training-time supervision

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO · 2026-05 · unverdicted · novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
cs.RO · 2026-04 · unverdicted · novelty 5.0

STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 2 Pith papers · 12 internal anchors

[1] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. CoRR, 2024.
[2] Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. In NeurIPS 2023 Workshop on Goal-Conditioned Reinforcement Learning, 2023.
[3] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visio...
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[5] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
[6] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.
[7] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.
[8] Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025.
[9] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
[10] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023.
[11] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
[12] Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, and Tim GJ Rudner. Pre-trained text-to-image diffusion models are versatile representation learners for control. Advances in Neural Information Processing Systems, 37:74182–74210, 2024.
[13] Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2020.
[14] Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. Advances in Neural Information Processing Systems, 35:26091–26104, 2022.
[15] Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
[16] Tao Hu, Shu Liu, Yilun Chen, Tiancheng Shen, and Jiaya Jia. Efficientnerf efficient neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12902–12911, 2022.
[17] Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. In Forty-second International Conference on Machine Learning, 2025.
[18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023.
[19] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.
[20] Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. BridgeVLA: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961, 2025.
[21] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. CoRR, 2024.
[22] Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025.
[23] Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In ICLR, 2024.
[24] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[25] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
[26] Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, and Shuran Song. Geometry-aware 4d video generation for robot manipulation. arXiv preprint arXiv:2507.01099, 2025.
[27] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998.
[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
[29] Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9263–9274, 2025.
[30] Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022.
[31] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[32] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 41(4):1–15, 2022.
[33] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.
[34] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660,
[35] Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, and Shanghang Zhang. Wristworld: Generating wrist-views via 4d world models for robotic manipulation. arXiv preprint arXiv:2510.07313,
[36] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
[37] Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats. Advances in Neural Information Processing Systems, 37:97670–97698, 2024.
[38] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[39] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024.
[40] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In ICLR, 2024.
[41]

3d shapenets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Lin- guang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015. 2

[42]

Point-nerf: Point-based neural radiance fields

Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022. 2

[43]

Street gaussians: Modeling dynamic urban scenes with gaussian splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024. 2

[44]

4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration

Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration. arXiv preprint arXiv:2506.22242, 2025. 7

[45]

Storm: Efficient stochastic transformer based world models for reinforcement learning

Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems, 36:27147–27166, 2023. 2

[46]

Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xin-Qiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 2

[47]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025. 1

[48]

Universal actions for enhanced embodied foundation models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025. 1, 2

[49]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024. 2, 7

[50]

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025. 1, 2

[51]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 1
