GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Pith reviewed 2026-05-16 21:24 UTC · model grok-4.3
The pith
GeoPredict augments VLA policies with predictive 3D kinematic trajectories and Gaussian geometry that supervise training but add no decoding cost at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoPredict augments a continuous-action VLA policy with a trajectory-level predictive kinematic module that encodes motion history and outputs multi-step 3D keypoint trajectories, together with a predictive 3D Gaussian geometry module that forecasts workspace structure and refines it along the predicted tracks. These modules supply training-time supervision exclusively via depth-based rendering; at inference the policy uses only lightweight additional query tokens and performs no 3D decoding or reconstruction.
What carries the argument
The trajectory-level predictive kinematic module combined with the track-guided 3D Gaussian geometry module, which together supply depth-rendered supervision signals only during training.
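The split between training-time supervision and inference can be illustrated with a toy sketch. Everything in it is hypothetical: the class `ToyGeoPolicy`, its methods, and the arithmetic are placeholders chosen for illustration and are not taken from the paper. It shows only the control flow described above, where query tokens are present in both modes and the depth-rendering path is reached only when training targets are supplied.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StepOutput:
    action: List[float]
    aux_losses: Optional[Dict[str, float]]  # None at inference


class ToyGeoPolicy:
    """Hypothetical stand-in; names and arithmetic are illustrative only."""

    def __init__(self, num_query_tokens: int = 4):
        self.num_query_tokens = num_query_tokens
        self.decoder_calls = 0  # counts invocations of the rendering path

    def _backbone(self, obs: List[float]) -> List[float]:
        # Query tokens are appended in both modes; here they are plain zeros.
        return obs + [0.0] * self.num_query_tokens

    def _render_depth_loss(self, tokens: List[float], target: List[float]) -> float:
        # Placeholder for a depth-rendering loss; only reachable in training.
        self.decoder_calls += 1
        n = min(len(tokens), len(target))
        return sum(abs(tokens[i] - target[i]) for i in range(n)) / max(n, 1)

    def step(
        self,
        obs: List[float],
        train_targets: Optional[Dict[str, List[float]]] = None,
    ) -> StepOutput:
        tokens = self._backbone(obs)
        action = [sum(tokens) / len(tokens)]
        if train_targets is None:
            # Inference: no 3D decoding or rendering is performed.
            return StepOutput(action=action, aux_losses=None)
        aux = {
            name: self._render_depth_loss(tokens, target)
            for name, target in train_targets.items()
        }
        return StepOutput(action=action, aux_losses=aux)
```

Calling `step(obs)` without targets leaves `decoder_calls` at zero, which is the property the paper's inference-cost claim depends on.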
If this is right
- The policy outperforms strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation benchmarks.
- Gains are largest in geometry-intensive and spatially demanding scenarios.
- Inference cost stays low because no 3D decoding or reconstruction occurs at runtime.
- Only extra query tokens are needed at test time to carry the learned priors.
Where Pith is reading between the lines
- The same training-time rendering supervision could be applied to other action-prediction models that currently lack explicit 3D structure.
- Future work could test whether the predicted trajectories themselves become usable as open-loop plans when closed-loop feedback is unavailable.
- The approach may reduce reliance on dense 3D ground-truth labels by turning predicted geometry into an auxiliary training signal.
Load-bearing premise
The geometric and kinematic predictions learned through depth rendering transfer useful information to the final policy without creating a distribution shift between training and deployment.
What would settle it
Performance on geometry-heavy tasks falls back to baseline levels when the kinematic or Gaussian modules are removed or when their predictions are deliberately corrupted during training.
Original abstract
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GeoPredict, a geometry-aware augmentation to Vision-Language-Action (VLA) models. It adds a trajectory-level module that predicts multi-step 3D keypoint trajectories from motion history and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement. Both modules supply training-time supervision exclusively through depth-based rendering losses; at inference only lightweight query tokens are added to the base policy, with no 3D decoding performed. The central claim is that this yields consistent outperformance over strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation tasks, particularly in geometry-intensive scenarios.
Significance. If the experimental claims are substantiated, the approach would provide a practical route to embedding 3D geometric priors into continuous-action VLA policies without raising inference cost, addressing a recognized limitation of current 2D-centric models in spatially precise manipulation.
major comments (4)
- [Abstract] Abstract: the claim of 'consistent outperformance' is presented without any quantitative metrics, standard deviations, or error bars, preventing assessment of effect size or statistical reliability.
- [Method] Method (training objective): no description is given of how the depth-rendering losses from the kinematic and Gaussian modules are weighted or balanced against the primary policy loss, leaving the training dynamics and potential for auxiliary-signal dominance unexamined.
- [Experiments] Experiments: no ablation studies isolate the contribution of the predictive kinematic module, the 3D Gaussian module, or the depth-rendering supervision itself versus simple capacity increases, so the attribution of gains to geometric priors remains unverified.
- [Experiments] Experiments: no quantitative comparison of train versus test distribution shift is reported for the policy when trained with 3D rendering supervision yet evaluated without 3D decoding, which directly tests the central assumption that the supervision transfers without mismatch penalty.
minor comments (1)
- [Abstract] Abstract: the phrase 'track-guided refinement' is used without a one-sentence definition or pointer to its implementation.
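The second major comment asks how the auxiliary depth-rendering losses are balanced against the policy loss. The paper text available here does not state the objective, so the sketch below assumes the simplest common form, a weighted sum, and reports what share of the total comes from the auxiliary terms. The function name, the weights `lambda_kin` and `lambda_geo`, and their default values are assumptions for illustration only.

```python
from typing import Tuple


def total_loss(
    policy_loss: float,
    kin_depth_loss: float,
    geo_depth_loss: float,
    lambda_kin: float = 0.1,  # assumed weight, not from the paper
    lambda_geo: float = 0.1,  # assumed weight, not from the paper
) -> Tuple[float, float]:
    """Return (total, aux_share) for an assumed weighted-sum objective."""
    if lambda_kin < 0 or lambda_geo < 0:
        raise ValueError("loss weights must be non-negative")
    aux = lambda_kin * kin_depth_loss + lambda_geo * geo_depth_loss
    total = policy_loss + aux
    # Share of the objective contributed by auxiliary supervision.
    aux_share = aux / total if total > 0 else 0.0
    return total, aux_share
```

Logging `aux_share` over training is one inexpensive way to check for the auxiliary-signal dominance the referee raises: a value that drifts toward 1 would suggest the rendering terms are driving the updates.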
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript, including quantitative details, clarifications, and additional experiments.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'consistent outperformance' is presented without any quantitative metrics, standard deviations, or error bars, preventing assessment of effect size or statistical reliability.
Authors: We agree that the abstract should provide quantitative support for the claim. In the revised version, we will include specific success rates (with standard deviations) from the RoboCasa Human-50, LIBERO, and real-world experiments to substantiate the consistent outperformance and allow assessment of effect sizes. revision: yes
-
Referee: [Method] Method (training objective): no description is given of how the depth-rendering losses from the kinematic and Gaussian modules are weighted or balanced against the primary policy loss, leaving the training dynamics and potential for auxiliary-signal dominance unexamined.
Authors: We will expand the method section to explicitly describe the loss weighting. The revised text will specify the balancing coefficients between the kinematic trajectory depth-rendering loss, the 3D Gaussian geometry depth-rendering loss, and the primary policy loss, along with the hyperparameter search procedure used to avoid auxiliary-signal dominance. revision: yes
-
Referee: [Experiments] Experiments: no ablation studies isolate the contribution of the predictive kinematic module, the 3D Gaussian module, or the depth-rendering supervision itself versus simple capacity increases, so the attribution of gains to geometric priors remains unverified.
Authors: We will add a dedicated ablation section in the revision. These studies will systematically disable the kinematic predictor, the 3D Gaussian module, and the depth-rendering supervision, while also comparing against capacity-matched baselines (identical parameter count) to isolate the contribution of the geometric priors. revision: yes
-
Referee: [Experiments] Experiments: no quantitative comparison of train versus test distribution shift is reported for the policy when trained with 3D rendering supervision yet evaluated without 3D decoding, which directly tests the central assumption that the supervision transfers without mismatch penalty.
Authors: We will include new quantitative experiments in the revised manuscript that directly measure the train-test distribution shift. These will compare the policy's behavior under full 3D rendering supervision during training with its behavior at inference, where only the lightweight query tokens are used, providing empirical evidence on the mismatch penalty and testing the transfer assumption. revision: yes
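The ablations promised in the third response can be enumerated as a small leave-one-out plan. This is a sketch under stated assumptions: the component names, the run names, and the `match_param_count` flag are hypothetical labels, not identifiers from the paper or its code.

```python
from typing import Dict, List, Union

# Hypothetical switch names for the three pieces the referee wants isolated.
COMPONENTS = ("kinematic_module", "gaussian_module", "depth_rendering_loss")

RunConfig = Dict[str, Union[str, bool]]


def ablation_runs() -> List[RunConfig]:
    """Leave-one-out plan plus plain and capacity-matched baselines."""
    runs: List[RunConfig] = []
    # Full model: every component enabled.
    runs.append({"name": "full", "match_param_count": False,
                 **{c: True for c in COMPONENTS}})
    # Drop exactly one component at a time.
    for dropped in COMPONENTS:
        runs.append({"name": f"no_{dropped}", "match_param_count": False,
                     **{c: c != dropped for c in COMPONENTS}})
    # Base VLA without any of the additions.
    runs.append({"name": "base_vla", "match_param_count": False,
                 **{c: False for c in COMPONENTS}})
    # Same base policy, padded to the full model's parameter count.
    runs.append({"name": "base_vla_capacity_matched", "match_param_count": True,
                 **{c: False for c in COMPONENTS}})
    return runs
```

The capacity-matched baseline is the run that separates "geometric priors help" from "more parameters help". Some leave-one-out settings may be degenerate in practice, for example if a module has no training signal once the rendering loss is removed.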
Circularity Check
No circularity: method is architectural augmentation with external supervision, no derivations or self-referential equations
Full rationale
The paper describes GeoPredict as an architectural addition of trajectory prediction and 3D Gaussian modules that supply training-time depth-rendering losses to a VLA policy. Inference uses only added query tokens. No equations, closed-form derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central claim rests on empirical outperformance on RoboCasa, LIBERO, and real-world tasks rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for any derivation because none exists. The approach is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Depth-based rendering of predicted 3D keypoints and Gaussians provides a useful supervisory signal for 2D-centric VLA policies.
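To make the axiom concrete, the sketch below renders a depth map from 3D points and compares it with a target depth map. It deliberately swaps in a much simpler technique than the paper's: a pinhole projection with a nearest-depth z-buffer rather than Gaussian splatting, and it is not differentiable, so it could not be used for training as written. The function names and camera parameters are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

DepthMap = List[List[float]]


def render_point_depth(
    points: Sequence[Tuple[float, float, float]],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> DepthMap:
    """Project camera-frame points with a pinhole model; keep nearest depth."""
    depth = [[math.inf] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            depth[v][u] = min(depth[v][u], z)
    return depth


def masked_l1_depth_loss(pred: DepthMap, target: DepthMap) -> float:
    """Mean absolute depth error over pixels that are covered in both maps."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if math.isfinite(p) and math.isfinite(t):
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0
```

A real pipeline would replace `render_point_depth` with a differentiable rasterizer so that the depth error can propagate back to the predicted geometry.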
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories ... predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement ... supervised through future depth-map rendering
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
block-wise causal attention ... predictive modules serve exclusively as training-time supervision
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...