Robotic Manipulation is Vision-to-Geometry Mapping (f(v) → G): Vision-Geometry Backbones over Language and Video Models
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
Robotic manipulation is a vision-to-geometry mapping problem best solved by pretrained 3D world models rather than language or video backbones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At its core, robotic manipulation is a problem of vision-to-geometry mapping. The Vision-Geometry-Action model replaces conventional language or video backbones with a pretrained native 3D world model to translate visual inputs directly into physical actions, supported by Progressive Volumetric Modulation and joint training for geometric consistency.
What carries the argument
Vision-Geometry-Action (VGA) model that conditions action generation on pretrained 3D world model representations to create a direct vision-to-geometry mapping.
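The conditioning scheme described above can be sketched in a few lines: a frozen, pretrained 3D backbone maps the observation to geometric features, and an action head maps those features to motor commands. Every name, shape, and value below is an illustrative assumption for exposition, not the paper's actual interface.

```python
# Hypothetical sketch of a VGA-style pipeline: f(v) -> geometric
# features -> action. Stubs stand in for the pretrained 3D world model
# and the action head; shapes and values are invented.
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    rgb: List[float]  # flattened pixels, a stand-in for a camera image


def world_model_3d(obs: Observation) -> List[float]:
    """Stub for a frozen pretrained 3D backbone: image -> geometric features."""
    # A real backbone would emit volumetric or point features; here we
    # just pool the input into a fixed-size vector.
    n = 8
    return [sum(obs.rgb[i::n]) for i in range(n)]


def action_head(geo_feats: List[float], dof: int = 7) -> List[float]:
    """Stub action head: geometric features -> a dof-dimensional command."""
    s = sum(geo_feats) / max(len(geo_feats), 1)
    return [s] * dof


def vga_policy(obs: Observation) -> List[float]:
    """Condition action generation directly on 3D features, never on text."""
    return action_head(world_model_3d(obs))


action = vga_policy(Observation(rgb=[0.1] * 64))
```

The point of the sketch is the data flow: no language tokens or 2D video priors sit between the visual input and the action output.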
If this is right
- VGA outperforms top-tier VLA baselines including π0.5 and GeoVLA on simulation benchmarks for precise manipulation.
- VGA achieves stronger zero-shot generalization to unseen viewpoints in real-world robot deployments than π0.5.
- Direct operation on native 3D representations, rather than translation through language or 2D priors, supports more generalizable physical intelligence.
Where Pith is reading between the lines
- Robotics foundation models may shift priority toward 3D geometry sources over multimodal language training for spatial tasks.
- The same 3D backbone approach could apply to navigation or assembly problems that also hinge on precise spatial relations.
- Reduced dependence on language intermediaries might simplify training pipelines when only geometric accuracy matters.
Load-bearing premise
Native 3D geometric representations align better with the spatial requirements of physical actions than representations shaped by language semantics or 2D video priors.
What would settle it
A controlled test in which a vision-language model achieves equal or higher success rates than VGA on precise manipulation tasks under novel real-world viewpoints would disprove the claimed superiority.
Figures
Original abstract
At its core, robotic manipulation is a problem of vision-to-geometry mapping ($f(v) \rightarrow G$). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. Conventional VLA and video-predictive models rely on backbones pretrained on large-scale 2D image-text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top-tier VLA baselines including $\pi_{0.5}$ and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero-shot generalization to unseen viewpoints in real-world deployments, consistently outperforming $\pi_{0.5}$. These results highlight that operating on native 3D representations, rather than translating through language or 2D video priors, is a highly promising direction for achieving generalizable physical intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that robotic manipulation is fundamentally a vision-to-geometry mapping problem (f(v) → G) and that native 3D representations should replace vision-language or video backbones for generalizable control. It introduces the Vision-Geometry-Action (VGA) model, which conditions actions on a pretrained 3D world model, adds a Progressive Volumetric Modulation module, and uses joint training; the abstract claims this yields superior performance on simulation benchmarks versus π0.5 and GeoVLA, along with zero-shot real-world generalization to unseen viewpoints.
Significance. If the central claim holds and the performance edge is attributable to native 3D geometry rather than ancillary design choices, the work could shift robotic foundation models toward explicit 3D world models, improving precision and viewpoint invariance in manipulation tasks. The absence of ablations, error bars, and dataset details in the reported results, however, leaves the magnitude of this potential contribution uncertain.
major comments (3)
- [Abstract] The claim that VGA 'outperforms top-tier VLA baselines including π0.5 and GeoVLA' and exhibits 'remarkable zero-shot generalization' cannot be evaluated because no error bars, dataset sizes, number of trials, or statistical tests are provided; without these, it is impossible to determine whether the reported gains are robust or driven by the 3D backbone itself.
- [Abstract] The abstract and experimental description do not report ablations that isolate the contribution of the pretrained 3D world model from the Progressive Volumetric Modulation module, the joint training strategy, or differences in the action head and data pipeline. Consequently, the central premise (that native 3D representations are the load-bearing factor for the observed gains over π0.5 and GeoVLA) remains untested.
- [Abstract] The zero-shot viewpoint generalization result is presented as evidence for the superiority of 3D geometry, yet no quantitative metrics (e.g., success-rate deltas, viewpoint ranges, or failure modes) or comparison conditions (e.g., a 2D backbone with viewpoint augmentation) are supplied, weakening the link between the architectural choice and the claimed generalization.
minor comments (2)
- [Abstract] The abstract introduces the notation f(v) → G without defining the symbol G or the precise geometric output space; a short clarification of the target representation would improve readability.
- [Abstract] The phrase 'seamless vision-to-geometry mapping' is used repeatedly; a concrete description of how the 3D world model outputs are aligned with the action space (e.g., via specific layers or loss terms) would help readers assess the mapping.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the concerns identify gaps in the current presentation, we have revised the manuscript to incorporate additional details, experiments, and clarifications.
Point-by-point responses
- Referee: [Abstract] The claim that VGA 'outperforms top-tier VLA baselines including π0.5 and GeoVLA' and exhibits 'remarkable zero-shot generalization' cannot be evaluated because no error bars, dataset sizes, number of trials, or statistical tests are provided; without these, it is impossible to determine whether the reported gains are robust or driven by the 3D backbone itself.
Authors: We agree that the absence of these statistical details in the abstract limits the ability to assess robustness. In the revised manuscript we have expanded the abstract and the experimental results section to report error bars across all benchmarks, the exact number of trials and dataset sizes used, and the results of statistical significance tests. These additions make it possible to evaluate whether the observed improvements are reliable and attributable to the architectural choices. revision: yes
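The kind of significance testing the revision promises can be illustrated with a two-proportion z-test over per-trial success counts. The trial counts below are invented placeholders, not the paper's numbers, and the test itself is one common choice among several.

```python
# Minimal sketch: compare success rates of two policies with a
# two-proportion z-test (normal approximation). Counts are illustrative.
import math


def two_proportion_z(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Return the z statistic for H0: p_a == p_b (pooled standard error)."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    p = (succ_a + succ_b) / (n_a + n_b)  # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


# e.g. hypothetical 86/100 successes for VGA vs. 72/100 for a baseline
z = two_proportion_z(86, 100, 72, 100)
significant = abs(z) > 1.96  # ~95% two-sided threshold
```

With large per-task trial counts this normal approximation is adequate; for small counts an exact test would be the safer reporting choice.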
- Referee: [Abstract] The abstract and experimental description do not report ablations that isolate the contribution of the pretrained 3D world model from the Progressive Volumetric Modulation module, the joint training strategy, or differences in the action head and data pipeline. Consequently, the central premise (that native 3D representations are the load-bearing factor for the observed gains over π0.5 and GeoVLA) remains untested.
Authors: We acknowledge that isolating the contribution of the pretrained 3D world model is essential to support the central claim. We have added a dedicated ablation study in the revised experimental section that compares the full VGA model against controlled variants: one without the pretrained 3D backbone (replaced by a 2D or language backbone), one without Progressive Volumetric Modulation, and one using separate rather than joint training. The new results show that removing the native 3D representations produces the largest performance drop relative to the VLA baselines, while the other components provide complementary gains. revision: yes
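The promised ablation can be summarized by ranking components by the performance drop their removal causes relative to the full model. The success rates below are invented placeholders chosen only to mirror the ordering the authors describe.

```python
# Sketch of an ablation summary: full model vs. controlled variants,
# ranked by the drop each removal causes. All numbers are placeholders.
full_model = 0.86  # hypothetical overall success rate of full VGA
variants = {
    "no 3D backbone (2D/language instead)": 0.61,
    "no Progressive Volumetric Modulation": 0.79,
    "separate (not joint) training": 0.81,
}
drops = {name: round(full_model - rate, 2) for name, rate in variants.items()}
largest = max(drops, key=drops.get)  # component whose removal hurts most
```

Reporting the table in this form makes the load-bearing claim falsifiable: if the backbone swap were not the largest drop, the central premise would fail.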
- Referee: [Abstract] The zero-shot viewpoint generalization result is presented as evidence for the superiority of 3D geometry, yet no quantitative metrics (e.g., success-rate deltas, viewpoint ranges, or failure modes) or comparison conditions (e.g., a 2D backbone with viewpoint augmentation) are supplied, weakening the link between the architectural choice and the claimed generalization.
Authors: We agree that stronger quantitative support is needed to link the 3D geometry backbone to viewpoint-invariant generalization. In the revision we have augmented the real-world evaluation section with success-rate deltas across multiple unseen viewpoint ranges, a breakdown of failure modes, and a direct comparison against a 2D backbone baseline trained with explicit viewpoint augmentation. These additions provide concrete evidence that the native 3D representations confer advantages beyond what 2D augmentation alone can achieve. revision: yes
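The viewpoint report described above reduces to success-rate deltas bucketed by camera-offset range. The bins and rates below are illustrative stand-ins; the diagnostic is whether the delta over the 2D baseline widens as the viewpoint shift grows.

```python
# Sketch: success-rate deltas (VGA minus 2D-augmented baseline) per
# camera-offset bin. Bins and rates are invented illustrations.
bins = ["0-15 deg", "15-30 deg", "30-45 deg"]
vga = {"0-15 deg": 0.84, "15-30 deg": 0.78, "30-45 deg": 0.65}
baseline = {"0-15 deg": 0.80, "15-30 deg": 0.58, "30-45 deg": 0.31}
deltas = {b: round(vga[b] - baseline[b], 2) for b in bins}

# A monotonically widening delta at larger offsets would support the
# viewpoint-invariance claim; a flat delta would not.
widening = all(deltas[bins[i]] <= deltas[bins[i + 1]] for i in range(len(bins) - 1))
```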
Circularity Check
No circularity: conceptual premise with independent experimental validation
Full rationale
The paper advances a first-principles-style argument that manipulation reduces to vision-to-geometry mapping and therefore favors native 3D backbones over VLA or video models. This premise is stated directly in the abstract and introduction without deriving it from fitted parameters, self-referential equations, or prior self-citations that themselves depend on the target result. VGA is introduced as a concrete architecture (pretrained 3D world model plus Progressive Volumetric Modulation and joint training) whose performance is then measured against baselines. No step equates a 'prediction' to its own training inputs by construction, nor does any load-bearing uniqueness claim collapse to an unverified self-citation. The derivation chain therefore remains self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained 3D world models supply native geometric representations that align directly with physical manipulation requirements.
invented entities (2)
- Vision-Geometry-Action (VGA) model: no independent evidence
- Progressive Volumetric Modulation module: no independent evidence