Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Pith reviewed 2026-05-10 15:40 UTC · model grok-4.3
The pith
UniSplat learns unified 3D representations from unposed multi-view images by combining dual masking for geometry, coarse-to-fine splatting, and pose-conditioned recalibration for consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors state that UniSplat produces unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks. The framework is built from three parts: a dual-masking strategy that masks both encoder and decoder tokens and targets decoder masks at geometry-rich regions; a coarse-to-fine Gaussian splatting strategy that progressively refines the radiance field; and a pose-conditioned recalibration mechanism that re-projects predicted 3D point and semantic maps into the image plane using estimated camera parameters and aligns them with RGB and semantic predictions.
What carries the argument
The UniSplat feed-forward framework, with three components: the dual-masking strategy for geometry induction, the coarse-to-fine Gaussian splatting strategy for appearance-semantics consistency, and the pose-conditioned recalibration mechanism that re-projects and aligns the outputs of multiple heads.
If this is right
- Unified 3D representations can be obtained in a single feed-forward pass without requiring known camera poses.
- Geometry induction is strengthened even when input views are sparse and unposed.
- Appearance-semantics inconsistencies are reduced by the progressive refinement of the radiance field (see the sketch after this list).
- Cross-task consistency between geometry and semantics is maintained through re-projection alignment.
- The resulting representations support generalization to scene understanding and embodied AI tasks.
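The progressive refinement referenced above can be made concrete. Below is a minimal sketch, assuming hypothetical `predict_residuals` and `render_gaussians` callables and per-pixel Gaussian parameters stored as a dict of tensors; the paper's actual staging and parameterization are not specified in the abstract.

```python
# A minimal sketch of coarse-to-fine Gaussian refinement. Both
# `predict_residuals(feat, prev)` and `render_gaussians(gaussians)` are
# hypothetical callables; Gaussian parameters are a dict of (B, C_k, h, w)
# tensors (means, scales, rotations, opacities, colors, semantics).
import torch
import torch.nn.functional as F

def coarse_to_fine_refine(features, stages, predict_residuals, render_gaussians):
    """features: (B, C, H, W) image features; stages: list of (h, w) resolutions."""
    gaussians = None
    renders = []
    for h, w in stages:
        feat = F.interpolate(features, size=(h, w), mode="bilinear",
                             align_corners=False)
        if gaussians is None:
            # Coarse stage: predict initial per-pixel Gaussian parameters
            # from features alone.
            gaussians = predict_residuals(feat, prev=None)
        else:
            # Fine stages: upsample previous parameters and add residuals, so
            # appearance and semantics are refined around the coarse geometry.
            prev = {k: F.interpolate(v, size=(h, w), mode="bilinear",
                                     align_corners=False)
                    for k, v in gaussians.items()}
            delta = predict_residuals(feat, prev=prev)
            gaussians = {k: prev[k] + delta[k] for k in prev}
        renders.append(render_gaussians(gaussians))
    return gaussians, renders
```

The point the sketch isolates: later stages only add residuals on top of upsampled coarse parameters, so appearance and semantic detail are refined without re-deriving geometry from scratch.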
Where Pith is reading between the lines
- The approach could reduce reliance on separate structure-from-motion pipelines in practical 3D capture systems.
- Extending the recalibration step to sequential video frames might allow learning from moving cameras without explicit tracking.
- The same consistency mechanism could be tested on outdoor or large-scale scenes to check robustness beyond indoor benchmarks.
Load-bearing premise
The pose-conditioned recalibration mechanism successfully enforces geometric-semantic consistency by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters.
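What such a re-projection and alignment step could look like, as a minimal sketch: pinhole cameras with estimated intrinsics `K` and world-to-camera extrinsics `(R, t)` are assumed, along with a hypothetical per-point semantic head; the paper's exact parameterization and loss are not given in the abstract.

```python
# Minimal sketch of pose-conditioned re-projection and semantic alignment.
# K: (B, 3, 3) estimated intrinsics; R: (B, 3, 3), t: (B, 3) world-to-camera.
import torch
import torch.nn.functional as F

def reproject_points(points_world, K, R, t):
    """points_world: (B, H, W, 3) predicted 3D point map.
    Returns a sampling grid (B, H, W, 2) in grid_sample's [-1, 1] convention."""
    B, H, W, _ = points_world.shape
    pts = points_world.reshape(B, -1, 3)                      # (B, N, 3)
    cam = torch.einsum("bij,bnj->bni", R, pts) + t[:, None]   # world -> camera
    uv = torch.einsum("bij,bnj->bni", K, cam)                 # camera -> pixels
    uv = uv[..., :2] / uv[..., 2:].clamp(min=1e-6)            # perspective divide
    uv = uv / uv.new_tensor([W - 1.0, H - 1.0]) * 2 - 1       # normalize to [-1, 1]
    return uv.reshape(B, H, W, 2)

def cross_view_semantic_loss(points_world, point_sem, sem_2d, K, R, t):
    """point_sem: (B, C, H, W) semantics attached to the 3D points;
    sem_2d: (B, C, H, W) independent 2D semantic prediction for the target view."""
    grid = reproject_points(points_world, K, R, t)
    sampled = F.grid_sample(sem_2d, grid, align_corners=True)
    return F.mse_loss(sampled, point_sem)                     # alignment penalty
```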
What would settle it
A controlled test on a multi-view dataset with known ground-truth poses: if the re-projected semantic maps diverge from the independently predicted semantic maps, or if geometry quality collapses under sparse unposed inputs, the consistency and robustness claims are falsified.
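Reusing the imports and `reproject_points` from the sketch above, such a controlled test could look like the following; the model interface, output names, and divergence threshold are illustrative assumptions, not the paper's protocol.

```python
# Illustrative falsification test: with ground-truth poses, measure how far
# re-projected semantics diverge from the independently predicted 2D semantics.
@torch.no_grad()
def settle_test(model, views, gt_K, gt_R, gt_t, tol=0.1):
    out = model(views)  # hypothetical dict: {"points", "point_sem", "sem_2d"}
    grid = reproject_points(out["points"], gt_K, gt_R, gt_t)
    reprojected = F.grid_sample(out["sem_2d"], grid, align_corners=True)
    divergence = F.mse_loss(reprojected, out["point_sem"]).item()
    # Large divergence under known-good poses would count against the
    # consistency claim; `tol` is an arbitrary illustrative threshold.
    return divergence, divergence > tol
```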
Original abstract
Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs. Second, we develop a coarse-to-fine Gaussian splatting strategy that reduces appearance-semantics inconsistencies by progressively refining the radiance field. Finally, to enforce geometric-semantic consistency, we introduce a pose-conditioned recalibration mechanism that interrelates the outputs of multiple heads by re-projecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency, thereby resolving geometry-semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.
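As a reading aid for the first component, here is a minimal sketch of a dual-masking scheme in the spirit the abstract describes, assuming patch tokens of shape (B, N, D) and a hypothetical per-patch geometry-richness score (the abstract does not specify how geometry-rich regions are identified); the masking ratios are illustrative.

```python
# Minimal sketch of dual masking: random encoder masking plus decoder targets
# biased toward geometry-rich patches. All ratios and scoring are assumptions.
import torch

def dual_mask(tokens, geometry_score, enc_ratio=0.5, dec_ratio=0.25):
    """tokens: (B, N, D) patch tokens; geometry_score: (B, N) per-patch score.
    Returns indices of visible encoder tokens and of decoder target tokens."""
    B, N, _ = tokens.shape
    # Encoder mask: uniformly random, as in standard masked autoencoding.
    order = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    n_visible = int(N * (1 - enc_ratio))
    enc_visible = order[:, :n_visible]
    # Decoder mask: biased toward geometry-rich patches, so reconstruction
    # targets concentrate on structurally informative regions.
    n_targets = int(N * dec_ratio)
    dec_targets = geometry_score.topk(n_targets, dim=1).indices
    return enc_visible, dec_targets
```

Visible encoder tokens would feed the encoder as in a standard masked autoencoder, while reconstruction losses would be computed only at the geometry-biased decoder targets.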
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UniSplat, a feed-forward framework for learning unified 3D representations from unposed multi-view images. It consists of three components: a dual-masking strategy that masks encoder and decoder tokens to strengthen geometry induction; a coarse-to-fine Gaussian splatting approach that progressively refines the radiance field to reduce appearance-semantics inconsistencies; and a pose-conditioned recalibration mechanism that re-projects predicted 3D point and semantic maps into the image plane using estimated camera parameters and aligns them with RGB and semantic predictions for cross-task consistency.
Significance. If the components successfully deliver robust unified representations that generalize across tasks without relying on posed inputs, the work could establish a perceptual foundation for spatial intelligence in scene understanding and embodied AI. The self-supervised unification of geometry, appearance, and semantics from sparse unposed views addresses a relevant gap, though the absence of supporting evidence limits assessment of its practical impact.
Major comments (2)
- [Pose-conditioned recalibration mechanism] The pose-conditioned recalibration mechanism (described in the abstract) re-projects 3D point and semantic maps using camera parameters that must be estimated by the model itself, since inputs are unposed. This setup risks degenerate solutions in which pose adjustments compensate for errors in the 3D predictions rather than enforcing genuine geometric-semantic consistency; the dual-masking and coarse-to-fine components do not break this coupling, leaving the central claim of robustness under unposed sparse-view inputs vulnerable.
- [Abstract and throughout] The manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons to prior self-supervised methods. This absence is load-bearing for the claims of robustness, generalization across tasks, and resolution of weak geometry/inconsistencies, as stated in the abstract.
Minor comments (1)
- [Abstract] The abstract consists of a single extended paragraph; splitting it would improve readability without altering content.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions where appropriate.
Point-by-point responses
Referee: [Pose-conditioned recalibration mechanism] The pose-conditioned recalibration mechanism (described in the abstract) re-projects 3D point and semantic maps using camera parameters that must be estimated by the model itself, since inputs are unposed. This setup risks degenerate solutions in which pose adjustments compensate for errors in the 3D predictions rather than enforcing genuine geometric-semantic consistency; the dual-masking and coarse-to-fine components do not break this coupling, leaving the central claim of robustness under unposed sparse-view inputs vulnerable.
Authors: We appreciate the referee's concern regarding the risk of degenerate solutions where estimated poses could compensate for inaccuracies in 3D predictions. However, the dual-masking strategy is designed to operate independently of pose estimation: by masking both encoder and decoder tokens and directing decoder masks toward geometry-rich regions, the model must infer structural information solely from incomplete visual cues in the input images. This creates a geometry-aware representation prior that does not depend on pose adjustments. The coarse-to-fine Gaussian splatting further mitigates coupling by initializing with coarse geometry and radiance predictions before progressive refinement, limiting the scope for pose to retroactively correct errors. The recalibration mechanism then uses the estimated poses only to re-project and align outputs for consistency losses, with the joint multi-task objective encouraging genuine cross-task alignment rather than compensation. We will add a dedicated discussion and potential failure-case analysis in the revised manuscript to better articulate these interactions. revision: partial
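One standard way to probe this coupling empirically, not something the paper claims to do: detach the estimated poses inside the consistency loss so gradients cannot adjust pose to absorb errors in the 3D predictions, then compare against the coupled variant. A minimal sketch, reusing `cross_view_semantic_loss` from the re-projection sketch above:

```python
# Stop-gradient ablation for the pose/geometry coupling (illustrative remedy).
def decoupled_consistency_loss(points_world, point_sem, sem_2d, K, R, t):
    # Detaching (K, R, t) blocks gradient flow into the pose estimator, so the
    # loss can only be reduced by improving the 3D points and semantics.
    return cross_view_semantic_loss(
        points_world, point_sem, sem_2d,
        K.detach(), R.detach(), t.detach(),
    )
```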
Referee: [Abstract and throughout] The manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons to prior self-supervised methods. This absence is load-bearing for the claims of robustness, generalization across tasks, and resolution of weak geometry/inconsistencies, as stated in the abstract.
Authors: We agree that the absence of quantitative results, ablation studies, error analysis, and comparisons to prior self-supervised methods is a significant limitation in the current manuscript. This weakens the ability to fully substantiate the claims regarding robustness under unposed inputs and cross-task generalization. We will incorporate these elements in the revised version, including benchmark evaluations for geometry, appearance, and semantic consistency, component-wise ablations, error breakdowns, and direct comparisons to relevant self-supervised baselines. revision: yes
Circularity Check
No circularity in derivation chain
Full rationale
The paper presents UniSplat as a new feed-forward framework with three proposed components (dual-masking for geometry induction, coarse-to-fine Gaussian splatting, and pose-conditioned recalibration via re-projection and alignment). These are architectural and loss-design choices, not derivations that reduce predictions or results to inputs by construction. No equations are exhibited that make any output equivalent to a fitted parameter or self-defined quantity. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The recalibration mechanism is described as an interrelation step using estimated parameters, but this is a proposed consistency objective rather than a tautological redefinition; any potential degeneracy from joint pose estimation is a methodological concern outside the circularity criteria. The overall claim of unified representations is presented as emerging from the combination of these independent components without reducing to prior inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the dual-masking strategy strengthens geometry induction by forcing inference from incomplete cues.
- Domain assumption: coarse-to-fine Gaussian splatting reduces appearance-semantics inconsistencies.