Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
Pith reviewed 2026-05-13 21:16 UTC · model grok-4.3
The pith
Pretraining a transformer on multi-view colored pointmaps with contrastive language alignment produces unified 3D scene representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniScene3D is a transformer encoder pretrained on colored pointmaps from multiple views by contrastive alignment with language. Two new mechanisms, cross-view geometric alignment and grounded view alignment, enforce consistency in geometry and semantics across viewpoints. This joint modeling of appearance and structure produces unified scene representations that reach state-of-the-art results on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA after low-shot or task-specific fine-tuning.
What carries the argument
The UniScene3D transformer encoder, together with the cross-view geometric alignment and grounded view alignment losses that enforce geometric and semantic consistency across views of the colored pointmap inputs.
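The review does not reproduce the loss equations, but the language-contrastive backbone of this kind of pretraining is conventionally a symmetric InfoNCE objective between pooled pointmap features and text embeddings. Below is a minimal sketch of that reading; the function names, pooling, and temperature are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of CLIP-style contrastive alignment between pooled
# colored-pointmap features and text embeddings. All names are
# illustrative placeholders, not the UniScene3D codebase.
import torch
import torch.nn.functional as F

def clip_style_loss(point_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (scene, caption) pairs.

    point_emb: (B, D) pooled encoder features for B scenes.
    text_emb:  (B, D) matching language embeddings (e.g., from CLIP).
    """
    point_emb = F.normalize(point_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = point_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; both retrieval directions count.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The symmetric form penalizes both scene-to-text and text-to-scene retrieval errors, matching the original CLIP recipe; the paper's two alignment losses would then be added on top of a term like this.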
If this is right
- A single pretrained encoder can be adapted to multiple 3D tasks with far less data than training each task separately.
- Joint appearance-geometry modeling improves results on tasks that require both visual recognition and spatial reasoning.
- Low-shot fine-tuning becomes viable for new scenes or cameras because the pretraining already supplies rich features.
- The same representations support viewpoint grounding, retrieval, classification, and question answering without architectural changes.
Where Pith is reading between the lines
- The colored pointmap format could simplify combining this method with existing 2D image pipelines that already output depth or point clouds; a back-projection sketch follows this list.
- Extending the same alignment losses to video sequences might add temporal consistency for dynamic scene understanding.
- If the approach scales to larger environments, it could support language-guided 3D scene editing or robot navigation from few observations.
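To make the first bullet concrete: a colored pointmap is a per-pixel grid of (x, y, z, r, g, b) values, so any 2D pipeline that already emits aligned depth can produce one by back-projecting through the camera intrinsics. A minimal sketch under a pinhole camera model follows; the function name and intrinsics parameters are illustrative assumptions, not part of the paper.

```python
# Hedged sketch: build a colored pointmap from aligned RGB-D under a
# pinhole model. Parameter names (fx, fy, cx, cy) are assumptions.
import numpy as np

def depth_to_colored_pointmap(depth: np.ndarray, rgb: np.ndarray,
                              fx: float, fy: float,
                              cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map into an (H, W, 6) colored pointmap.

    depth: metric depth in the camera frame; rgb: (H, W, 3) aligned colors.
    Returns per-pixel (x, y, z, r, g, b).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx  # pinhole back-projection
    y = (v - cy) * depth / fy
    return np.concatenate([np.stack([x, y, depth], axis=-1),
                           rgb.astype(np.float32)], axis=-1)
```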
Load-bearing premise
The proposed cross-view geometric and grounded view alignments will successfully enforce the consistency needed for generalizable unified representations.
What would settle it
An ablation that removes the two alignment losses and measures no drop in cross-view consistency metrics or downstream task accuracy would falsify the central claim.
Original abstract
Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometry and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate our state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UniScene3D, a transformer-based encoder that learns unified 3D scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry through contrastive pretraining aligned with CLIP. It introduces two novel objectives, cross-view geometric alignment and grounded view alignment, to enforce cross-view geometric and semantic consistency. The method is evaluated via low-shot and task-specific fine-tuning on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA, where it reports state-of-the-art performance.
Significance. If the empirical results hold, this work would advance unified 3D scene understanding by bridging 2D appearance and 3D geometry through colored pointmap inputs and targeted alignment losses. The low-shot evaluation focus is practically relevant for data-scarce 3D settings, and successful verification of the alignment objectives could reduce reliance on task-specific 3D architectures.
major comments (2)
- [Abstract] The claim of state-of-the-art performance on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA is stated without any quantitative metrics, baseline comparisons, ablation tables, or error analysis. This absence prevents verification of whether the proposed alignments drive the reported gains or whether the colored-pointmap input alone suffices.
- [Section 3.2, Alignment Objectives] The cross-view geometric alignment and grounded view alignment are introduced to enforce consistency, yet no ablation studies, consistency metrics (e.g., cross-view feature similarity before/after), or failure-mode analysis are provided to confirm these objectives produce the claimed geometric and semantic consistency beyond standard contrastive losses.
minor comments (1)
- [Section 4] Include full details on dataset splits, hyperparameter choices, and training schedules to support reproducibility of the low-shot and fine-tuning experiments.
Simulated Author's Rebuttal
Thank you for your constructive review and for acknowledging the potential of UniScene3D to advance unified 3D scene understanding. We address each major comment below and will make the corresponding revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The claim of state-of-the-art performance on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA is stated without any quantitative metrics, baseline comparisons, ablation tables, or error analysis. This absence prevents verification of whether the proposed alignments drive the reported gains or whether the colored-pointmap input alone suffices.
Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will update the abstract to report key numerical results from our low-shot and task-specific evaluations (e.g., accuracy or recall gains on viewpoint grounding and scene retrieval), together with brief baseline comparisons. This will make the SOTA claims verifiable and clarify the contribution of the alignment objectives beyond the colored-pointmap input. revision: yes
Referee: [Section 3.2, Alignment Objectives] The cross-view geometric alignment and grounded view alignment are introduced to enforce consistency, yet no ablation studies, consistency metrics (e.g., cross-view feature similarity before/after), or failure-mode analysis are provided to confirm these objectives produce the claimed geometric and semantic consistency beyond standard contrastive losses.
Authors: We acknowledge that explicit ablations are needed to isolate the effect of the two proposed alignment objectives. We will add a dedicated ablation subsection (or expand Section 3.2) that reports (i) cross-view feature similarity and geometric consistency metrics before versus after each alignment, (ii) incremental performance gains when each objective is added to the base contrastive loss, and (iii) a concise discussion of observed failure cases where the alignments do not fully resolve inconsistencies. revision: yes
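The cross-view consistency metric promised in (i) is not pinned down in the rebuttal. One plausible instantiation, sketched below under the assumption that point correspondences between the two views are given, is the mean cosine similarity of corresponding encoder features; names are illustrative, not the authors' code.

```python
# Hedged sketch of a cross-view feature consistency metric: mean cosine
# similarity over N corresponding points seen from two viewpoints.
import torch
import torch.nn.functional as F

def cross_view_consistency(feat_a: torch.Tensor,
                           feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (N, D) features of the same N 3D points observed
    from two views (correspondences assumed given). Values near 1
    indicate view-invariant representations."""
    return F.cosine_similarity(feat_a, feat_b, dim=-1).mean()
```

Reporting this number before and after adding each alignment loss would directly support or undercut the consistency claim.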
Circularity Check
No circularity: empirical pretraining pipeline with independent downstream evaluations
Full rationale
The paper defines a transformer encoder on colored pointmaps, adds two new alignment losses to a contrastive objective, trains the model, and reports performance on separate tasks (viewpoint grounding, retrieval, classification, VQA). No equation reduces by construction to a fitted parameter or prior self-citation; the alignments are introduced as explicit, independent terms rather than being defined in terms of the target consistency they are meant to produce. All load-bearing claims rest on measured fine-tuning results rather than renaming or self-referential derivation. This is a standard empirical ML contribution whose central result is falsifiable outside the training loop.