Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding
Pith reviewed 2026-06-26 18:37 UTC · model grok-4.3
The pith
A vision-language model reconstructs 3D occupancy from posed RGB images to achieve 3D scene understanding without explicit 3D inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Occ-VLM reconstructs 3D scene occupancy from posed RGB images using only a 2D vision encoder as an auxiliary geometric prior to spatially associate foreground 2D tokens with 3D space, which are then decoded by an LLM for unified scene understanding, attaining state-of-the-art on multi-view occupancy prediction and on par performance with 3D-input VLMs on 3D VQA and dense captioning.
What carries the argument
The occupancy reconstruction from 2D tokens that serves as the geometric prior to ground 2D visual features in 3D coordinates for the LLM decoder.
If this is right
- 3D scene understanding becomes possible using only standard RGB cameras without point clouds or depth sensors.
- The model maintains performance comparable to 3D-input methods on visual question answering and dense captioning in 3D scenes.
- Geometric perception improves to state-of-the-art levels in multi-view occupancy prediction while supporting language tasks.
- A single framework handles both accurate 3D geometry and robust vision-language reasoning without structural decoupling.
Where Pith is reading between the lines
- Such models might enable more efficient embodied AI systems that process only image streams in real time.
- Extending the occupancy prior to dynamic scenes could allow tracking moving objects in 3D from video inputs.
- The method suggests potential for reducing hardware requirements in indoor robotics by eliminating the need for 3D sensors.
- Testing on larger or more diverse indoor datasets would reveal if the occupancy prior scales beyond current benchmarks.
Load-bearing premise
Reconstructing 3D occupancy solely from 2D posed RGB images using a vision encoder provides a sufficient geometric prior to associate 2D tokens with 3D space.
What would settle it
Demonstrating that Occ-VLM underperforms 3D-input VLMs on standard 3D VQA or dense captioning benchmarks by a significant margin would falsify the claim of on-par performance.
Figures
read the original abstract
Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Occ-VLM, a 3D scene understanding framework that operates solely on posed RGB images using a single 2D vision encoder. It reconstructs 3D scene occupancy as an auxiliary geometric prior to spatially associate foreground 2D tokens with 3D coordinates; these tokens are then fed to an LLM for unified reasoning on tasks including 3D VQA and dense captioning. The central claims are SOTA performance on multi-view occupancy prediction and parity with 3D-input VLMs on the language benchmarks.
Significance. If the results hold, the work would be significant for enabling a unified 3D vision-language representation without explicit 3D inputs or a separate geometry encoder, which could simplify architectures for embodied and robotic applications. The approach of grounding via occupancy prediction is a clean way to inject geometric structure while preserving 2D pre-training benefits.
major comments (2)
- [Abstract] Abstract: The claim that occupancy predicted from the 2D encoder alone provides a sufficient geometric prior for spatially associating 2D tokens with 3D space (enabling LLM reasoning without any 3D encoder) is load-bearing for the unified-representation thesis, yet the abstract supplies no verification that the occupancy head achieves the required accuracy or resolution for indoor VQA queries, nor that the association step (projection/attention/lifting) actually propagates the geometry rather than acting as a decoupled auxiliary loss.
- [Abstract] Abstract: The reported SOTA on occupancy prediction and parity on 3D VQA/dense captioning are stated without any mention of datasets, baselines, metrics, error bars, or ablation controls, rendering the central performance claims impossible to evaluate for soundness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the abstract should better substantiate its central claims and will revise it accordingly while preserving its concise nature.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that occupancy predicted from the 2D encoder alone provides a sufficient geometric prior for spatially associating 2D tokens with 3D space (enabling LLM reasoning without any 3D encoder) is load-bearing for the unified-representation thesis, yet the abstract supplies no verification that the occupancy head achieves the required accuracy or resolution for indoor VQA queries, nor that the association step (projection/attention/lifting) actually propagates the geometry rather than acting as a decoupled auxiliary loss.
Authors: The abstract summarizes the framework; the full paper provides the requested verification. Section 4.1 reports occupancy mIoU and IoU on ScanNet and Matterport3D at multiple resolutions, showing the 2D-encoder occupancy head reaches accuracy sufficient for downstream VQA. Section 4.3 contains ablations that isolate the association module (projection + attention lifting) and demonstrate that removing it degrades VQA and captioning performance while the auxiliary occupancy loss alone does not, indicating the geometry is propagated rather than decoupled. We will revise the abstract to include one-sentence references to these quantitative results and the relevant sections. revision: yes
-
Referee: [Abstract] Abstract: The reported SOTA on occupancy prediction and parity on 3D VQA/dense captioning are stated without any mention of datasets, baselines, metrics, error bars, or ablation controls, rendering the central performance claims impossible to evaluate for soundness.
Authors: We acknowledge that the abstract omits these details. The manuscript reports them in full: datasets (ScanNet, Matterport3D, 3D-VQA, ScanRefer), baselines (both 2D- and 3D-input VLMs), metrics (mIoU, CIDEr, etc.), and ablations (Sections 4 and 5). Error bars appear in the supplementary material. We will expand the abstract with the primary dataset names and headline metrics to make the claims evaluable at a glance. revision: yes
Circularity Check
No circularity: empirical framework with no derivations or self-referential reductions
full rationale
The paper describes an empirical architecture (single 2D encoder + occupancy auxiliary prior + LLM) whose performance claims rest on benchmark results rather than any derivation chain. The abstract and provided text contain no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. The central claim is that the model attains SOTA occupancy prediction and parity on VQA/captioning; these are falsifiable experimental outcomes, not reductions to inputs by construction. No steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
NeurIPS35, 23716–23736 (2022)
Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. NeurIPS35, 23716–23736 (2022)
2022
-
[2]
arXiv preprint arXiv:2509.23661 (2025)
An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025)
Pith/arXiv arXiv 2025
-
[3]
In: CVPR
Azuma, D., Miyanishi, T., Kurita, S., Kawanabe, M.: Scanqa: 3d question answer- ing for spatial scene understanding. In: CVPR. pp. 19129–19139 (2022)
2022
-
[4]
arXiv preprint arXiv:2308.129661(2), 3 (2023) 18 J
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.129661(2), 3 (2023) 18 J. Li et al
Pith/arXiv arXiv 2023
-
[5]
In: CVPR
Cao, A.Q., De Charette, R.: Monoscene: Monocular 3d semantic scene completion. In: CVPR. pp. 3991–4001 (2022)
2022
-
[6]
In: 3DV (2017)
Chang, A., Dai, A., Funkhouser, T., Halber, M., Niebner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor environ- ments. In: 3DV (2017)
2017
-
[7]
In: CVPR
Chen, S., Chen, X., Zhang, C., Li, M., Yu, G., Fei, H., Zhu, H., Fan, J., Chen, T.: Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In: CVPR. pp. 26428–26438 (2024)
2024
-
[8]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, S., Zhu, H., Chen, X., Lei, Y., Yu, G., Chen, T.: End-to-end 3d dense captioning with vote2cap-detr. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11124–11133 (2023)
2023
-
[9]
In: CVPR
Chen, Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2cap: Context-aware dense captioning in rgb-d scans. In: CVPR. pp. 3193–3203 (2021)
2021
-
[10]
arXiv preprint arXiv:2509.13317 (2025)
Cheng, A.C., Fu, Y., Chen, Y., Liu, Z., Li, X., Radhakrishnan, S., Han, S., Lu, Y., Kautz, J., Molchanov, P., et al.: 3d aware region prompted vision language model. arXiv preprint arXiv:2509.13317 (2025)
arXiv 2025
-
[11]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets: Minkowski convolu- tional neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3075–3084 (2019)
2019
-
[12]
In: CVPR
Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner,M.:Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: CVPR. pp. 5828–5839 (2017)
2017
-
[13]
Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023),https://arxiv.org/abs/2305.06500
Pith/arXiv arXiv 2023
-
[14]
arXiv preprint arXiv:2403.11401 (2024)
Fu, R., Liu, J., Chen, X., Nie, Y., Xiong, W.: Scene-llm: Extending language model for 3d visual understanding and reasoning. arXiv preprint arXiv:2403.11401 (2024)
arXiv 2024
-
[15]
arXiv preprint arXiv:2309.00615 (2023)
Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., et al.: Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)
arXiv 2023
-
[16]
NeurIPS36, 20482–20494 (2023)
Hong,Y.,Zhen,H.,Chen,P.,Zheng,S.,Du,Y.,Chen,Z.,Gan,C.:3d-llm:Injecting the 3d world into large language models. NeurIPS36, 20482–20494 (2023)
2023
-
[17]
In: NeurIPS (2024)
Huang, H., Chen, Y., Wang, Z., Huang, R., Xu, R., Wang, T., Liu, L., Cheng, X., Zhao, Y., Pang, J., et al.: Chat-scene: Bridging 3d scene and large language models with object identifiers. In: NeurIPS (2024)
2024
-
[18]
arXiv preprint arXiv:2311.12871 (2023)
Huang, J., Yong, S., Ma, X., Linghu, X., Li, P., Wang, Y., Li, Q., Zhu, S.C., Jia, B., Huang, S.: An embodied generalist agent in 3d world. arXiv preprint arXiv:2311.12871 (2023)
Pith/arXiv arXiv 2023
-
[19]
In: CVPR
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Tri-perspective view for vision- based 3d semantic occupancy prediction. In: CVPR. pp. 9223–9232 (2023)
2023
-
[20]
In: ECCV
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In: ECCV. pp. 376–
-
[21]
arXiv preprint arXiv:2408.03326 (2024)
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
Pith/arXiv arXiv 2024
-
[22]
In: ICRA
Li, J., Lu, M., Liu, J., Wang, H., Gu, C., Zheng, W., Du, L., Zhang, S.: Sliceocc: Indoor 3d semantic occupancy prediction with vertical slice representation. In: ICRA. pp. 15762–15768. IEEE (2025) Occ-VLM 19
2025
-
[23]
In: ICML
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: ICML. pp. 19730–19742. PMLR (2023)
2023
-
[24]
arXiv preprint arXiv:2403.18814 (2024)
Li, Y., Zhang, Y., Wang, C., Zhong, Z., Chen, Y., Chu, R., Liu, S., Jia, J.: Mini- gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814 (2024)
Pith/arXiv arXiv 2024
-
[25]
In: CVPR
Li, Y., Yu, Z., Choy, C., Xiao, C., Alvarez, J.M., Fidler, S., Feng, C., Anandku- mar, A.: Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In: CVPR. pp. 9087–9098 (2023)
2023
-
[26]
In: Proceedings of the 2024 conference on empirical methods in natural language processing
Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971– 5984 (2024)
2024
-
[27]
In: CVPR
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. pp. 26296–26306 (2024)
2024
-
[28]
NeurIPS36, 34892– 34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS36, 34892– 34916 (2023)
2023
-
[29]
In: European conference on computer vision
Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: Llava-plus: Learning to use tools for creating multimodal agents. In: European conference on computer vision. pp. 126–142. Springer (2024)
2024
-
[30]
ICLR (2023)
Ma,X.,Yong,S.,Zheng,Z.,Li,Q.,Liang,Y.,Zhu,S.C.,Huang,S.:Sqa3d:Situated question answering in 3d scenes. ICLR (2023)
2023
-
[31]
In: CVPR
Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., et al.: Openeqa: Embodied question answering in the era of foundation models. In: CVPR. pp. 16488–16498 (2024)
2024
-
[32]
NeurIPS37, 76819– 76847 (2024)
Man, Y., Zheng, S., Bao, Z., Hebert, M., Gui, L., Wang, Y.X.: Lexicon3d: Probing visual foundation models for complex 3d scene understanding. NeurIPS37, 76819– 76847 (2024)
2024
-
[33]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4460–4470 (2019)
2019
-
[34]
In: ICRA
Pan, M., Liu, J., Zhang, R., Huang, P., Li, X., Xie, H., Wang, B., Liu, L., Zhang, S.: Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In: ICRA. pp. 12404–12411. IEEE (2024)
2024
-
[35]
In: European Conference on Computer Vision
Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M., Geiger, A.: Convolutional occupancy networks. In: European Conference on Computer Vision. pp. 523–540. Springer (2020)
2020
-
[36]
arXiv preprint arXiv:2511.16454 (2025)
Petit, D., Bourgeois, S., Gay-Bellile, V., Chabot, F., Barthe, L.: Llava 3: Repre- senting 3d scenes like a cubist painter to boost 3d scene understanding of vlms. arXiv preprint arXiv:2511.16454 (2025)
arXiv 2025
-
[37]
In: ECCV
Qi, Z., Dong, R., Zhang, S., Geng, H., Han, C., Ge, Z., Yi, L., Ma, K.: Shapellm: Universal 3d object understanding for embodied interaction. In: ECCV. pp. 214–
-
[38]
In: CVPR
Qi, Z., Fang, Y., Sun, Z., Wu, X., Wu, T., Wang, J., Lin, D., Zhao, H.: Gpt4point: A unified framework for point-language understanding and generation. In: CVPR. pp. 26417–26427 (2024)
2024
-
[39]
arXiv preprint arXiv:2501.01428 (2025) 20 J
Qi, Z., Zhang, Z., Fang, Y., Wang, J., Zhao, H.: Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428 (2025) 20 J. Li et al
arXiv 2025
-
[40]
In: European conference on computer vision
Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., Han, S.: Searching efficient 3d architectures with sparse point-voxel convolution. In: European conference on computer vision. pp. 685–702. Springer (2020)
2020
-
[41]
In: ACM MM
Tang,Y.,Han, X.,Li,X., Yu,Q., Hao,Y.,Hu,L., Chen,M.:Minigpt-3d:Efficiently aligning 3d point clouds with large language models using 2d priors. In: ACM MM. pp. 6617–6626 (2024)
2024
-
[42]
arXiv preprint arXiv:2407.106712(8) (2024)
Team, Q., et al.: Qwen2 technical report. arXiv preprint arXiv:2407.106712(8) (2024)
Pith/arXiv arXiv 2024
-
[43]
arXiv preprint arXiv:2503.06271 (2025)
Thai, A., Peng, S., Genova, K., Guibas, L., Funkhouser, T.: Splattalk: 3d vqa with gaussian splatting. arXiv preprint arXiv:2503.06271 (2025)
arXiv 2025
-
[44]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Tong, W., Sima, C., Wang, T., Chen, L., Wu, S., Deng, H., Gu, Y., Lu, L., Luo, P., Lin, D., et al.: Scene as occupancy. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8406–8415 (2023)
2023
-
[45]
In: ICCV
Wald, J., Avetisyan, A., Navab, N., Tombari, F., Nießner, M.: Rio: 3d object in- stance re-localization in changing indoor environments. In: ICCV. pp. 7658–7667 (2019)
2019
-
[46]
In: Proceedings of the 33rd ACM International Conference on Multimedia
Wang, H., Wei, X., Zhang, X., Li, J., Bai, C., Li, Y., Lu, M., Zheng, W., Zhang, S.: Embodiedocc++: Boosting embodied 3d occupancy prediction with plane regular- ization and uncertainty sampler. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 925–934 (2025)
2025
-
[47]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)
2025
-
[48]
arXiv preprint arXiv:2409.12191 (2024)
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
Pith/arXiv arXiv 2024
-
[49]
arXiv preprint arXiv:2308.08769 (2023)
Wang, Z., Huang, H., Zhao, Y., Zhang, Z., Zhao, Z.: Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes. arXiv preprint arXiv:2308.08769 (2023)
arXiv 2023
-
[50]
In: ICCV
Wei, Y., Zhao, L., Zheng, W., Zhu, Z., Zhou, J., Lu, J.: Surroundocc: Multi- camera 3d occupancy prediction for autonomous driving. In: ICCV. pp. 21729– 21740 (2023)
2023
-
[51]
arXiv preprint arXiv:2505.23747 (2025)
Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025)
Pith/arXiv arXiv 2025
-
[52]
In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision
Wu, Y., Zheng, W., Zuo, S., Huang, Y., Zhou, J., Lu, J.: Embodiedocc: Embodied 3d occupancy prediction for vision-based online scene understanding. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 26360– 26370 (2025)
2025
-
[53]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., et al.: Conical visual concentration for efficient large vision- language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14593–14603 (2025)
2025
-
[54]
Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: Pointllm: Empowering largelanguagemodelstounderstandpointclouds.In:ECCV.pp.131–147.Springer (2024)
2024
-
[55]
In: Proceedings of the IEEE/CVF international conference on computer vision
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language im- age pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)
2023
-
[56]
Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Video instruction tuning with synthetic data (2024) Occ-VLM 21
2024
-
[57]
arXiv preprint arXiv:2505.24625 (2025)
Zheng, D., Huang, S., Li, Y., Wang, L.: Learning from videos for 3d world: En- hancing mllms with 3d vision geometry priors. arXiv preprint arXiv:2505.24625 (2025)
arXiv 2025
-
[58]
In: CVPR
Zheng, D., Huang, S., Wang, L.: Video-3d llm: Learning position-aware video rep- resentation for 3d scene understanding. In: CVPR. pp. 8995–9006 (2025)
2025
-
[59]
arXiv preprint arXiv:2409.18125 (2024)
Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125 (2024)
arXiv 2024
-
[60]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhu, X., Zhou, H., Wang, T., Hong, F., Ma, Y., Li, W., Li, H., Lin, D.: Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9939–9948 (2021)
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.