pith. machine review for the scientific record.

arxiv: 2605.09899 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D object detection · cross-modal distillation · hyperbolic geometry · multimodal fusion · point cloud features · image features · knowledge distillation · voxel optimization

The pith

Hyperbolic geometry reduces semantic loss when fusing image and point cloud features for 3D object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HGC-Det, a cross-modal distillation approach that maps high-dimensional image features and low-dimensional point cloud features into hyperbolic space. This mapping is intended to preserve more semantic information than standard Euclidean fusion while addressing spatial misalignment and modality differences. Three supporting modules refine voxel representations from 2D cues, perform the hyperbolic transfer, and correct resulting geometry issues. The authors report that the full pipeline delivers stronger detection results at lower computational cost than prior distillation methods across indoor and outdoor benchmark datasets.
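The abstract names the three modules but not their interfaces. The sketch below is a purely structural rendering of that description, assuming invented tensor shapes and placeholder functions (the linear stubs, the SGVO/HFT/FAGO lambdas, and all dimensions are illustrative, not the authors'); the hyperbolic embedding that HFT would actually apply is sketched separately under "What carries the argument" below.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Random linear map standing in for a learned sub-network (sketch only)."""
    W = rng.standard_normal((d_in, d_out)) * 0.01
    return lambda x: x @ W

# Placeholder branches and modules; only the ordering of the calls below
# (image branch, SGVO, HFT, FAGO, detection head) follows the abstract.
image_branch  = linear(3 * 32 * 32, 256)   # high-dimensional image semantics
voxel_encoder = linear(3, 64)              # low-dimensional point-cloud geometry
proj_sgvo     = linear(256, 64)
proj_hft      = linear(256, 64)
sgvo = lambda vox, img: vox + 0.1 * proj_sgvo(img)                 # 2D semantic-guided voxel refinement
hft  = lambda vox, img: np.concatenate([vox, proj_hft(img)], -1)   # cross-modal transfer (hyperbolic step elided here)
fago = linear(128, 128)                                            # post-fusion geometry correction
head = linear(128, 7)                                              # e.g. box centre, size, yaw

def hgc_det_forward(image_vec, points):
    """Dataflow only: image branch -> SGVO -> HFT -> FAGO -> detection head."""
    img = image_branch(image_vec)
    vox = voxel_encoder(points).mean(axis=0, keepdims=True)  # crudely pooled "voxel" feature
    vox = sgvo(vox, img)
    fused = hft(vox, img)
    return head(fago(fused))

print(hgc_det_forward(rng.standard_normal((1, 3 * 32 * 32)), rng.standard_normal((500, 3))).shape)  # (1, 7)
```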

Core claim

The authors claim that hyperbolic geometry constrained cross-modal feature transfer (HFT) alleviates semantic loss during the fusion of image and point cloud data. Combined with 2D semantic-guided voxel optimization and feature aggregation-based geometry optimization, the resulting HGC-Det framework produces more accurate 3D detections than existing cross-modal distillation techniques while maintaining a favorable accuracy-compute trade-off.

What carries the argument

The hyperbolic geometry constrained cross-modal feature transfer (HFT) component, which embeds image and point cloud features in hyperbolic space to exploit its exponential volume growth for hierarchical semantic preservation.
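The paper's exact HFT formulation is not reproduced in the abstract. As a point of reference, a minimal numerical sketch of the standard Poincaré-ball machinery such a module would typically build on (exponential map at the origin, Möbius addition, geodesic distance) looks as follows; the function names, the shared feature width, and the toy usage are illustrative assumptions, not the authors' code.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Maps a Euclidean (tangent-space) feature vector v into the ball, where
    volume grows exponentially toward the boundary, giving hierarchy-like
    structure room to spread out.
    """
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def mobius_add(x, y, c=1.0):
    """Mobius addition: the ball's analogue of vector addition x + y."""
    xy = np.sum(x * y, axis=-1, keepdims=True)
    x2 = np.sum(x * x, axis=-1, keepdims=True)
    y2 = np.sum(y * y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / np.maximum(den, 1e-7)

def poincare_dist(x, y, c=1.0, eps=1e-7):
    """Geodesic distance between two points already inside the ball."""
    diff_norm = np.linalg.norm(mobius_add(-x, y, c), axis=-1)
    diff_norm = np.clip(np.sqrt(c) * diff_norm, 0.0, 1.0 - eps)
    return (2.0 / np.sqrt(c)) * np.arctanh(diff_norm)

# Toy usage: an image feature and a point-cloud feature (assumed already
# lifted to a shared width) are embedded in the same ball and compared
# with the hyperbolic distance instead of a Euclidean metric.
img_feat = expmap0(np.random.randn(256) * 0.1)
pc_feat  = expmap0(np.random.randn(256) * 0.1)
print(poincare_dist(img_feat, pc_feat))
```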

If this is right

  • Image semantics can adaptively improve the spatial representation quality of the point cloud branch.
  • Hyperbolic embedding limits information loss when combining features of mismatched dimensionality.
  • Post-fusion geometry correction can offset any spatial degradation from 2D-guided voxel refinement.
  • The overall pipeline achieves higher detection accuracy at lower computational cost on SUN RGB-D, ARKitScenes, KITTI, and nuScenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hyperbolic transfer idea could be tested on other multimodal perception tasks such as instance segmentation or tracking.
  • Adjusting the curvature of the hyperbolic space might further tune performance for indoor versus outdoor scene statistics.
  • The approach suggests non-Euclidean embeddings may help fusion problems in additional high-to-low dimensional sensor pairings.

Load-bearing premise

Mapping high-dimensional image features and low-dimensional point cloud features into hyperbolic space will reliably reduce semantic loss during fusion without introducing new distortions or alignment errors that cancel the gains.

What would settle it

A direct comparison on KITTI or nuScenes showing that HGC-Det yields no measurable gain in mean average precision, or requires higher compute than standard Euclidean distillation baselines, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.09899 by Houde Quan, Kanglin Ning, Qifan Li, Wenrui Li, Xiaopeng Fan, Xingtao Wang.

Figure 1: A comparative illustration of the conventional Fusion-to-Single cross …
Figure 2: An illustrative example of projecting a 3D point cloud onto a 2D image …
Figure 3: The pipeline of our proposed HGC-Det is shown. Our detector uses images and point clouds as input, where the image input can be a surround …
Figure 5: Analysis of manual setting hyper-parameter’s effect to detector’s mean …
read the original abstract

Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, the modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficient of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO compensates for potential spatial feature degradation introduced by the 2D semantic-guided voxel optimization component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes HGC-Det, a cross-modal distillation framework for multimodal 3D object detection that integrates an image branch with a point-cloud branch containing three modules: 2D semantic-guided voxel optimization (SGVO), hyperbolic geometry constrained cross-modal feature transfer (HFT), and feature aggregation-based geometry optimization (FAGO). SGVO uses image semantics to refine voxel representations, HFT maps high-dimensional image features into hyperbolic space to reduce semantic loss when fusing with low-dimensional point-cloud features, and FAGO compensates for spatial degradation. Experiments on SUN RGB-D, ARKitScenes, KITTI, and nuScenes are reported to yield a superior accuracy versus computational cost trade-off.

Significance. If the hyperbolic transfer in HFT can be shown to preserve the geometric cues needed for 3D bounding-box regression without introducing unquantified distortions, the approach would provide a geometrically motivated solution to modality heterogeneity and representation crisis in cross-modal 3D detection. This could improve efficiency in both indoor and outdoor settings. The current description, however, supplies no ablations, curvature schedules, or error analysis that would allow assessment of whether the claimed gains are attributable to the hyperbolic constraint rather than the 2D guidance or aggregation steps.
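One way to read the referee's request concretely: keep SGVO and FAGO fixed and swap only the geometry of the transfer loss, so that any mAP difference isolates the hyperbolic constraint. The toggle below is a hypothetical illustration of that control, not the authors' training code; it assumes the `expmap0` and `poincare_dist` helpers from the earlier sketch are in scope.

```python
import numpy as np

def distill_loss(student_feat, teacher_feat, geometry="hyperbolic", c=1.0):
    """Hypothetical distillation loss used only to illustrate the ablation design.
    Assumes expmap0 / poincare_dist from the earlier Poincare-ball sketch."""
    if geometry == "hyperbolic":
        s = expmap0(student_feat, c)   # point-cloud branch features into the ball
        t = expmap0(teacher_feat, c)   # image branch features into the ball
        return float(np.mean(poincare_dist(s, t, c)))
    # Euclidean control: plain L2 on the raw (dimension-aligned) features
    return float(np.mean(np.linalg.norm(student_feat - teacher_feat, axis=-1)))
```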

major comments (1)
  1. [Abstract] Abstract: The central claim that HFT 'exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features' is load-bearing for the accuracy-cost trade-off reported on KITTI and nuScenes, yet the abstract (and the provided manuscript description) contains no derivation of the embedding, no value or schedule for curvature, no exponential-map formulation, and no ablation that isolates HFT from SGVO and FAGO. Without these elements it is impossible to verify that curvature-induced misalignment or distortion does not offset the reported gains.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'limit the efficient of these cross-modal distillation methods' is grammatically incorrect and should read 'limit the efficiency of these cross-modal distillation methods.'

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying areas where the presentation of the HFT component can be strengthened. We address the major comment below and will revise the manuscript to incorporate the requested clarifications and additional experiments.

read point-by-point responses
  1. Referee: The central claim that HFT 'exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features' is load-bearing for the accuracy-cost trade-off reported on KITTI and nuScenes, yet the abstract (and the provided manuscript description) contains no derivation of the embedding, no value or schedule for curvature, no exponential-map formulation, and no ablation that isolates HFT from SGVO and FAGO. Without these elements it is impossible to verify that curvature-induced misalignment or distortion does not offset the reported gains.

    Authors: We agree that the abstract is concise and omits key technical details of the HFT module, making it difficult to fully assess the source of the reported gains. In the revised manuscript we will expand the abstract to briefly state that HFT employs the exponential map to embed image features into hyperbolic space with a fixed curvature of -1 (standard for such feature hierarchies) and that this mapping is used to mitigate dimension-induced semantic loss. We will also insert a short derivation of the embedding and the hyperbolic distance formulation into the methods section. To isolate the contribution of the hyperbolic constraint, we will add a dedicated ablation that replaces HFT with Euclidean feature concatenation while keeping SGVO and FAGO unchanged; the resulting performance drop on KITTI and nuScenes will be reported to quantify the benefit attributable to hyperbolic geometry. We will further include a short sensitivity analysis on curvature values to address potential distortion concerns. These additions will allow readers to verify that the geometric properties of hyperbolic space are responsible for the observed accuracy-efficiency improvements rather than the other modules alone. revision: yes
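For reference, the standard Poincaré-ball formulation the rebuttal appears to allude to (exponential map at the origin, Möbius addition, and geodesic distance for curvature −c with c > 0; the stated fixed curvature of −1 corresponds to c = 1) reads as follows. The paper's own derivation is not reproduced here and may differ in detail.

```latex
% Reference Poincare-ball formulation (not taken from the paper):
% exponential map at the origin, Mobius addition, geodesic distance; curvature -c, c > 0.
\exp_0^{c}(\mathbf{v}) = \tanh\!\bigl(\sqrt{c}\,\lVert\mathbf{v}\rVert\bigr)\,
  \frac{\mathbf{v}}{\sqrt{c}\,\lVert\mathbf{v}\rVert},
\qquad
\mathbf{x} \oplus_{c} \mathbf{y} =
  \frac{\bigl(1 + 2c\langle\mathbf{x},\mathbf{y}\rangle + c\lVert\mathbf{y}\rVert^{2}\bigr)\mathbf{x}
        + \bigl(1 - c\lVert\mathbf{x}\rVert^{2}\bigr)\mathbf{y}}
       {1 + 2c\langle\mathbf{x},\mathbf{y}\rangle + c^{2}\lVert\mathbf{x}\rVert^{2}\lVert\mathbf{y}\rVert^{2}},
\qquad
d_{c}(\mathbf{x},\mathbf{y}) =
  \frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\bigl(\sqrt{c}\,\lVert(-\mathbf{x}) \oplus_{c} \mathbf{y}\rVert\bigr).
```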

Circularity Check

0 steps flagged

No circularity: framework components rest on external geometric properties and experimental validation

full rationale

The paper proposes HGC-Det with SGVO, HFT, and FAGO modules that apply hyperbolic geometry and semantic guidance to cross-modal fusion. Neither the abstract nor the described components contain equations, derivations, or parameter-fitting steps that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims are supported by experiments across SUN RGB-D, ARKitScenes, KITTI, and nuScenes rather than by internal redefinitions. This matches the default expectation of a non-circular method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the method appears to build on standard deep-learning assumptions for feature extraction and geometric embedding.

pith-pipeline@v0.9.0 · 5578 in / 1172 out tokens · 46243 ms · 2026-05-12T04:21:45.367557+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
