pith. machine review for the scientific record. sign in

arxiv: 2605.08808 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords 3D scene captioninggeodesic attentionOblique manifoldLorentz hyperboloidpoint cloud processingdense captioningnon-Euclidean embeddings
0
0 comments X

The pith

Curvature complementarity between Oblique and Lorentz manifolds lets geodesic attention handle both precise object locations and coherent scene hierarchies in 3D captioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Curvature-Aware Captioning method to generate descriptions of 3D scenes from sparse point cloud data. Current Euclidean approaches cannot keep fine local geometry while also capturing the exponential growth of semantic relationships across a scene. The new framework places self-attention inside an Oblique manifold to maintain consistent local features and long-range links, then uses bidirectional geodesic cross-attention inside a Lorentz hyperboloid to build hierarchical relationships among objects. Theoretical analysis shows the two manifolds complement each other to remove the usual Euclidean-hyperbolic tension. Experiments on ScanRefer and Nr3D benchmarks report gains in both localization accuracy and description richness.

Core claim

The Curvature-Aware Captioning framework integrates self-attention within Oblique space to enforce dimensional homogeneity while establishing long-range dependencies, together with bidirectional geodesic cross-attention within Lorentz space to model hierarchical semantic relationships across scene instances. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships.

What carries the argument

Curvature complementarity between the Oblique manifold and Lorentz hyperboloid, which lets geodesic self-attention and cross-attention balance local geometric precision with global semantic hierarchies.

If this is right

  • Object localization becomes more accurate because dimensional homogeneity is preserved in the Oblique space.
  • Scene descriptions gain coherence and richness because hierarchical relationships are modeled explicitly in the Lorentz space.
  • The method reaches state-of-the-art scores on both localization and descriptive metrics on the ScanRefer and Nr3D benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-manifold attention pattern could be tested on other 3D tasks such as instance segmentation where both local geometry and scene-wide structure matter.
  • The approach might reduce reliance on heavy data augmentation for point clouds if the curvature complementarity already supplies stable features.
  • Performance on dense rather than sparse inputs is not examined, so the method's behavior when point density increases remains open.

Load-bearing premise

That self-attention inside Oblique space and bidirectional geodesic cross-attention inside Lorentz space will together produce both fine localization and coherent hierarchical descriptions from sparse point clouds without creating new instabilities.

What would settle it

A direct comparison on sparse point cloud scenes in which the dual-manifold method shows no gain or a loss in object localization precision or scene-description coherence relative to standard Euclidean attention baselines.

Figures

Figures reproduced from arXiv: 2605.08808 by Mingsong Chen, Xian Wei, Xuan Tang, Yingjie Liu, ZhangYangRui, Ziyao He.

Figure 1
Figure 1. Figure 1: Left: Overview of the proposed Curvature-Aware Captioning (CAC) framework. The processing pipeline is mapped onto non [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: 3D point cloud distribution contrast: Euclidean space [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on the ScanRefer benchmark. Green highlights indicate accurate object descriptions, purple highlights denote [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of CIDEr@0.5 Mean evaluation results of [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. % Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. % In this work, we propose a novel \textbf{\textsc{Curvature-Aware Captioning}} framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. % Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. % Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a Curvature-Aware Captioning framework for 3D scene understanding on sparse point clouds. It combines self-attention on the Oblique manifold (to enforce dimensional homogeneity and long-range dependencies) with bidirectional geodesic cross-attention on the Lorentz hyperboloid (to model hierarchical semantic relationships). The central claim is that curvature complementarity between these two non-Euclidean spaces resolves the Euclidean-hyperbolic conflict, yielding isotropic optimization, feature stability, and simultaneous gains in localization precision and scene-description coherence. Experiments are reported to achieve state-of-the-art results on the ScanRefer and Nr3D benchmarks.

Significance. If the theoretical complementarity and empirical gains are rigorously established, the work would offer a principled non-Euclidean alternative to standard Euclidean embeddings for dense captioning, potentially improving the balance between local geometry and global hierarchy in 3D vision tasks relevant to robotics and AR.

major comments (3)
  1. [§3.2] §3.2 (Theoretical Analysis): The assertion that 'curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization' is presented without an explicit derivation, proof sketch, or stability condition (e.g., no equation showing cancellation of instabilities or the isotropic condition). This is load-bearing for the framework's justification; absent such analysis, the claimed advantage of the geodesic mechanisms over Euclidean baselines remains unverified.
  2. [§4] §4 (Experiments): The abstract states 'significant gains in both localization accuracy and descriptive richness' and SOTA performance, yet no quantitative results, error bars, ablation tables, or statistical significance tests are referenced in the provided text. Without these, the empirical support for the central claim cannot be assessed.
  3. [§3.1] §3.1 (Method): The definitions of the geodesic attention mechanisms in Oblique and Lorentz spaces are introduced as novel but lack concrete implementation details (e.g., no explicit formulation of the geodesic distance or attention weight computation) that would allow reproduction or verification of the claimed dimensional homogeneity and hierarchy preservation.
minor comments (2)
  1. The abstract contains stray LaTeX artifacts ('%') that should be removed for clarity.
  2. Notation for the Oblique manifold and Lorentz hyperboloid should be introduced consistently with standard references in the non-Euclidean geometry literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We will revise the manuscript to strengthen the theoretical justification, ensure all empirical claims are explicitly supported by referenced results, and provide the missing implementation details for reproducibility.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Theoretical Analysis): The assertion that 'curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization' is presented without an explicit derivation, proof sketch, or stability condition (e.g., no equation showing cancellation of instabilities or the isotropic condition). This is load-bearing for the framework's justification; absent such analysis, the claimed advantage of the geodesic mechanisms over Euclidean baselines remains unverified.

    Authors: We agree that the current presentation of the theoretical analysis would benefit from greater rigor. In the revised manuscript we will expand §3.2 with an explicit derivation, including a proof sketch that shows how the opposing curvatures produce cancellation of instabilities and satisfy the isotropic optimization condition. The new material will contain the relevant equations and a direct comparison to Euclidean baselines. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states 'significant gains in both localization accuracy and descriptive richness' and SOTA performance, yet no quantitative results, error bars, ablation tables, or statistical significance tests are referenced in the provided text. Without these, the empirical support for the central claim cannot be assessed.

    Authors: The full manuscript already reports quantitative results, error bars, ablation tables, and SOTA comparisons in Section 4 and the associated tables. To address the concern, we will add explicit cross-references from the abstract and introduction to these tables and will include statistical significance tests (paired t-tests with p-values) in the revised version. revision: partial

  3. Referee: [§3.1] §3.1 (Method): The definitions of the geodesic attention mechanisms in Oblique and Lorentz spaces are introduced as novel but lack concrete implementation details (e.g., no explicit formulation of the geodesic distance or attention weight computation) that would allow reproduction or verification of the claimed dimensional homogeneity and hierarchy preservation.

    Authors: We will augment §3.1 with the explicit closed-form expressions for the geodesic distances on both manifolds and the precise formulas used to compute the attention weights. Pseudocode for the bidirectional geodesic cross-attention module will also be added to facilitate reproduction. revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical claim asserted without reduction to inputs

full rationale

The paper asserts that 'Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict' but provides no equations, derivations, or self-citations in the given text that reduce this confirmation to a fitted parameter, self-definition, or prior author result by construction. The framework introduces Oblique self-attention and Lorentz cross-attention as novel components justified by the claimed complementarity, with performance claims resting on benchmark experiments rather than a closed loop. No load-bearing step exhibits the enumerated circular patterns; the derivation chain remains self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond standard manifold geometry; the central claim rests on the unshown theoretical analysis of curvature complementarity.

axioms (1)
  • domain assumption Oblique manifold and Lorentz hyperboloid possess complementary curvature properties that resolve Euclidean embedding limitations for 3D point clouds
    Invoked in the theoretical analysis paragraph of the abstract as the basis for feature stability and hierarchy preservation.
invented entities (1)
  • Geodesic attention mechanisms in Oblique and Lorentz spaces no independent evidence
    purpose: To enforce dimensional homogeneity and model hierarchical semantics in 3D scene captioning
    Introduced as the core novel components; no independent evidence outside the proposed framework is provided in the abstract.

pith-pipeline@v0.9.0 · 5520 in / 1346 out tokens · 39447 ms · 2026-05-12T02:39:07.129494+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 4 internal anchors

  1. [1]

    Joint diagonalization on the oblique manifold for independent component analysis

    P-A Absil and Kyle A Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In2006 IEEE international conference on acoustics speech and signal processing proceedings, pages V–V . IEEE, 2006. 1, 2

  2. [2]

    Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes

    Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020. 6, 7

  3. [3]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  4. [4]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 2

  5. [5]

    Geometric deep learning: going beyond euclidean data.IEEE Signal Processing Mag- azine, 34(4):18–42, 2017

    Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data.IEEE Signal Processing Mag- azine, 34(4):18–42, 2017. 1

  6. [6]

    3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds

    Daigang Cai, Lichen Zhao, Jing Zhang, Lu Sheng, and Dong Xu. 3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16464–16473, 2022. 1, 2, 6

  7. [7]

    Hyperbolic deep reinforcement learn- ing.arXiv preprint arXiv:2210.01542, 2022

    Edoardo Cetin, Benjamin Chamberlain, Michael Bronstein, and Jonathan J Hunt. Hyperbolic deep reinforcement learn- ing.arXiv preprint arXiv:2210.01542, 2022. 2

  8. [8]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020. 2, 6, 7

  9. [9]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. pages 202–221, 2020. 6

  10. [10]

    D 3 net: A unified speaker-listener architecture for 3d dense captioning and visual grounding

    Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, and An- gel X Chang. D 3 net: A unified speaker-listener architecture for 3d dense captioning and visual grounding. InEuropean Conference on Computer Vision, pages 487–505. Springer,

  11. [11]

    End-to-end 3d dense captioning with vote2cap-detr

    Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Gang Yu, and Tao Chen. End-to-end 3d dense captioning with vote2cap-detr. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11124– 11133, 2023. 2, 6

  12. [12]

    V ote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(11):7331– 7347, 2024

    Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, and Tao Chen. V ote2cap-detr++: Decoupling localization and describing for end-to-end 3d dense captioning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(11):7331– 7347, 2024. 1, 2, 4, 5, 6

  13. [13]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18000–18010, 2023. 1

  14. [14]

    Scan2cap: Context-aware dense captioning in rgb- d scans

    Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb- d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203,

  15. [15]

    Unit3d: A unified transformer for 3d dense captioning and visual grounding

    Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, and Angel X Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 18109–18119, 2023. 1

  16. [16]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 6

  17. [17]

    Hyper- bolic image-text representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyper- bolic image-text representations. InInternational Confer- ence on Machine Learning, pages 7694–7731. PMLR, 2023. 4

  18. [18]

    Hyperbolic entailment cones for learning hierarchical em- beddings

    Octavian Ganea, Gary B ´ecigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical em- beddings. InInternational conference on machine learning, pages 1646–1655. PMLR, 2018. 2

  19. [19]

    A sparse oblique-manifold nonnega- tive matrix factorization for hyperspectral unmixing.IEEE Transactions on Geoscience and Remote Sensing, 60:1–13,

    Ziyang Guo, Anyou Min, Bing Yang, Junhong Chen, Hong Li, and Junbin Gao. A sparse oblique-manifold nonnega- tive matrix factorization for hyperspectral unmixing.IEEE Transactions on Geoscience and Remote Sensing, 60:1–13,

  20. [20]

    Lorentzian distance functions in contact ge- ometry.Journal of Topology and Analysis, 16(02):205–225,

    Jakob Hedicke. Lorentzian distance functions in contact ge- ometry.Journal of Topology and Analysis, 16(02):205–225,

  21. [21]

    Pointgroup: Dual-set point grouping for 3d instance segmentation

    Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and Pattern recognition, pages 4867–4876, 2020. 2

  22. [22]

    More: Multi-order relation mining for dense captioning in 3d scenes

    Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. More: Multi-order relation mining for dense captioning in 3d scenes. InEuropean Conference on Computer Vision, pages 528–545. Springer, 2022. 1, 2, 6

  23. [23]

    Context-aware alignment and mutual masking for 3d- language pre-training

    Zhao Jin, Munawar Hayat, Yuwei Yang, Yulan Guo, and Yin- jie Lei. Context-aware alignment and mutual masking for 3d- language pre-training. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 10984–10994, 2023. 2, 6

  24. [24]

    Hyperbolic image embeddings

    Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Usti- nova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6418–6428, 2020. 2

  25. [25]

    Bi-directional contextual attention for 3d dense captioning

    Minjung Kim, Hyung Suk Lim, Soonyoung Lee, Bumsoo Kim, and Gunhee Kim. Bi-directional contextual attention for 3d dense captioning. InEuropean Conference on Com- puter Vision, pages 385–401. Springer, 2024. 1, 7, 8

  26. [26]

    Hype: Hyperbolic entailment fil- tering for underspecified images and texts.arXiv preprint arXiv:2404.17507, 2024

    Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, and Sangdoo Yun. Hype: Hyperbolic entailment fil- tering for underspecified images and texts.arXiv preprint arXiv:2404.17507, 2024. 3

  27. [27]

    arXiv preprint arXiv:2506.16504 , year=

    Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxi- ang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, et al. Hunyuan3d 2.5: Towards high- fidelity 3d assets generation with ultimate details.arXiv preprint arXiv:2506.16504, 2025. 1

  28. [28]

    Geodesic self-attention for 3d point clouds.Advances in Neural Information Processing Systems, 35:6190–6203, 2022

    Zhengyu Li, Xuan Tang, Zihao Xu, Xihao Wang, Hui Yu, Mingsong Chen, et al. Geodesic self-attention for 3d point clouds.Advances in Neural Information Processing Systems, 35:6190–6203, 2022. 2

  29. [29]

    Learning of 3d graph convolution networks for point cloud analysis.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(8):4212–4224, 2021

    Zhi-Hao Lin, Sheng-Yu Huang, and Yu-Chiang Frank Wang. Learning of 3d graph convolution networks for point cloud analysis.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(8):4212–4224, 2021. 1

  30. [30]

    Application of hyperbolic space attention mechanisms in 3d point cloud classification

    Lamei Liu and Zhiqiang Liu. Application of hyperbolic space attention mechanisms in 3d point cloud classification. In2024 6th International Conference on Communications, Information System and Computer Engineering, pages 658–

  31. [31]

    Chuanjiang Long, Zhen Dong, Keqiang Wang, Feng Yu, Chaobin He, and Zhong-Ren Chen. Molecular weight distribution shape approach for simultaneously enhancing the stiffness, ductility and strength of isotropic semicrys- talline polymers based on linear unimodal and bimodal polyethylenes.Polymer, 275:125936, 2023. 4

  32. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

  33. [33]

    Complete 3d relationships extraction modality align- ment network for 3d dense captioning.IEEE Transactions on Visualization and Computer Graphics, 30(8):4867–4880,

    Aihua Mao, Zhi Yang, Wanxin Chen, Ran Yi, and Yong-jin Liu. Complete 3d relationships extraction modality align- ment network for 3d dense captioning.IEEE Transactions on Visualization and Computer Graphics, 30(8):4867–4880,

  34. [34]

    An end-to- end transformer model for 3d object detection

    Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to- end transformer model for 3d object detection. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 2906–2917, 2021. 2, 4, 5

  35. [35]

    Re- thinking the compositionality of point clouds through regu- larization in the hyperbolic space.Advances in Neural Infor- mation Processing Systems, 35:33741–33753, 2022

    Antonio Montanaro, Diego Valsesia, and Enrico Magli. Re- thinking the compositionality of point clouds through regu- larization in the hyperbolic space.Advances in Neural Infor- mation Processing Systems, 35:33741–33753, 2022. 2

  36. [36]

    Efficient non- maximum suppression

    Alexander Neubeck and Luc Van Gool. Efficient non- maximum suppression. In18th international conference on pattern recognition, pages 850–855. IEEE, 2006. 6

  37. [37]

    Poincar ´e embeddings for learning hierarchical representations.Advances in neural information processing systems, 30, 2017

    Maximillian Nickel and Douwe Kiela. Poincar ´e embeddings for learning hierarchical representations.Advances in neural information processing systems, 30, 2017. 2

  38. [38]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    A Paszke. Pytorch: An imperative style, high-performance deep learning library.arXiv preprint arXiv:1912.01703,

  39. [39]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017. 2, 4

  40. [40]

    Transductive few-shot classification on the oblique manifold

    Guodong Qi, Huimin Yu, Zhaohui Lu, and Shuzhao Li. Transductive few-shot classification on the oblique manifold. InProceedings of the IEEE/CVF international conference on computer vision, pages 8412–8422, 2021. 1, 2

  41. [41]

    Accept the modality gap: An exploration in the hyperbolic space

    Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27263–27272, 2024. 3

  42. [42]

    Self-critical sequence training for image captioning

    Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024,

  43. [43]

    Hyperbolic neural networks++

    Ryohei Shimizu, YUSUKE Mukuta, and Tatsuya Harada. Hyperbolic neural networks++. InInternational Conference on Learning Representations, 2021. 3

  44. [44]

    Cider: Consensus-based image description evalua- tion

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6

  45. [46]

    Spatiality-guided transformer for 3d dense captioning on point clouds.arXiv preprint arXiv:2204.10688, 2022

    Heng Wang, Chaoyi Zhang, Jianhui Yu, and Weidong Cai. Spatiality-guided transformer for 3d dense captioning on point clouds.arXiv preprint arXiv:2204.10688, 2022. 1, 2

  46. [47]

    Hypliloc: Towards effective lidar pose regression with hyperbolic fusion

    Sijie Wang, Qiyu Kang, Rui She, Wei Wang, Kai Zhao, Yang Song, and Wee Peng Tay. Hypliloc: Towards effective lidar pose regression with hyperbolic fusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5176–5185, 2023. 2

  47. [48]

    Hecpg: Hyperbolic embedding and confident patch- guided network for point cloud matching.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024

    Yifan Xie, Jihua Zhu, Shiqi Li, Naiwen Hu, and Pengcheng Shi. Hecpg: Hyperbolic embedding and confident patch- guided network for point cloud matching.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12, 2024. 2

  48. [49]

    Hypformer: Exploring effi- cient transformer fully in hyperbolic space

    Menglin Yang, Harshit Verma, Delvin Ce Zhang, Jiahong Liu, Irwin King, and Rex Ying. Hypformer: Exploring effi- cient transformer fully in hyperbolic space. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Dis- covery and Data Mining, pages 3770–3781, 2024. 2

  49. [50]

    3d question answering.IEEE Transactions on Visualization and Computer Graphics, 30(3):1772–1786, 2022

    Shuquan Ye, Dongdong Chen, Songfang Han, and Jing Liao. 3d question answering.IEEE Transactions on Visualization and Computer Graphics, 30(3):1772–1786, 2022. 2

  50. [51]

    X-trans2cap: Cross- modal knowledge transfer using transformer for 3d dense captioning

    Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guan- bin Li, Shuguang Cui, and Zhen Li. X-trans2cap: Cross- modal knowledge transfer using transformer for 3d dense captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8563– 8573, 2022. 2, 6

  51. [52]

    Hype-han: Hyperbolic hierarchical attention network for semantic embedding

    Chengkun Zhang and Junbin Gao. Hype-han: Hyperbolic hierarchical attention network for semantic embedding. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 3990–3996, 2021. 1

  52. [53]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022. 5

  53. [54]

    Assembler: Scalable 3d part assembly via anchor point diffusion.arXiv preprint arXiv:2506.17074, 2025

    Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, and Ying Shan. Assembler: Scalable 3d part assembly via anchor point diffusion.arXiv preprint arXiv:2506.17074, 2025. 1

  54. [55]

    Contextual modeling for 3d dense captioning on point clouds.arXiv preprint arXiv:2210.03925, 2022

    Yufeng Zhong, Long Xu, Jiebo Luo, and Lin Ma. Contextual modeling for 3d dense captioning on point clouds.arXiv preprint arXiv:2210.03925, 2022. 6

  55. [56]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable trans- formers for end-to-end object detection.arXiv preprint arXiv:2010.04159, 2020. 5