arxiv: 2505.17674 · v2 · pith:56KQOOUOnew · submitted 2025-05-23 · 💻 cs.CV

SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

Pith reviewed 2026-05-22 02:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords spiking neural networks · vision-language pretraining · 3D open-world understanding · zero-shot classification · multimodal contrastive learning · energy-efficient inference · 3D scene understanding

The pith

A spike-based vision-language pretraining framework enables spiking neural networks to match or exceed artificial networks in zero-shot 3D classification and open-world tasks while preserving energy efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SVL, a pretraining approach that aligns 3D point clouds, images, and text through label-free contrastive learning to give spiking neural networks multimodal capabilities. This setup targets the performance gap where spiking models have lagged behind conventional networks in generalization and complex 3D understanding. If the alignment holds, spiking networks could handle zero-shot 3D classification at 85.4 percent top-1 accuracy and support downstream tasks such as detection, segmentation, and question answering without heavy computational overhead. The work emphasizes hardware-friendly inference through re-parameterization that avoids running large text encoders at test time. A sympathetic reader would see this as a route to energy-efficient 3D perception systems that operate on neuromorphic hardware.

Core claim

SVL combines Multi-scale Triple Alignment for contrastive learning across 3D, image, and text modalities with Re-parameterizable Vision-Language Integration to produce a lightweight spiking model. The resulting network reaches 85.4 percent top-1 accuracy on zero-shot 3D classification, surpassing several advanced artificial networks, and delivers consistent gains over prior spiking models on 3D classification, DVS action recognition, 3D detection, and 3D segmentation. The same pretraining also supports open-world 3D question answering, sometimes exceeding artificial-network baselines while retaining spike-driven efficiency.
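
To make the efficiency argument concrete, here is a minimal Python sketch of one generic way to run zero-shot classification without a text encoder at test time: the class-prompt embeddings are computed once offline and cached as the fixed weights of a linear layer. This is an illustration under stated assumptions, not the paper's Rep-VLI procedure, which may differ. The function names `build_classifier_weights` and `zero_shot_predict` and the array shapes are hypothetical.

```python
import numpy as np

def build_classifier_weights(text_embeddings):
    """Turn cached class-prompt embeddings (C, D) into a fixed (D, C) weight matrix.

    Assumption: the text encoder ran once, offline, to produce text_embeddings.
    """
    norms = np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return (text_embeddings / norms).T

def zero_shot_predict(point_features, weights):
    """Classify 3D features (N, D) by cosine similarity to the cached class weights."""
    norms = np.linalg.norm(point_features, axis=1, keepdims=True)
    similarities = (point_features / norms) @ weights  # shape (N, C)
    return similarities.argmax(axis=1)
```

Under this assumption, inference needs only the 3D encoder and a single matrix product, so the cost of the text encoder is paid once rather than per sample.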

What carries the argument

Multi-scale Triple Alignment (MTA), a label-free triplet contrastive objective that aligns features from 3D, image, and text modalities at multiple scales to build cross-modal representations.
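
The exact MTA objective is not given on this page, so the following is only a sketch of what a multi-scale, three-modality contrastive loss could look like. It assumes a standard symmetric InfoNCE loss (as used in CLIP-style training), one 3D embedding per scale, a single image and text embedding per sample, and per-scale weights. The names `info_nce`, `triple_alignment_loss`, and `scale_weights`, and the temperature value, are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings, each (N, D).

    Row i of `a` and row i of `b` are treated as the positive pair.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def triple_alignment_loss(pc_feats, img_feat, txt_feat, scale_weights):
    """Assumed form: weighted 3D-image and 3D-text terms per scale, plus one image-text term.

    pc_feats: list of (N, D) point-cloud embeddings, one per scale.
    """
    total = info_nce(img_feat, txt_feat)
    for w, pc in zip(scale_weights, pc_feats):
        total += w * (info_nce(pc, img_feat) + info_nce(pc, txt_feat))
    return total
```

No labels appear anywhere in the loss: the only supervision is which 3D sample, image, and caption belong together, which is what "label-free" refers to here.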

If this is right

  • Spiking models gain 6.1 percent on 3D classification tasks compared with earlier spiking networks.
  • Performance rises 2.1 percent on DVS action recognition and 1.1 percent on 3D detection.
  • 3D segmentation improves by 2.1 percent while inference stays spike-driven and low-power.
  • The framework supports open-world 3D question answering without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The re-parameterization step could allow the same pretrained weights to run on resource-constrained edge devices that lack floating-point units.
  • Extending the triplet alignment to include temporal sequences might improve handling of dynamic 3D scenes such as video-based navigation.
  • If the alignment proves robust, similar contrastive pretraining could be applied to other spiking modalities like audio or tactile sensing.

Load-bearing premise

The contrastive alignment learned from the chosen 3D-image-text triplets will transfer to new open-world 3D tasks without domain-specific biases or undisclosed tuning.

What would settle it

A test on a 3D dataset drawn from a different distribution, such as indoor scenes after training on outdoor data, showing accuracy falling below prior spiking baselines would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2505.17674 by Bo Xu, Guoqi Li, Peixi Wu, Shaowei Gu, Xinhao Luo, Xuerui Qiu, Yaozhi Wen, Yuqi Pan.

Figure 1: Overall architecture and applications of our SVL. (a) In pretraining, we proposed Multi [caption truncated]
Figure 2: Dialogues between SVL-13B and a human user. The dialogues show SVL’s ability to [caption truncated]
Figure 3: Architecture details of open-world multimodel learning.
Original abstract

Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SVL, a spike-based vision-language pretraining framework for SNNs targeting efficient 3D open-world understanding. It introduces Multi-scale Triple Alignment (MTA) for label-free triplet contrastive learning across 3D, image, and text modalities and Re-parameterizable Vision-Language Integration (Rep-VLI) to support lightweight inference without large text encoders. Reported results include 85.4% top-1 accuracy on zero-shot 3D classification (surpassing advanced ANN models), plus gains on downstream tasks (3D classification +6.1%, DVS action recognition +2.1%, 3D detection +1.1%, 3D segmentation +2.1%), with additional capability for open-world 3D question answering. Code is released.

Significance. If the performance numbers are robust and the efficiency gains are realized through the SNN pipeline and Rep-VLI without reliance on unreplaced large encoders at test time, the work could meaningfully advance energy-efficient multimodal 3D models and reduce the SNN-ANN gap in open-world settings. Code availability aids reproducibility.

major comments (2)
  1. [3.2] §3.2: The zero-shot 3D classification protocol yielding 85.4% top-1 accuracy does not explicitly confirm that only the re-parameterized Rep-VLI branch is active at inference while the large text encoder is disabled. This detail is load-bearing for the efficiency, hardware-friendly, and superiority-over-ANN claims, as the numerical result cannot otherwise be attributed to the proposed SNN+SVL pipeline rather than a standard CLIP-style encoder.
  2. [4] §4 (experimental results): The reported percentage improvements on downstream tasks lack accompanying details on run count, variance, or statistical tests. Without these, the cross-task outperformance claims over prior SNNs rest on single-point comparisons whose robustness cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract and §1 refer to 'remarkable efficiency' without providing concrete metrics (e.g., energy per inference or latency) relative to the ANN and SNN baselines.
  2. [3.1] Notation for the multi-scale alignment weights should be introduced once and used consistently; the current description leaves their exact functional form ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

  1. Referee: [3.2] §3.2: The zero-shot 3D classification protocol yielding 85.4% top-1 accuracy does not explicitly confirm that only the re-parameterized Rep-VLI branch is active at inference while the large text encoder is disabled. This detail is load-bearing for the efficiency, hardware-friendly, and superiority-over-ANN claims, as the numerical result cannot otherwise be attributed to the proposed SNN+SVL pipeline rather than a standard CLIP-style encoder.

    Authors: We thank the referee for raising this point. The Rep-VLI component is specifically introduced to re-parameterize the vision-language fusion learned during pretraining, so that the large text encoder can be removed entirely at inference. The zero-shot 3D classification protocol uses only the SNN with the re-parameterized Rep-VLI branch; the text encoder participates solely in the MTA pretraining stage. To make this explicit and remove any ambiguity, we will add a dedicated paragraph in Section 3.2 describing the inference pipeline and insert a confirming statement in the experimental setup section. These changes will appear in the revised manuscript.
    Revision: yes

  2. Referee: [4] §4 (experimental results): The reported percentage improvements on downstream tasks lack accompanying details on run count, variance, or statistical tests. Without these, the cross-task outperformance claims over prior SNNs rest on single-point comparisons whose robustness cannot be assessed.

    Authors: We thank the referee for this observation. The current manuscript reports the percentage gains from primary experimental runs without variance or run-count information. We agree that providing these details would better substantiate the robustness of the improvements. In the revised version we will rerun the key downstream experiments with three independent random seeds, report mean and standard deviation for each metric, and add a brief note on the consistency of the observed gains across runs.
    Revision: yes
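
The multi-seed reporting promised in the second response can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the helper names `summarize_runs` and `gain_exceeds_noise` are hypothetical, and the threshold `k` is an arbitrary choice for a rough sanity check, not a formal statistical test.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)

def gain_exceeds_noise(baseline_runs, new_runs, k=2.0):
    """Rough check: is the mean gain larger than k times the larger standard deviation?"""
    base_mean, base_std = summarize_runs(baseline_runs)
    new_mean, new_std = summarize_runs(new_runs)
    return (new_mean - base_mean) > k * max(base_std, new_std)
```

With only three seeds the standard deviation is itself a noisy estimate, so a check like this can flag gains that sit within run-to-run variation but cannot establish significance.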

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

The paper introduces SVL as a pretraining method with MTA contrastive alignment and Rep-VLI reparameterization, then reports measured accuracies on standard datasets (e.g., 85.4% zero-shot 3D classification, +6.1% on 3D classification). These are presented as experimental outcomes, not as mathematical predictions or first-principles derivations that reduce to the inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to force its central claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the transferability of contrastive learning objectives to spiking networks and the validity of re-parameterization for removing text encoders at test time; these are treated as standard techniques rather than newly proven.

free parameters (1)
  • multi-scale alignment weights
    Scaling factors across 3D, image, and text modalities are introduced to balance the triplet loss but their specific values are not detailed in the abstract.
axioms (1)
  • domain assumption: Spiking neural networks can be effectively optimized with contrastive objectives originally developed for artificial neural networks
    Invoked when applying label-free triplet alignment to SNNs without additional theoretical justification in the abstract.
