SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

Bo Xu; Guoqi Li; Peixi Wu; Shaowei Gu; Xinhao Luo; Xuerui Qiu; Yaozhi Wen; Yuqi Pan

arXiv: 2505.17674 · v2 · pith:56KQOOUOnew · submitted 2025-05-23 · cs.CV

Xuerui Qiu, Peixi Wu, Yaozhi Wen, Shaowei Gu, Yuqi Pan, Xinhao Luo, Bo Xu, Guoqi Li

Pith reviewed 2026-05-22 02:07 UTC · model grok-4.3

keywords spiking neural networks · vision-language pretraining · 3D open-world understanding · zero-shot classification · multimodal contrastive learning · energy-efficient inference · 3D scene understanding

The pith

A spike-based vision-language pretraining framework enables spiking neural networks to match or exceed artificial networks in zero-shot 3D classification and open-world tasks while preserving energy efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SVL, a pretraining approach that aligns 3D point clouds, images, and text through label-free contrastive learning to give spiking neural networks multimodal capabilities. This setup targets the performance gap where spiking models have lagged behind conventional networks in generalization and complex 3D understanding. If the alignment holds, spiking networks could handle zero-shot 3D classification at 85.4 percent top-1 accuracy and support downstream tasks such as detection, segmentation, and question answering without heavy computational overhead. The work emphasizes hardware-friendly inference through re-parameterization that avoids running large text encoders at test time. A sympathetic reader would see this as a route to energy-efficient 3D perception systems that operate on neuromorphic hardware.

Core claim

SVL combines Multi-scale Triple Alignment for contrastive learning across 3D, image, and text modalities with Re-parameterizable Vision-Language Integration to produce a lightweight spiking model. The resulting network reaches 85.4 percent top-1 accuracy on zero-shot 3D classification, surpassing several advanced artificial networks, and delivers consistent gains over prior spiking models on 3D classification, DVS action recognition, 3D detection, and 3D segmentation. The same pretraining also supports open-world 3D question answering, sometimes exceeding artificial-network baselines while retaining spike-driven efficiency.
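The re-parameterization idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that class text embeddings are computed once by a frozen text encoder, scaled by e^τ, and stored as the weights of a linear classifier, so that zero-shot prediction needs only the 3D feature and a matrix product. The function names, the L2 normalization, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fold_text_embeddings(text_embeddings, tau):
    """Precompute classifier weights: exp(tau) times each normalized text embedding.

    Assumption: this mirrors the general shape of a re-parameterized text head;
    the paper's exact formulation may differ.
    """
    scale = math.exp(tau)
    return [[scale * x for x in l2_normalize(e)] for e in text_embeddings]


def zero_shot_classify(feature, weights):
    """Classify a 3D feature with the folded weights; no text encoder is called."""
    f = l2_normalize(feature)
    logits = [sum(w * x for w, x in zip(row, f)) for row in weights]
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, logits


# Hypothetical hand-written embeddings standing in for a text encoder's output.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = fold_text_embeddings(text_embeddings, tau=math.log(10.0))

# Hypothetical 3D feature that lies closest to the second class direction.
pred, logits = zero_shot_classify([0.1, 0.9, 0.2], weights)
print(pred, [round(v, 3) for v in logits])
```

In this sketch the text encoder is only needed while building `weights`; at test time the classifier is a fixed matrix, which is the property the efficiency claim depends on.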

What carries the argument

Multi-scale Triple Alignment (MTA), a label-free triplet contrastive objective that aligns features from 3D, image, and text modalities at multiple scales to build cross-modal representations.
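A single-scale version of such an objective can be sketched in plain Python. The paper passage quoted on this page gives the loss as a sum of two InfoNCE terms (3D to text, 3D to image) and one MSE term (3D to image); everything else here is an assumption: the temperature value, the one-directional InfoNCE, the unnormalized MSE, and the omission of the multi-scale weights, whose form is not specified on this page.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Mean InfoNCE loss: anchors[i] should match targets[i]; other targets act as negatives."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def mse(xs, ys):
    """Mean squared error over all feature entries."""
    count = sum(len(x) for x in xs)
    return sum((p - q) ** 2 for x, y in zip(xs, ys) for p, q in zip(x, y)) / count


def triple_alignment_loss(shape_feats, image_feats, text_feats):
    """Single-scale sketch: NCE(S, T) + NCE(S, I) + MSE(S, I)."""
    return (info_nce(shape_feats, text_feats)
            + info_nce(shape_feats, image_feats)
            + mse(shape_feats, image_feats))


# Toy features (hypothetical): two samples, two dimensions.
shapes = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

aligned = triple_alignment_loss(shapes, images, texts)
shuffled = triple_alignment_loss(shapes, images, texts[::-1])
print(aligned, shuffled)
```

With correctly paired triplets the loss is close to zero, and swapping the text order raises it sharply, which is the behavior a contrastive alignment objective is meant to have.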

If this is right

Spiking models gain 6.1 percent on 3D classification tasks compared with earlier spiking networks.
Performance rises 2.1 percent on DVS action recognition and 1.1 percent on 3D detection.
3D segmentation improves by 2.1 percent while inference stays spike-driven and low-power.
The framework supports open-world 3D question answering without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The re-parameterization step could allow the same pretrained weights to run on resource-constrained edge devices that lack floating-point units.
Extending the triplet alignment to include temporal sequences might improve handling of dynamic 3D scenes such as video-based navigation.
If the alignment proves robust, similar contrastive pretraining could be applied to other spiking modalities like audio or tactile sensing.

Load-bearing premise

The contrastive alignment learned from the chosen 3D-image-text triplets will transfer to new open-world 3D tasks without domain-specific biases or undisclosed tuning.

What would settle it

A test on a 3D dataset drawn from a different distribution, such as indoor scenes after training on outdoor data, showing accuracy falling below prior spiking baselines would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2505.17674 by Bo Xu, Guoqi Li, Peixi Wu, Shaowei Gu, Xinhao Luo, Xuerui Qiu, Yaozhi Wen, Yuqi Pan.

**Figure 2.** Dialogues between SVL-13B and a human user. The dialogues show SVL’s ability to …

**Figure 3.** Architecture details of open-world multimodal learning.

read the original abstract

Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SVL gives SNNs a workable pretraining route for 3D multimodal tasks via triple alignment and re-parameterization, but the headline zero-shot accuracy needs explicit confirmation that the large text encoder is actually dropped at test time.

SVL's main move is to pretrain spiking networks with label-free contrastive alignment across 3D, image, and text at multiple scales, then add a re-parameterizable module so the text side can be folded in without keeping a heavy encoder around at inference. That combination is the concrete new piece relative to prior SNN work on 3D or single-modality tasks. The experiments report consistent lifts over earlier spiking baselines on classification, detection, segmentation, and even some open-world QA, plus the expected energy advantage. Code release helps anyone who wants to test the pipeline directly. Those are the parts that hold up from the abstract and the described setup.

The soft spot sits in the zero-shot protocol. The 85.4% top-1 figure is presented as beating some ANN models, yet the efficiency story rests on Rep-VLI actually replacing the large text encoder during evaluation. If the reported number still uses the full encoder for query embeddings, the hardware-friendly claim does not follow from the architecture. I would want to see the exact inference path spelled out and an ablation that isolates the re-parameterized branch. The other gains are smaller and look plausible, but without more detail on baselines, seeds, or statistical tests they remain hard to weigh precisely.

This paper is for groups working on neuromorphic hardware or low-power 3D perception who already know the SNN literature. A reader looking for concrete ways to bring vision-language ideas into spiking models will find usable pieces even if the numbers need checking. It has enough experimental scope and a clear technical direction to deserve a serious referee, mainly to pin down the evaluation details and confirm the re-parameterization actually delivers the claimed savings.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SVL, a spike-based vision-language pretraining framework for SNNs targeting efficient 3D open-world understanding. It introduces Multi-scale Triple Alignment (MTA) for label-free triplet contrastive learning across 3D, image, and text modalities and Re-parameterizable Vision-Language Integration (Rep-VLI) to support lightweight inference without large text encoders. Reported results include 85.4% top-1 accuracy on zero-shot 3D classification (surpassing advanced ANN models), plus gains on downstream tasks (3D classification +6.1%, DVS action recognition +2.1%, 3D detection +1.1%, 3D segmentation +2.1%), with additional capability for open-world 3D question answering. Code is released.

Significance. If the performance numbers are robust and the efficiency gains are realized through the SNN pipeline and Rep-VLI without reliance on unreplaced large encoders at test time, the work could meaningfully advance energy-efficient multimodal 3D models and reduce the SNN-ANN gap in open-world settings. Code availability aids reproducibility.

major comments (2)

[3.2] §3.2: The zero-shot 3D classification protocol yielding 85.4% top-1 accuracy does not explicitly confirm that only the re-parameterized Rep-VLI branch is active at inference while the large text encoder is disabled. This detail is load-bearing for the efficiency, hardware-friendly, and superiority-over-ANN claims, as the numerical result cannot otherwise be attributed to the proposed SNN+SVL pipeline rather than a standard CLIP-style encoder.
[4] §4 (experimental results): The reported percentage improvements on downstream tasks lack accompanying details on run count, variance, or statistical tests. Without these, the cross-task outperformance claims over prior SNNs rest on single-point comparisons whose robustness cannot be assessed.

minor comments (2)

[Abstract] The abstract and §1 refer to 'remarkable efficiency' without providing concrete metrics (e.g., energy per inference or latency) relative to the ANN and SNN baselines.
[3.1] Notation for the multi-scale alignment weights should be introduced once and used consistently; the current description leaves their exact functional form ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Referee: [3.2] §3.2: The zero-shot 3D classification protocol yielding 85.4% top-1 accuracy does not explicitly confirm that only the re-parameterized Rep-VLI branch is active at inference while the large text encoder is disabled. This detail is load-bearing for the efficiency, hardware-friendly, and superiority-over-ANN claims, as the numerical result cannot otherwise be attributed to the proposed SNN+SVL pipeline rather than a standard CLIP-style encoder.

Authors: We thank the referee for highlighting this important clarification. The Rep-VLI component is specifically introduced to re-parameterize the vision-language fusion learned during pretraining, so that the large text encoder can be removed entirely at inference. The zero-shot 3D classification protocol uses only the SNN with the re-parameterized Rep-VLI branch; the text encoder participates solely in the MTA pretraining stage. To make this explicit and remove any ambiguity, we will add a dedicated paragraph in Section 3.2 describing the inference pipeline and insert a confirming statement in the experimental setup section. These changes will appear in the revised manuscript.

revision: yes

Referee: [4] §4 (experimental results): The reported percentage improvements on downstream tasks lack accompanying details on run count, variance, or statistical tests. Without these, the cross-task outperformance claims over prior SNNs rest on single-point comparisons whose robustness cannot be assessed.

Authors: We thank the referee for this observation. The current manuscript reports the percentage gains from primary experimental runs without variance or run-count information. We agree that providing these details would better substantiate the robustness of the improvements. In the revised version we will rerun the key downstream experiments with three independent random seeds, report mean and standard deviation for each metric, and add a brief note on the consistency of the observed gains across runs.

revision: yes
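The reporting the authors promise (mean and standard deviation over three seeds) is simple to sketch with Python's standard library. The scores below are invented placeholders, not results from the paper, and the helper names are hypothetical. The sketch only shows how a gain can be read next to the run-to-run spread; it is not a statistical test.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


def gain_over_baseline(new_scores, baseline_scores):
    """Difference of means, reported together with both spreads."""
    new_mean, new_std = summarize(new_scores)
    base_mean, base_std = summarize(baseline_scores)
    return new_mean - base_mean, new_std, base_std


# Placeholder accuracies for three seeds each (NOT taken from the paper).
baseline_runs = [70.0, 71.0, 72.0]
pretrained_runs = [73.0, 74.0, 75.0]

gain, new_std, base_std = gain_over_baseline(pretrained_runs, baseline_runs)
print(f"gain = {gain:.1f}, std (new) = {new_std:.1f}, std (baseline) = {base_std:.1f}")
```

A gain that is several times larger than both standard deviations is easier to trust than a single-run difference of the same size, which is the concern the referee raises.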

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

full rationale

The paper introduces SVL as a pretraining method with MTA contrastive alignment and Rep-VLI reparameterization, then reports measured accuracies on standard datasets (e.g., 85.4% zero-shot 3D classification, +6.1% on 3D classification). These are presented as experimental outcomes, not as mathematical predictions or first-principles derivations that reduce to the inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to force its central claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the transferability of contrastive learning objectives to spiking networks and the validity of re-parameterization for removing text encoders at test time; these are treated as standard techniques rather than newly proven.

free parameters (1)

multi-scale alignment weights
Scaling factors across 3D, image, and text modalities are introduced to balance the triplet loss but their specific values are not detailed in the abstract.

axioms (1)

Domain assumption: spiking neural networks can be effectively optimized with contrastive objectives originally developed for artificial neural networks.
Invoked when applying label-free triplet alignment to SNNs without additional theoretical justification in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities... $\mathcal{L}_{\mathrm{NCE}}(S,T) + \mathcal{L}_{\mathrm{NCE}}(S,I) + \mathcal{L}_{\mathrm{MSE}}(S,I)$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Re-parameterizable Vision-Language Integration (Rep-VLI)... $W_{L_i} = e^{\tau} E^{T}_{\theta}(T^{t}_{i})$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 4 internal anchors

[1]

Towards spike-based machine intelligence with neuromorphic computing

Kaushik Roy, Akhilesh Jaiswal, Priyadarshini Panda, and ruijie zhu. Towards spike-based machine intelligence with neuromorphic computing. Nature, 575(7784):607–617, 2019

work page 2019
[2]

Towards artificial general intelligence with hybrid tianjic chip architecture

Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019

work page 2019
[3]

Networks of spiking neurons: the third generation of neural network models

Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997

work page 1997
[4]

Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip

Man Yao, Ole Richter, Guangshe Zhao, Ning Qiao, Yannan Xing, Dingheng Wang, Tianxiang Hu, Wei Fang, Tugba Demirci, Michele De Marchi, et al. Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip. Nature Communications, 15(1):4464, 2024

work page 2024
[5]

Efficient 3d recognition with event-driven spike sparse convolution

Xuerui Qiu, Man Yao, Jieyuan Zhang, Yuhong Chou, Ning Qiao, Shibo Zhou, Bo Xu, and Guoqi Li. Efficient 3d recognition with event-driven spike sparse convolution. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 39, pages 20086–20094, 2025

work page 2025
[6]

Scaling spike-driven transformer with efficient spike firing approximation training

Man Yao, Xuerui Qiu, Tianxiang Hu, Jiakui Hu, Yuhong Chou, Keyu Tian, Jianxing Liao, Luziwei Leng, Bo Xu, and Guoqi Li. Scaling spike-driven transformer with efficient spike firing approximation training. IEEE Transactions on Pattern Analysis and Machine Intelligence, (01):1–18, 2025

work page 2025
[7]

Spikformer v2: Join the high accuracy club on imagenet with an snn ticket

Zhaokun Zhou, Kaiwei Che, Wei Fang, Keyu Tian, Yuesheng Zhu, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer v2: Join the high accuracy club on imagenet with an snn ticket. arXiv preprint arXiv:2401.02020, 2024

work page arXiv 2024
[8]

Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning

Chankyu Lee, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in Neuroscience, 12:435, 2018

work page 2018
[9]

Spikebert: A language spikformer trained with two-stage knowl- edge distillation from bert

Changze Lv, Tianlong Li, Jianhan Xu, Chenxi Gu, Zixuan Ling, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Spikebert: A language spikformer trained with two-stage knowl- edge distillation from bert. arXiv preprint arXiv:2308.15122, 2023

work page arXiv 2023
[10]

Spikingbert: Distilling bert to train spiking language models using implicit differentiation

Malyaban Bal and Abhronil Sengupta. Spikingbert: Distilling bert to train spiking language models using implicit differentiation. In Proceedings of the AAAI conference on artificial intelligence (AAAI), volume 38, pages 10998–11006, 2024

work page 2024
[11]

Spikeclip: A contrastive language-image pretrained spiking neural network

Changze Lv, Tianlong Li, Wenhao Liu, Yufei Gu, Jianhan Xu, Cenyuan Zhang, Muling Wu, Xiaoqing Zheng, and Xuanjing Huang. Spikeclip: A contrastive language-image pretrained spiking neural network. Neural Networks, page 107475, 2025

work page 2025
[12]

Pointllm: Empowering large language models toÂ understand point clouds

Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models toÂ understand point clouds. In European Conference on Computer Vision (ECCV), pages 131–147, 2024

work page 2024
[13]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, and et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

work page 2021
[14]

Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Mart’in-Mart’in, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1189, 2023. 10

work page 2023
[15]

Ulip-2: Towards scalable multimodal pre-training for 3d understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27091–27101, 2024

work page 2024
[16]

Openshape: Scaling up 3d shape representation towards open-world understanding

Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. Openshape: Scaling up 3d shape representation towards open-world understanding. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 44860–44879, 2023

work page 2023
[17]

A synaptic model of memory: long-term potentiation in the hippocampus

Tim VP Bliss and Graham L Collingridge. A synaptic model of memory: long-term potentiation in the hippocampus. Nature, 361(6407):31–39, 1993

work page 1993
[18]

Spikformer: When spiking neural network meets transformer

Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng YAN, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023
[19]

Pointclip: Point cloud understanding by clip

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Jiao Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8542–8552, 2021

work page 2021
[20]

Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. IEEE/CVF International Conference on Computer Vision (ICCV), pages 2639–2650, 2022

work page 2022
[21]

Clip2: Contrastive language-image-point pretraining from real-world point cloud data

Yi Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chao Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, and Hang Xu. Clip2: Contrastive language-image-point pretraining from real-world point cloud data. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15244–15253, 2023

work page 2023
[22]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017

work page 2017
[23]

Space-time event clouds for ges- ture recognition: From rgb cameras to event cameras

Qinyi Wang, Yexin Zhang, Junsong Yuan, and Yilong Lu. Space-time event clouds for ges- ture recognition: From rgb cameras to event cameras. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1826–1835. IEEE, 2019

work page 2019
[24]

3d shapenets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015

work page 1912
[25]

Spiking pointnet: Spiking neural networks for point clouds

Dayong Ren, Zhe Ma, Yuanpei Chen, Weihang Peng, Xiaode Liu, Yuhan Zhang, and Yufei Guo. Spiking pointnet: Spiking neural networks for point clouds. Advances in Neural Information Processing Systems (NeurIPS), 36:41797–41808, 2024

work page 2024
[26]

Point-to-spike residual learning for energy-efficient 3d point cloud classification

Qiaoyun Wu, Quanxiao Zhang, Chunyu Tan, Yun Zhou, and Changyin Sun. Point-to-spike residual learning for energy-efficient 3d point cloud classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38, pages 6092–6099, 2024

work page 2024
[27]

Spikingpoint: Rethink- ing point as spike for efficient 3d point cloud analysis

Zhaokun Zhou, Yijie Lu, Jiaqiyu Zhan, Guibo Luo, and Yuesheng Zhu. Spikingpoint: Rethink- ing point as spike for efficient 3d point cloud analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

work page 2025
[28]

Brain inspired computing: A systematic survey and future trends

Guoqi Li, Lei Deng, Huajing Tang, Gang Pan, Yonghong Tian, Kaushik Roy, and Wolfgang Maass. Brain inspired computing: A systematic survey and future trends. Authorea Preprints, 2023

work page 2023
[29]

Spatio-temporal backpropagation for training high-performance spiking neural networks

Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12:331, 2018. 11

work page 2018
[30]

Integer-valued training and spike-driven inference spiking neural network for high-performance and energy-efficient object detection

Xinhao Luo, Man Yao, Yuhong Chou, Bo Xu, and Guoqi Li. Integer-valued training and spike-driven inference spiking neural network for high-performance and energy-efficient object detection. arXiv preprint arXiv:2407.20708, 2024

work page arXiv 2024
[31]

Quantized spike-driven transformer

Xuerui Qiu, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Malu Zhang, and Haizhou Li. Quantized spike-driven transformer. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

work page 2025
[32]

Deep residual learning for im- age recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

work page 2016
[33]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[34]

Attention is all you need

A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[35]

Representation Learning with Contrastive Predictive Coding

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

Towards open world active learning for 3d object detection

Zhuoxiao Chen, Yadan Luo, Zixin Wang, Zijian Wang, Xin Yu, and Zi Huang. Towards open world active learning for 3d object detection. arXiv preprint arXiv:2310.10391, 2023

work page arXiv 2023
[37]

Ego- lifter: Open-world 3d segmentation for egocentric perception

Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, and Chris Sweeney. Ego- lifter: Open-world 3d segmentation for egocentric perception. In European Conference on Computer Vision (ECCV), pages 382–400. Springer, 2025

work page 2025
[38]

Openclip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021

work page 2021
[39]

Point-bert: Pre-training 3d point cloud transformers with masked point modeling

Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (ICCV), pages 19313–19322, 2022

work page 2022
[40]

Submanifold Sparse Convolutional Networks

Benjamin Graham, Laurens Van der Maaten, Zhu Ruijie, and Li Guoqi. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023

work page 2023
[42]

Attention spiking neural networks

Man Yao, Guangshe Zhao, Hengyu Zhang, Yifan Hu, Lei Deng, Yonghong Tian, Bo Xu, and Guoqi Li. Attention spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):9393–9410, 2023

work page 2023
[43]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023

work page 2023
[45]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 34892–34916, 2023. 12

work page 2023
[46]

Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data

Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 1588–1597, 2019

work page 2019
[47]

Semantickitti: A dataset for semantic scene understanding of lidar sequences

Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9297–9307, 2019

work page 2019
[49]

A low power, fully event-based gesture recognition system

Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7243–7252, 2017

[50]

Neuromorphic vision datasets for pedestrian detection, action recognition, and fall detection

Shu Miao, Guang Chen, Xiangyu Ning, Yang Zi, Kejia Ren, Zhenshan Bing, and Alois Knoll. Neuromorphic vision datasets for pedestrian detection, action recognition, and fall detection. Frontiers in Neurorobotics, 13:38, 2019

[51]

Point transformer

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16259–16268, 2021

[52]

Efficient converted spiking neural network for 3d and 2d classification

Yuxiang Lan, Yachao Zhang, Xu Ma, Yanyun Qu, and Yun Fu. Efficient converted spiking neural network for 3d and 2d classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9211–9220, 2023

[53]

A free lunch from ann: Towards efficient, accurate spiking neural networks calibration

Yuhang Li, Shikuang Deng, Xin Dong, Ruihao Gong, and Shi Gu. A free lunch from ann: Towards efficient, accurate spiking neural networks calibration. In International Conference on Machine Learning (ICML), pages 6316–6325. PMLR, 2021

[54]

Gated attention coding for training high-performance and efficient spiking neural networks

Xuerui Qiu, Rui-Jie Zhu, Yuhong Chou, Zhaorui Wang, Liang-jian Deng, and Guoqi Li. Gated attention coding for training high-performance and efficient spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 38, pages 601–610, 2024

[55]

Online training through time for spiking neural networks

Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Di He, and Zhouchen Lin. Online training through time for spiking neural networks. Advances in Neural Information Processing Systems (NeurIPS), 35:20717–20730, 2022

[56]

High-performance temporal reversible spiking neural networks with o(l) training memory and o(1) inference cost

JiaKui Hu, Man Yao, Xuerui Qiu, Yuhong Chou, Yuxuan Cai, Ning Qiao, Yonghong Tian, Bo Xu, and Guoqi Li. High-performance temporal reversible spiking neural networks with o(l) training memory and o(1) inference cost. arXiv preprint arXiv:2405.16466, 2024

[57]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5828–5839, 2017

[58]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 20482–20494, 2023

[59]

Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023

[60]

Direct training for spiking neural networks: Faster, larger, better

Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan Xie, and Luping Shi. Direct training for spiking neural networks: Faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 1311–1318, 2019

[61]

1.1 computing’s energy problem (and what we can do about it)

Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14. IEEE, 2014

[62]

Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization

Nitin Rathi and Kaushik Roy. Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Transactions on Neural Networks and Learning Systems, 34(6):3174–3182, 2021

[63]

Vtsnn: a virtual temporal spiking neural network

Xue-Rui Qiu, Zhao-Rui Wang, Zheng Luan, Rui-Jie Zhu, Xiao Wu, Ma-Lu Zhang, and Liang-Jian Deng. Vtsnn: a virtual temporal spiking neural network. Frontiers in Neuroscience, 17:1091097, 2023

[64]

Are we ready for autonomous driving? the kitti vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012

[65]

Voxel r-cnn: Towards high performance voxel-based 3d object detection

Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 35, pages 1201–1209, 2021

[66]

Openpcdet: An open-source toolbox for 3d object detection from point clouds, 2020

OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds, 2020

[67]

Pointcept: A codebase for point cloud perception research, 2023

Pointcept Contributors. Pointcept: A codebase for point cloud perception research, 2023

[68]

4d spatio-temporal convnets: Minkowski convolutional neural networks

Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3075–3084, 2019

[69]

Masked scene contrast: A scalable framework for unsupervised 3d representation learning

Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsupervised 3d representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9415–9424, 2023.

Appendix A Backpropagation process of I-LIF

There exist two primary methods of training high-pe...

