pith. machine review for the scientific record.

arxiv: 2605.11558 · v1 · submitted 2026-05-12 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links


A Composite Activation Function for Learning Stable Binary Representations

Choeun Kim, Insung Kong, Kwanho Lee, Sehyun Park, Seokhun Park, Yongdai Kim

Pith reviewed 2026-05-13 01:59 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords activation function · Heaviside approximation · binary neural networks · spiking neural networks · gradient-based optimization · interpretable models · discrete representations

The pith

HTAF, a sigmoid-tanh composite, lets networks with binary Heaviside activations train stably via standard gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Heavy Tailed Activation Function (HTAF) as a smooth stand-in for the non-differentiable Heaviside step. It builds the function by composing a sigmoid with a hyperbolic tangent so that gradients stay large near zero inputs and decay slowly in the tails. This property supports reliable convergence when training spiking networks, binary networks, and deep Heaviside networks without special surrogate gradients. The same activation also produces discrete internal features that support an interpretable Implicit Concept Bottleneck Model for image classification. Experiments across standard architectures and datasets show that discretization occurs stably while accuracy matches or exceeds that of conventional continuous models.

Core claim

HTAF is constructed as a composite of the sigmoid and hyperbolic tangent functions and is shown theoretically to maintain a large gradient mass around zero inputs while exhibiting slower gradient decay in the tail regions. Spiking Neural Networks, Binary Neural Networks, and Deep Heaviside Neural Networks can therefore be trained stably using HTAF with gradient-based optimization. The function further induces discrete feature representations inside Implicit Concept Bottleneck Models, which achieve prediction performance comparable to or better than standard models on image datasets.

What carries the argument

Heavy Tailed Activation Function (HTAF), a smooth composite of sigmoid and hyperbolic tangent that approximates the Heaviside step while preserving strong gradients near the origin and slow tail decay.
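The review does not reproduce the paper's closed-form definition of HTAF. The sketch below is one composite consistent with the description, an outer sigmoid applied to a scaled hyperbolic tangent with shape parameters α0 and α1 as in the figure captions; the exact form and the default parameter values are assumptions, not the authors' definition.

```python
# Hypothetical HTAF-style composite: an outer sigmoid applied to a scaled tanh.
# Exact closed form and default (alpha0, alpha1) are assumptions, not the
# authors' definition.
import torch

def htaf_like(x: torch.Tensor, alpha0: float = 2.0, alpha1: float = 11.0) -> torch.Tensor:
    """Smooth Heaviside-like surrogate; outputs lie strictly between 0 and 1."""
    return torch.sigmoid(alpha1 * torch.tanh(alpha0 * x))

x = torch.linspace(-6.0, 6.0, 13, requires_grad=True)
y = htaf_like(x)
y.sum().backward()  # populate x.grad with dy/dx via autograd

# Near x = 0 the slope is roughly alpha0 * alpha1 / 4; far from zero it decays
# at a rate set by alpha0, more slowly than a sigmoid with the same peak slope.
for xi, yi, gi in zip(x.detach().tolist(), y.detach().tolist(), x.grad.tolist()):
    print(f"x={xi:+.1f}  out={yi:.4f}  grad={gi:.2e}")
```

For this sketched form the peak slope at zero scales with α0·α1 while the tail decay rate is governed by α0 alone, which matches the reported observation that reducing α0 with α0α1 held fixed mitigates gradient vanishing (Figures 6 and 7).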

If this is right

  • Spiking neural networks become trainable by ordinary back-propagation without surrogate-gradient approximations.
  • Binary neural networks reach stable solutions at full precision during training before any quantization step.
  • Deep networks built entirely from Heaviside units can be optimized end-to-end with the same optimizer used for ReLU networks.
  • Implicit Concept Bottleneck Models produce discrete, human-interpretable features directly from the activation while preserving task performance (a minimal architectural sketch follows this list).
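A minimal sketch of what such a model could look like, assuming only the structure the review describes, a backbone feature extractor, an HTAF-squashed concept layer, and a linear classifier on top; the backbone choice, layer sizes, and the htaf_like surrogate are illustrative, not the authors' implementation.

```python
# Minimal ICBM-style head under assumed structure: backbone features ->
# HTAF-discretized "implicit concepts" -> linear classifier.
import torch
import torch.nn as nn
import torchvision

class ICBMSketch(nn.Module):
    def __init__(self, n_concepts: int = 128, n_classes: int = 10,
                 alpha0: float = 2.0, alpha1: float = 11.0):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()                 # expose 512-d features
        self.backbone = backbone
        self.to_concepts = nn.Linear(512, n_concepts)
        self.alpha0, self.alpha1 = alpha0, alpha1
        self.classifier = nn.Linear(n_concepts, n_classes)

    def concepts(self, x: torch.Tensor) -> torch.Tensor:
        z = self.to_concepts(self.backbone(x))
        # HTAF-style squashing drives concept activations toward {0, 1}
        return torch.sigmoid(self.alpha1 * torch.tanh(self.alpha0 * z))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.concepts(x))

model = ICBMSketch()
logits = model(torch.randn(2, 3, 224, 224))         # shape (2, 10)
```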

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same composite construction might be adapted to other non-differentiable threshold functions used in optimization or control.
  • Because gradients remain usable far from zero, HTAF could reduce the need for gradient clipping or learning-rate schedules in binary architectures.
  • Discrete internal states induced by HTAF may improve robustness to input perturbations that affect continuous activations.

Load-bearing premise

The favorable theoretical gradient profile of the composite function will actually produce stable convergence to useful binary representations during end-to-end gradient training without extra hyperparameter search or post-processing.

What would settle it

Train a binary neural network on a standard image benchmark using HTAF and measure whether the loss decreases steadily to a competitive accuracy level while the activations remain close to binary values; failure to converge or collapse to trivial solutions would falsify the claim.
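A hedged sketch of that settling experiment, assuming CIFAR-10 as the benchmark, a small fully connected network, and a 0.05 closeness threshold for counting an activation as binary; none of these choices are taken from the paper.

```python
# Train a small network whose hidden layer uses an HTAF-like surrogate and
# track training loss together with how binary the hidden activations are.
# Dataset, architecture, optimizer, and thresholds are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def htaf_like(x, alpha0=2.0, alpha1=11.0):
    return torch.sigmoid(alpha1 * torch.tanh(alpha0 * x))

class SmallHeavisideNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32 * 32 * 3, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        h = htaf_like(self.fc1(x.flatten(1)))  # near-binary hidden code
        return self.fc2(h), h

def binarity(h, eps=0.05):
    """Fraction of activations within eps of 0 or 1."""
    return ((h < eps) | (h > 1.0 - eps)).float().mean().item()

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = SmallHeavisideNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for x, y in loader:
        logits, h = model(x)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss={loss.item():.3f}  binarity={binarity(h.detach()):.3f}")
# Steadily decreasing loss with binarity approaching 1 would support the claim;
# failure to converge, or collapse to a constant hidden code, would count against it.
```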

Figures

Figures reproduced from arXiv: 2605.11558 by Choeun Kim, Insung Kong, Kwanho Lee, Sehyun Park, Seokhun Park, Yongdai Kim.

Figure 1: Outputs and gradients of HTAF. (a) Output curves of HTAF with fixed α0 = 1.5 and varying α1 ∈ {6.0, 12.0, 18.0, 24.0}. (b) Ratios of gradients relative to the baseline case α0 = α1 = 1.5, obtained by varying α0 with α1 fixed at 1.5 and by varying α1 with α0 fixed at 1.5; blue curves correspond to varying α1, purple curves to varying α0. (c) Ratio of derivatives between HTAF and Scaled Sigmoid…
Figure 2: Plots of the lower bound of α0, and of α1 times the lower bound of α0, as functions of α1 for ϵ = 0.1. Proposition 3.1 presents explicit sufficient conditions on α0 and α1 under which HTAF uniformly approximates the Heaviside function outside an arbitrarily small neighborhood of the origin; in particular, for any prescribed accuracy ϵ ∈ (0, 1/2), choosing α0 and α1 according to the stated b…
Figure 3: Overview of Implicit Concept Bottleneck Models.
Figure 4: Histograms of the proportions for all methods on the ‘Deer’ class. The proposed method exhibits strong concentration near 0 and 1, whereas the baseline does not, indicating that the implicit concepts learned by ICBM are activated in a class-specific manner; additional results for other classes are in Appendix G.3.
Figure 5: Visual analysis of the embedding space. (a) The selected 20 images, showing activation value one (top) and zero (bottom) at the 135th node. (b) An image of Bewick’s Wren that is misclassified…
Figure 6: AUROC and binarity score for various α0 and α1. The accompanying analysis (Appendix F.2) investigates whether gradient vanishing can be alleviated by reducing α0 while keeping α0α1 fixed; the DNN architecture is [256, 256, 128], β0 is fixed at 22, and the candidates for (α0, α1) are (2, 11), (8, 2.75), and (11, 2)…
Figure 7: Results for a DNN with hidden layer sizes (256, 256, 128) using HTAF, ReLU, and Scaled Sigmoid, without batch normalization. Gradient vanishing is observed for Scaled Sigmoid and for HTAF with (8, 2.75) and (11, 2), but not for HTAF with (2, 11) or ReLU, suggesting that adjusting α0 can resolve the vanishing-gradient problem.
Figure 8: Plots of the logarithm of the average L2 norm of layer-1 gradients over epochs for all models.
Figure 9: Accuracies of Llama-2-7b and Qwen-3.5-9b as a function of k. As k increases, overall prediction performance degrades, but HTAF exhibits a smaller drop than STE, suggesting that HTAF is a more stable alternative when replacing activation functions with the Heaviside function in LLMs.
Figure 10: Images of Bewick’s Wren at the 135th node. Bewick’s Wren is a terrestrial perching bird that typically forages in brush, shrubs, woodland edges, bark crevices, and dense vegetation for small insects and other invertebrates, and is hence likely to activate the node; the LLM response names the distinguishing concept as ecomorphological niche, terrestrial woodland foragers vs. aquatic, marine, and aerial specialists…
Figure 11: Histograms of the proportions for all classes.
Figure 12: Histogram of feature-extractor outputs aggregated over all feature dimensions and all samples in the test dataset. With HTAF, most feature-extractor outputs are concentrated near zero or one; this behavior is not observed with ReLU.
read the original abstract

Activation functions play a central role in neural networks by shaping internal representations. Recently, learning binary activation representations has attracted significant attention due to their advantages in computational and memory efficiency, as well as interpretability. However, training neural networks with Heaviside activations remains challenging, as their non-differentiability obstructs standard gradient-based optimization. In this paper, we propose Heavy Tailed Activation Function (HTAF), a smooth approximation to the Heaviside function that enables stable training with gradient-based optimization. We construct HTAF as a sigmoid hyperbolic tangent composite function and theoretically show that it maintains a large gradient mass around zero inputs while exhibiting slower gradient decay in the tail regions. We show that Spiking Neural Networks, Binary Neural Networks and Deep Heaviside neural Networks can be trained stably using HTAF with gradient-based optimization. Finally, we introduce Implicit Concept Bottleneck Models (ICBMs), an interpretable image model that leverages HTAF to induce discrete feature representations. Extensive experiments across various architectures and image datasets demonstrate that ICBM enables stable discretization while achieving prediction performance comparable to or better than standard models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Heavy Tailed Activation Function (HTAF), constructed as a composite of sigmoid and hyperbolic tangent functions to provide a smooth, differentiable approximation to the Heaviside step function. It shows analytically that HTAF preserves substantial gradient mass near zero inputs with slower tail decay than standard activations, enabling stable gradient-based training of Spiking Neural Networks, Binary Neural Networks, and Deep Heaviside Neural Networks. The work further introduces Implicit Concept Bottleneck Models (ICBMs) that leverage HTAF for discrete feature representations and reports experiments on image classification tasks showing competitive performance with improved interpretability.

Significance. If the gradient properties survive multi-layer backpropagation and the experimental gains are causally attributable to HTAF rather than tuning, the method could offer a practical route to training binary-representation networks without specialized optimizers or straight-through estimators, with downstream benefits for memory-efficient and interpretable models such as ICBMs. The explicit construction from standard components and the focus on gradient mass/decay are positive features, but the absence of analytic multi-layer bounds and controlled ablations limits the assessed advance over existing smooth Heaviside approximations.

major comments (3)
  1. [theoretical analysis section] Theoretical analysis (gradient derivation): the per-neuron derivative is shown to maintain large mass near zero and slower tail decay, yet no analytic bound or perturbation argument is supplied demonstrating that these properties are preserved after multiplication by weight matrices and summation across layers under realistic weight distributions; this gap directly undermines the claim that isolated HTAF properties suffice for stable training of SNNs/BNNs/DHNNs.
  2. [experiments section] Experiments section: reported stable convergence and performance on image datasets lack ablation controls that isolate HTAF from architecture-specific choices, learning-rate schedules, or other regularization; without such controls the causal link between the claimed gradient behavior and observed training stability cannot be verified.
  3. [ICBM section] ICBM construction: while HTAF is used to induce discrete representations, the manuscript does not quantify the degree of binarization achieved (e.g., via entropy or fraction of near-binary activations) nor compare against standard concept-bottleneck baselines with post-hoc discretization, weakening the interpretability claim.
minor comments (2)
  1. [method section] Notation for the sigmoid-tanh composite should be defined explicitly with a single numbered equation rather than described in prose.
  2. [figures] Figure captions for gradient plots should include the exact parameter settings used to generate the curves and a direct overlay against ReLU and tanh for visual comparison.
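For reference, the overlay requested in the second minor comment could be generated as below, reusing the assumed HTAF-like form from the sketch above and a scaled sigmoid matched to the same peak slope; the functional form and (α0, α1) values remain assumptions rather than the paper's definition.

```python
# Overlay of gradient magnitudes: assumed HTAF-like form vs. ReLU, tanh, and a
# scaled sigmoid with the same peak slope. Form and parameters are assumptions.
import torch
import matplotlib.pyplot as plt

alpha0, alpha1 = 2.0, 11.0
x = torch.linspace(-6.0, 6.0, 1001, requires_grad=True)

def grad_magnitude(f):
    (g,) = torch.autograd.grad(f(x).sum(), x)
    return g.abs().detach()

curves = {
    "HTAF-like": lambda t: torch.sigmoid(alpha1 * torch.tanh(alpha0 * t)),
    "scaled sigmoid (same peak slope)": lambda t: torch.sigmoid(alpha0 * alpha1 * t),
    "tanh": torch.tanh,
    "ReLU": torch.relu,
}
for name, f in curves.items():
    plt.plot(x.detach().numpy(), grad_magnitude(f).numpy(), label=name)
plt.yscale("log")
plt.xlabel("input")
plt.ylabel("|derivative|")
plt.legend()
plt.savefig("gradient_overlay.png")
```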

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, outlining our planned revisions where appropriate while noting the limits of what can be addressed within the current scope.

read point-by-point responses
  1. Referee: Theoretical analysis (gradient derivation): the per-neuron derivative is shown to maintain large mass near zero and slower tail decay, yet no analytic bound or perturbation argument is supplied demonstrating that these properties are preserved after multiplication by weight matrices and summation across layers under realistic weight distributions; this gap directly undermines the claim that isolated HTAF properties suffice for stable training of SNNs/BNNs/DHNNs.

    Authors: We appreciate the referee's point on the theoretical gap. The manuscript derives the per-neuron gradient properties of HTAF as the core mechanism for maintaining gradient mass near zero with slower tail decay. While these properties are shown to enable stable multi-layer training in practice for SNNs, BNNs, and DHNNs, we acknowledge that a rigorous analytic bound or perturbation argument for arbitrary weight distributions is not provided. Deriving such bounds is non-trivial and would require substantial additional analysis beyond the paper's focus. In revision we will add a first-order perturbation argument for small weight perturbations around typical initializations, together with empirical measurements of gradient flow across layers, to better support the practical claims. revision: partial

  2. Referee: Experiments section: reported stable convergence and performance on image datasets lack ablation controls that isolate HTAF from architecture-specific choices, learning-rate schedules, or other regularization; without such controls the causal link between the claimed gradient behavior and observed training stability cannot be verified.

    Authors: We agree that the experiments would be strengthened by explicit ablations. The current results demonstrate stable convergence and competitive performance, but do not fully isolate HTAF. In the revised manuscript we will add controlled ablation studies that fix architecture, optimizer, learning-rate schedule, and regularization, comparing HTAF only against ReLU, sigmoid, tanh, and other smooth Heaviside approximations. These will directly test whether the observed stability and performance gains are attributable to HTAF's gradient properties. revision: yes

  3. Referee: ICBM construction: while HTAF is used to induce discrete representations, the manuscript does not quantify the degree of binarization achieved (e.g., via entropy or fraction of near-binary activations) nor compare against standard concept-bottleneck baselines with post-hoc discretization, weakening the interpretability claim.

    Authors: We thank the referee for this suggestion. To strengthen the interpretability claims for ICBMs, the revision will include quantitative measures of binarization: specifically, the entropy of the activation distributions and the fraction of activations lying within 0.05 of 0 or 1. We will also add direct comparisons against standard Concept Bottleneck Models that apply post-hoc discretization, reporting both predictive accuracy and the fidelity of the resulting discrete concepts. revision: yes
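The two binarization measures named in the response above, the entropy of the activation distribution and the fraction of activations within 0.05 of 0 or 1, could be computed as in the sketch below; the histogram-based entropy estimator is an illustrative choice and may differ from what the authors implement.

```python
# Sketch of the two binarization measures: fraction of activations within 0.05
# of 0 or 1, and the entropy of a histogram of activations on [0, 1].
import torch

def near_binary_fraction(acts: torch.Tensor, eps: float = 0.05) -> float:
    return ((acts < eps) | (acts > 1.0 - eps)).float().mean().item()

def activation_entropy(acts: torch.Tensor, bins: int = 20) -> float:
    """Shannon entropy (nats) of the binned activation distribution."""
    hist = torch.histc(acts.flatten(), bins=bins, min=0.0, max=1.0)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log()).sum().item()

# Toy HTAF-like activations: strongly bimodal, so a high near-binary fraction
# and low entropy are expected.
acts = torch.sigmoid(11.0 * torch.tanh(2.0 * torch.randn(10_000)))
print(near_binary_fraction(acts), activation_entropy(acts))
```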

standing simulated objections not resolved
  • Full analytic bounds demonstrating preservation of HTAF gradient properties after multiplication by weight matrices and summation across layers under realistic weight distributions
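The unresolved objection asks for analytic bounds, which no code can supply, but the weaker empirical check promised in the rebuttal, measuring gradient flow across layers, is cheap to run: log the L2 norm of each layer's weight gradient during training and look for decay with depth, in the spirit of Figure 8. A minimal sketch under assumed layer sizes:

```python
# Per-layer gradient-norm probe for a deep HTAF-style network.
# Depth, widths, and data are placeholders; only the measurement matters.
import torch
import torch.nn as nn

def htaf_like(x, alpha0=2.0, alpha1=11.0):
    return torch.sigmoid(alpha1 * torch.tanh(alpha0 * x))

layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(6)] + [nn.Linear(256, 10)])

def forward(x):
    for lin in layers[:-1]:
        x = htaf_like(lin(x))
    return layers[-1](x)

x = torch.randn(64, 256)
y = torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(forward(x), y)
loss.backward()

for i, lin in enumerate(layers):
    print(f"layer {i}: weight-grad L2 = {lin.weight.grad.norm().item():.2e}")
# Roughly flat norms across depth would support gradient preservation under this
# initialization; norms shrinking by orders of magnitude toward layer 0 would not.
```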

Circularity Check

0 steps flagged

No circularity: HTAF constructed explicitly and properties derived from definition before any training claims

full rationale

The paper defines HTAF directly as a sigmoid-tanh composite, then derives its gradient mass and tail decay analytically from that closed-form expression. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or reduces the stability claim to the construction itself. Experiments on SNNs/BNNs/DHNNs and ICBMs are presented as downstream validation rather than the source of the gradient properties. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the differentiability of sigmoid and tanh (standard) and the assumption that the composite will behave like a step function in the limit while preserving gradients. No free parameters or new physical entities are introduced in the abstract; HTAF is a mathematical construction rather than an invented physical object.

axioms (1)
  • standard math · Sigmoid and hyperbolic tangent functions are continuously differentiable everywhere
    Required to form the composite HTAF and to compute its gradient analytically.
invented entities (1)
  • HTAF · no independent evidence
    purpose: Smooth differentiable surrogate for the Heaviside step function
    Newly defined composite function whose properties are asserted but not independently validated outside the paper.

pith-pipeline@v0.9.0 · 5504 in / 1386 out tokens · 53957 ms · 2026-05-13T01:59:14.449366+00:00 · methodology


