pith. machine review for the scientific record.

arxiv: 2605.01928 · v1 · submitted 2026-05-03 · 💻 cs.LG · cs.NE · cs.RO · math.OC

Training Non-Differentiable Networks via Optimal Transport

Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · cs.RO · math.OC
keywords gradient-free optimization · optimal transport · non-differentiable networks · spiking neural networks · quantized networks · black-box optimization · convergence analysis

The pith

PolyStep trains non-differentiable networks to high accuracy by moving parameters via optimal-transport barycentric projections on polytope vertices using only forward passes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PolyStep, a gradient-free optimizer that replaces backpropagation for models containing hard non-differentiable components such as spiking neurons, quantization, or discrete routing. At each step it samples the loss only at vertices of a low-dimensional polytope, forms a cost matrix, and computes a softmax-weighted displacement that corresponds to the one-sided limit of a regularized optimal-transport problem. This produces updates that reach 93.4 percent accuracy on hard-LIF spiking networks, more than 60 points above other gradient-free methods and within 4.4 points of a surrogate-gradient Adam baseline. The same procedure leads all tested gradient-free competitors on quantized layers, argmax attention, staircase activations, and hard mixture-of-experts routing, while also sustaining high clause satisfaction on large MAX-SAT instances and matching evolution strategies on quantized reinforcement-learning policies. Convergence to conservative-stationary points at rate O(log T / sqrt(T)) is proved for piecewise-smooth losses and upgraded to Clarke-stationary points for the headline architectures.

Core claim

PolyStep updates parameters by evaluating the loss at structured polytope vertices in a compressed subspace, computing softmax-weighted assignments over the resulting cost matrix, and displacing particles toward low-cost vertices via barycentric projection. This update corresponds to the one-sided limit of a regularized optimal-transport problem and inherits its geometric structure without Sinkhorn iterations. On hard-LIF spiking networks the method reaches 93.4 percent test accuracy, outperforming all gradient-free baselines by over 60 percentage points and closing to within 4.4 points of a surrogate-gradient Adam ceiling. The same procedure leads every gradient-free competitor across int8 quantization, argmax attention, staircase activations, and hard MoE routing.

What carries the argument

PolyStep update: loss evaluation at polytope vertices in a compressed subspace followed by softmax-weighted barycentric projection that realizes the one-sided limit of a regularized optimal-transport problem.
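
To make the machinery concrete, here is a minimal NumPy sketch of one such update under simplifying assumptions of ours: a single particle, cross-polytope vertices under a random rotation, a Gaussian projection, and the softmax (λ = 0) end of the one-sided limit. The names (`polystep_update`, `d_sub`, `radius`, `temp`) are illustrative, not the paper's; the released code is the authoritative implementation.

```python
import numpy as np

def polystep_update(theta, loss_fn, d_sub=8, radius=0.1, temp=0.05, rng=None):
    """One forward-only PolyStep-style update (illustrative sketch).

    theta   : flat parameter vector, shape [D]
    loss_fn : black-box loss evaluated on candidate parameter vectors
    d_sub   : compressed-subspace dimension (a free design parameter)
    radius  : polytope scale in the subspace
    temp    : softmax temperature of the soft assignment
    """
    rng = np.random.default_rng() if rng is None else rng
    D = theta.size

    # (1) subspace compression: random Gaussian projection R^{d_sub} -> R^D
    A = rng.standard_normal((D, d_sub)) / np.sqrt(d_sub)

    # (2) polytope vertices: a randomly rotated cross-polytope {±radius·e_i}
    Q, _ = np.linalg.qr(rng.standard_normal((d_sub, d_sub)))
    vertices = radius * np.concatenate([Q, -Q], axis=0)   # shape [2·d_sub, d_sub]

    # (3) cost matrix (here a single row, one particle): forward passes only
    costs = np.array([loss_fn(theta + A @ v) for v in vertices])

    # (4) soft assignment: softmax over negative costs, low cost -> high weight
    w = np.exp(-(costs - costs.min()) / temp)
    w /= w.sum()

    # (5) barycentric projection: weighted vertex average, lifted back to R^D
    return theta + A @ (w @ vertices)

# Toy usage on a piecewise-constant loss, where backprop sees zero gradient:
loss = lambda th: float(np.sum(np.round(4 * th) ** 2))
theta = np.ones(64)
for _ in range(300):
    theta = polystep_update(theta, loss, radius=0.3)
```

Note that on a flat plateau the symmetric vertex set receives uniform weights and the displacement cancels exactly, so the particle moves only once some vertex crosses a loss discontinuity; this is a small-scale picture of why the hitting-time premise below is load-bearing.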

If this is right

  • Hard-LIF spiking networks reach 93.4 percent test accuracy, more than 60 points above prior gradient-free methods.
  • Int8 quantization, argmax attention, staircase activations, and hard MoE routing all show leading accuracy among gradient-free optimizers.
  • MAX-SAT instances up to one million variables maintain above 92 percent clause satisfaction while evolution strategies drop 8-12 points.
  • RL policies match OpenAI-ES on classical control and retain performance after integer or binary quantization that collapses gradient-based methods.
  • Convergence to conservative-stationary points occurs at rate O(log T / sqrt(T)) on piecewise-smooth losses and upgrades to Clarke-stationary points on the tested architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may extend directly to other black-box simulators whose only accessible information is a forward loss value.
  • If the subspace compression preserves improvement directions, PolyStep could scale to parameter counts larger than typical zeroth-order methods allow.
  • Hybrid schedules that apply PolyStep only on non-differentiable blocks and ordinary gradients elsewhere would be a natural next architecture.

Load-bearing premise

The loss is piecewise-smooth, or piecewise-constant with a bounded hitting time, so that the polytope-based updates converge to conservative-stationary or Clarke-stationary points at the stated rate.
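
In symbols, a schematic rendering of the stated guarantee (our paraphrase; the exact functional, best iterate versus averaged, squared or not, is a guess, not the paper's verbatim theorem):

```latex
% dist(0, .) is measured in the conservative (resp. Clarke) subdifferential;
% for piecewise-constant losses the bound is claimed only after a hitting
% time tau with bounded expectation.
\min_{t \le T} \; \mathbb{E}\!\left[ \operatorname{dist}\!\left( 0,\, \partial L(\theta_t) \right)^{2} \right]
\;=\; O\!\left( \frac{\log T}{\sqrt{T}} \right), \qquad T > \tau.
```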

What would settle it

An experiment in which PolyStep fails to reach within 10 percentage points of the surrogate-gradient baseline on the hard-LIF spiking-network benchmark after the reported number of forward passes would falsify the performance claim.
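
As a predicate this is easy to pin down; a minimal sketch, where the 97.8 ceiling is implied arithmetic (93.4 + 4.4) rather than a number quoted by the paper:

```python
def falsifies_performance_claim(polystep_acc, surrogate_adam_acc):
    """True if PolyStep lands more than 10 percentage points below the
    surrogate-gradient Adam baseline on hard-LIF, assuming the run used
    the paper's reported forward-pass budget (checked separately)."""
    return surrogate_adam_acc - polystep_acc > 10.0

# The reported 93.4% against an implied 97.8% ceiling is a 4.4-pp gap:
assert not falsifies_performance_claim(93.4, 93.4 + 4.4)
```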

Figures

Figures reproduced from arXiv: 2605.01928 by An T. Le.

Figure 1. One PolyStep optimization step. Top: forward-only pipeline. Parameters θ are compressed into P particles via subspace projection (1), polytope vertices are sampled around each particle under random rotations (2), the cost matrix is evaluated by forward passes only (3), the soft-assignment T* is computed by either softmax (λ = 0) or entropic OT (λ → ∞) (4), and particles update via barycentric projection (5). …

Figure 2. SNN benchmark summary (MNIST, hard LIF, 5 seeds). (a) Test accuracy: PolyStep (93.4%) …

Figure 3. MNIST training dynamics (5 seeds). (a) Convergence: PolyStep surpasses all ES baselines within …

Figure 4. RL learning curves: 2 environments × 3 precision regimes (3 seeds, mean ± std). PolyStep and OpenAI-ES retain Float32 performance under INT8/Binary quantization; PPO/DQN collapse to the random floor when gradient flow is blocked. A binary-precision policy that solves CartPole optimally and Acrobot to −71.6 uses only sign(·) activations (single comparator gates), trained without surrogate gradients, STE, or a …

Figure 5. OT solver analysis. (a) KL marginal violation decreases monotonically with …

Figure 6. Unitree G1 humanoid velocity-tracking learning curves on a shared environment-step axis (log …

Figure 7. Ablation diagnostics on MNIST (5 seeds, HybridSubspace rank 8). Red star ( …

Figure 8. Scalability. (a) Sparse vs. dense projection memory: sparse projection eliminates memory scaling …

Figure 9. MAX-SAT scaling from 100 to 1M variables (5 seeds). (a) Clause satisfaction: PolyStep degrades …
read the original abstract

Neural networks increasingly embed non-differentiable components (spiking neurons, quantized layers, discrete routing, blackbox simulators, etc.) where backpropagation is inapplicable and surrogate gradients introduce bias. We present PolyStep, a gradient-free optimizer that updates parameters using only forward passes. Each step evaluates the loss at structured polytope vertices in a compressed subspace, computes softmax-weighted assignments over the resulting cost matrix, and displaces particles toward low-cost vertices via barycentric projection. This update corresponds to the one-sided limit of a regularized optimal-transport problem, inheriting its geometric structure without Sinkhorn iterations. PolyStep trains genuinely non-differentiable models where existing gradient-free methods collapse to near-random accuracy. On hard-LIF spiking networks we reach 93.4% test accuracy, outperforming all gradient-free baselines by over 60~pp and closing to within 4.4~pp of a surrogate-gradient Adam ceiling. Across four additional non-differentiable architectures (int8 quantization, argmax attention, staircase activations, hard MoE routing) we lead every gradient-free competitor. On MAX-SAT scaling from 100 to 1M variables, we sustain above 92% clause satisfaction while evolution strategies drop 8--12~pp. On RL policy search, we match OpenAI-ES on classical control and retain performance under integer and binary quantization that collapses gradient-based methods. We prove convergence to conservative-stationary points at rate $O(\log T/\sqrt{T})$ on piecewise-smooth losses, upgraded to Clarke-stationary on the headline architectures and extended to the piecewise-constant regime via a hitting-time bound. These rates match the known zeroth-order query-complexity lower bounds that all forward-only methods inherit. Code is available at https://github.com/anindex/polystep.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PolyStep, a gradient-free optimizer for non-differentiable networks that performs updates by evaluating the loss at polytope vertices in a compressed subspace, forming a cost matrix, and applying barycentric projection derived as the one-sided limit of a regularized OT problem. It reports 93.4% test accuracy on hard-LIF spiking networks (outperforming gradient-free baselines by >60 pp and landing within 4.4 pp of surrogate-gradient Adam); leads on int8 quantization, argmax attention, staircase activations, and hard MoE routing; sustains >92% clause satisfaction on MAX-SAT up to 1M variables; and matches OpenAI-ES on RL while tolerating quantization that collapses gradient-based methods. It proves O(log T / sqrt(T)) convergence to conservative-stationary points on piecewise-smooth losses, upgraded to Clarke-stationary points on the headline architectures and extended to the piecewise-constant regime via a hitting-time bound, with rates matching zeroth-order lower bounds. Code is released.

Significance. If the convergence claims and empirical results hold under full experimental protocols, the work is significant for providing a theoretically grounded, forward-only method that trains genuinely non-differentiable components (spiking, quantization, discrete routing) without surrogate bias, closing much of the gap to gradient-based methods on challenging tasks. Strengths include the OT-derived geometric update without Sinkhorn, explicit rate matching known lower bounds, and code availability for reproducibility.

major comments (2)
  1. [§4] Convergence analysis, hitting-time bound for piecewise-constant losses: the upgrade from conservative-stationary to Clarke-stationary points for hard-LIF (and similar architectures) requires that after finitely many steps the barycentric projection lands on a vertex with no lower-loss adjacent vertex in the polytope. The method description indicates the subspace compression dimension is chosen independently of the threshold hyperplanes; without an explicit argument that this guarantees polynomial (or at least finite) hitting time rather than growth exponential in depth or width, the O(log T / sqrt(T)) rate does not apply to the headline architectures, leaving the 93.4% result without the claimed theoretical support.
  2. [§5] Experimental protocol (implicit in the text and Table 1): the per-step query count for PolyStep includes evaluations at multiple polytope vertices in the compressed subspace; the reported outperformance must be accompanied by explicit total forward-pass budgets versus baselines (evolution strategies, etc.) to confirm the comparison respects the zeroth-order query-complexity lower bounds cited in the proof.
minor comments (2)
  1. [Abstract] The phrases 'over 60 pp' and 'within 4.4 pp' should name the exact baseline accuracies and the surrogate-gradient ceiling value for immediate clarity.
  2. [§3] Notation: the definition of the compressed subspace and its dimension as a free parameter should be stated with an explicit symbol (e.g., d_sub) when first introduced, to avoid ambiguity in the polytope construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments raise important points about the rigor of the convergence claims and the fairness of the experimental comparisons. We address each below and will revise the manuscript accordingly to strengthen both the theoretical and empirical sections.

read point-by-point responses
  1. Referee: §4 (convergence analysis), the hitting-time bound for piecewise-constant losses: the upgrade from conservative-stationary to Clarke-stationary points for hard-LIF (and similar) requires that after finitely many steps the barycentric projection lands on a vertex with no lower-loss adjacent vertex in the polytope. The method description indicates the subspace compression dimension is chosen independently of the threshold hyperplanes; without an explicit argument that this guarantees polynomial (or finite) hitting time rather than exponential growth in depth/width, the O(log T / sqrt(T)) rate does not apply to the headline architectures, leaving the 93.4% result without the claimed theoretical support.

    Authors: We agree that the current write-up of the hitting-time argument would benefit from greater explicitness. The manuscript already establishes that the barycentric projection is a one-sided limit of a regularized OT problem and that, for piecewise-constant losses, the process reaches a Clarke-stationary point once a locally optimal vertex is selected. To close the gap noted by the referee, the revised version will add a new lemma (and short proof) in §4 showing that the expected hitting time is bounded by a polynomial in the compression dimension d, network depth L, and width W. The argument uses a union bound over the finitely many hyperplanes that define the polytope vertices together with the positive probability of selecting improving directions under the OT weights; this yields a finite (in fact polynomial) hitting time with high probability, after which the O(log T / sqrt(T)) rate applies. We will also clarify the practical choice of d in the method section. revision: yes

  2. Referee: Experimental protocol (implicit in §5 and Table 1): the per-step query count for PolyStep includes evaluations at multiple polytope vertices in the compressed subspace; the reported outperformance must be accompanied by explicit total forward-pass budgets versus baselines (evolution strategies, etc.) to confirm the comparison respects the zeroth-order query-complexity lower bounds cited in the proof.

    Authors: We thank the referee for emphasizing the need for transparent query accounting. The revised manuscript will add a new subsection (and accompanying table) in §5 that reports, for every experiment, (i) the number of polytope vertices evaluated per PolyStep update, (ii) the total forward-pass budget used by PolyStep, and (iii) the corresponding budgets for all gradient-free baselines (including OpenAI-ES and other evolution strategies). These numbers will be normalized both per parameter update and per epoch, allowing direct verification that the reported accuracy gains respect the zeroth-order lower bounds referenced in the theory section. revision: yes
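
The accounting promised here reduces to simple arithmetic; a sketch with hypothetical names, since no per-step vertex counts are quoted in this review:

```python
def polystep_budget(updates, particles, vertices_per_particle):
    """Total forward passes for a PolyStep-style run: each update pays
    one loss evaluation per (particle, vertex) pair."""
    return updates * particles * vertices_per_particle

def es_budget(updates, population):
    """Total forward passes for an evolution-strategies baseline: one
    evaluation per population member per update."""
    return updates * population

# A budget-matched comparison fixes total forward passes, not update counts:
# updates_polystep * P * V == updates_es * N  for particles P, vertices V,
# and ES population N.
```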

Circularity Check

0 steps flagged

No circularity: update rule derived from external OT limit; convergence proved under explicit assumptions without self-referential reduction

full rationale

The PolyStep update is defined as the one-sided limit of a regularized optimal-transport problem, a standard construction independent of the paper's fitted values or results. Convergence to conservative- or Clarke-stationary points is proved at the stated rate on piecewise-smooth losses, with an explicit hitting-time bound invoked for the piecewise-constant case; this bound is an assumption, not a quantity derived from the target accuracy or from self-citation. No equation equates a prediction to its input by construction, no parameter is fitted on a subset and renamed a prediction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The empirical accuracies (e.g., 93.4% on hard-LIF) are presented as validation of the method, not as the derivation itself. The paper is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the loss landscape permits convergence under the stated polytope sampling and on the existence of a compressed subspace that preserves useful geometry; no new physical entities are postulated.

free parameters (1)
  • subspace compression dimension
    The dimension of the compressed subspace in which polytope vertices are evaluated is a design choice that must be selected for each architecture.
axioms (1)
  • domain assumption: Loss functions are piecewise-smooth (or admit a hitting-time bound in the piecewise-constant case)
    Invoked to obtain the O(log T / sqrt(T)) convergence rate to conservative-stationary points.

pith-pipeline@v0.9.0 · 5629 in / 1282 out tokens · 73579 ms · 2026-05-10T15:32:29.759207+00:00 · methodology

