pith. machine review for the scientific record.

arxiv: 2605.08657 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Fitting Multilinear Polynomials for Logic Gate Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords logic gate networks · multilinear polynomials · vector quantization · straight-through estimator · gradient starvation · combinational circuits · deep networks · parameter reduction

The pith

Logic gate networks learn by fitting four-coefficient multilinear polynomials per neuron instead of selecting from a sixteen-gate codebook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that each 2-input Boolean gate corresponds to a unique multilinear polynomial in four dimensions, turning the choice of gate into a vector-quantization task inside that low-dimensional space. The baseline softmax method over the sixteen gates produces null gradients along most directions because the codebook is rank four, and the paper proves that no affine reparameterization rescues the interaction coefficients under straight-through estimation. Replacing the selection with the covariance Jacobian of soft vector quantization couples those starved coefficients to the always-active constant channel, restoring usable gradients. This change cuts parameters from sixteen to four per neuron and yields stable performance as depth grows, whereas the baseline collapses.
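The algebraic fact underneath this is easy to check: every 2-input Boolean gate has a unique multilinear representation g(a,b) = c0 + ca·a + cb·b + cab·ab. A minimal sketch, using our own enumeration convention for the truth-table index (not the paper's code):

```python
import itertools

# Recover the unique multilinear coefficients (c0, ca, cb, cab) of each
# 2-input Boolean gate, indexed j = 0..15 by its truth-table bit pattern.
coeffs = {}
for j in range(16):
    # truth-table entries g(a, b), stored at bit position 2*a + b
    g00, g01, g10, g11 = ((j >> k) & 1 for k in range(4))
    coeffs[j] = (
        g00,                        # c0  = g(0,0)
        g10 - g00,                  # ca  = g(1,0) - g(0,0)
        g01 - g00,                  # cb  = g(0,1) - g(0,0)
        g11 - g10 - g01 + g00,      # cab = g(1,1) - g(1,0) - g(0,1) + g(0,0)
    )

# All 16 coefficient vectors are distinct, and each polynomial reproduces
# its gate's truth table exactly on {0,1}^2.
assert len(set(coeffs.values())) == 16
for j, (c0, ca, cb, cab) in coeffs.items():
    for a, b in itertools.product((0, 1), repeat=2):
        assert c0 + ca*a + cb*b + cab*a*b == (j >> (2*a + b)) & 1
```

Under this indexing, AND (j=8, only g(1,1)=1) maps to (0, 0, 0, 1) and XOR (j=6) maps to (0, 1, 1, −2).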

Core claim

Working directly in the four-dimensional space of multilinear polynomials that span the sixteen possible 2-input Boolean gates reduces each logic neuron to four parameters. The covariance Jacobian of soft-VQ selection supplies gradients to the interaction coefficients by coupling them to the constant term, bypassing the starvation that occurs under STE with softmax or affine reparameterizations. On seven datasets this four-parameter formulation matches or exceeds the sixteen-parameter baseline, with the performance gap widening monotonically with interaction demand and remaining stable at twelve layers while the baseline drops sharply.

What carries the argument

The covariance Jacobian of soft-VQ selection in the four-dimensional multilinear polynomial space, which couples interaction coefficients to the constant channel to restore gradients.
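As we read it, the mechanism can be sketched with plain soft vector quantization: selection weights are a softmax over negative squared distances to the 16 prototypes, and the Jacobian of the soft selection is (2/τ) times the weighted covariance of the prototypes. The temperature, query point, and codebook construction below are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def gate_codebook():
    """16 gate prototypes as rows (c0, ca, cb, cab) in multilinear space."""
    E = []
    for j in range(16):
        g00, g01, g10, g11 = ((j >> k) & 1 for k in range(4))
        E.append([g00, g10 - g00, g01 - g00, g11 - g10 - g01 + g00])
    return np.array(E, dtype=float)

def soft_select(c, E, tau):
    """Soft-VQ: softmax over negative squared distances to the prototypes."""
    d = -np.sum((c - E) ** 2, axis=1) / tau
    w = np.exp(d - d.max())
    w /= w.sum()
    return E.T @ w, w                      # soft prototype chat, weights

E = gate_codebook()
tau = 10.0
c = np.array([0.5, 0.0, 0.0, 0.0])         # a mildly committed query point
chat, w = soft_select(c, E, tau)

# Analytic Jacobian d(chat)/dc = (2/tau) * Cov_w(e): the weighted second
# moment of the prototypes minus the outer product of their weighted mean.
J = (2.0 / tau) * (E.T @ (w[:, None] * E) - np.outer(chat, chat))

# Finite-difference check of the closed form.
eps = 1e-6
J_fd = np.zeros((4, 4))
for k in range(4):
    cp = c.copy()
    cp[k] += eps
    J_fd[:, k] = (soft_select(cp, E, tau)[0] - chat) / eps
assert np.allclose(J, J_fd, atol=1e-4)

# The (cab, c0) entry is strictly positive: a gradient arriving on the
# always-active constant channel flows into the interaction coefficient.
assert J[3, 0] > 0.01
```

The coupling survives even at a near-uniform weight distribution, because the constant and interaction coordinates of the 16-gate codebook are correlated by construction; that is the sense in which the constant channel can carry gradient to the otherwise starved c_ab.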

If this is right

  • Each neuron requires only four learnable parameters rather than sixteen.
  • The performance advantage over softmax selection increases with the interaction demand of the task.
  • Accuracy remains nearly constant when depth rises from shallow to twelve layers on CIFAR-10 and MNIST.
  • No affine product reparameterization can restore the missing interaction-coefficient gradients under straight-through estimation.
  • At least one four-parameter variant matches or exceeds the baseline across all seven evaluated datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-rank codebook structure may appear in other discrete selection problems inside neural networks, suggesting the covariance-Jacobian fix could transfer.
  • Explicit four-parameter gate representations could enable direct mapping to hardware logic circuits without additional discretization steps.
  • Testing on networks deeper than twelve layers or on tasks with still higher interaction complexity would reveal whether the gradient coupling scales without bound.

Load-bearing premise

The observed rank-four structure of the gate codebook together with the coupling supplied by the covariance Jacobian will keep providing usable gradients for arbitrary interaction demands and network depths.

What would settle it

A dataset or depth at which the covariance-Jacobian method produces vanishing gradients or collapses in accuracy while the interaction structure remains unchanged.

Figures

Figures reproduced from arXiv: 2605.08657 by Youngsung Kim.

Figure 1. Both methods differ in how the coefficient vector …
Figure 2. Accuracy vs. depth (3 seeds). Soft-Mix collapses on both MNIST and CIFAR-10; CovJac holds (−0.5 pp on CIFAR-10 at L=12).
Figure 3. CIFAR-10 (k=128k, L=6, 3 seeds). (a) Accuracy over training. (b) c_ab gradient ratio: CovJac reaches 14×, STE stays at the theoretical 0.25. (c) Commitment (std(c_ab)): STE commits fast, CovJac slowly, Soft-Mix never. (d) Commitment vs. accuracy: STE commits to wrong gates and stalls; CovJac learns first, then commits; Soft-Mix improves without committing.
Figure 4. Signal survival s_l per layer on MNIST (L=12, 3 seeds). Soft-Mix (red): s_l ≈ 0.3 (71% cancelled), confirming Proposition 2. Ours (blue, teal): s_l ≈ 1.0 (no cancellation).
Figure 6. CovJac−STE gap grows monotonically with interaction demand across 7 datasets. c_ab starvation (25% coverage) hurts more as tasks require more interactive gates.
Figure 7. The 16-gate codebook in (c_a, c_b, c_ab) space. Blue (c_0 = 0): FALSE, A, B, AND, OR, XOR and variants. Red (c_0 = 1): TRUE, ¬A, ¬B, NAND, NOR, XNOR and variants. Lines connect complementary pairs (g ↔ 1−g). Shape indicates |c_ab|: ◦ separable, □ weak, ⋄ strong interaction.
Figure 8. Snapped gate assignments on CIFAR-10 (k=128k, L=6). Bubble size ∝ fraction of neurons selecting each gate. CovJac selects more XOR/XNOR (strong-interaction) gates; Soft-Mix clusters near separable gates.
Original abstract

We study learnable logic gate networks that stack layers of 2-input Boolean gates to build combinational circuits. Every 2-input gate has a unique multilinear polynomial with 4 coefficients, so the 16 Boolean gates form a codebook of prototypes in a 4-dimensional space, reducing training to a vector-quantization problem. The baseline method, Soft-Mix, learns a 16-dimensional softmax over gate identities, but the codebook has rank 4: 11 of 15 simplex directions carry nullspace gradient, and at uniform initialization the backward signal vanishes exactly. We prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE, and show that the covariance Jacobian of soft-VQ selection bypasses it by coupling the starved coefficient to the always-active constant channel. Working in the 4-dimensional polynomial space reduces each neuron from 16 to 4 parameters. On seven datasets, at least one 4-parameter method matches or exceeds Soft-Mix on every dataset; the CovJac advantage over STE grows monotonically with interaction demand across all seven datasets. At depth, Soft-Mix collapses (-37.3 pp on CIFAR-10 at 12 layers) while CovJac holds (-0.5 pp on CIFAR-10, stable on MNIST).
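The rank claim in the abstract is a quick linear-algebra check: build the 16×4 coefficient matrix, confirm it has rank 4, and count how many of the 15 simplex tangent directions map to zero motion in coefficient space. A sketch under our own codebook construction:

```python
import numpy as np

# 16 x 4 codebook: rows are (c0, ca, cb, cab) per gate, with the
# truth-table entry g(a, b) stored at bit position 2*a + b of j.
E = np.array([
    [(j >> 0) & 1,
     ((j >> 2) & 1) - ((j >> 0) & 1),
     ((j >> 1) & 1) - ((j >> 0) & 1),
     ((j >> 3) & 1) - ((j >> 2) & 1) - ((j >> 1) & 1) + ((j >> 0) & 1)]
    for j in range(16)
], dtype=float)
assert np.linalg.matrix_rank(E) == 4      # the codebook is rank 4

# Tangent directions of the 15-simplex: 16-vectors summing to zero.
# Basis d_i = e_i - e_15, stacked as a (15, 16) matrix.
D = np.hstack([np.eye(15), -np.ones((15, 1))])

# A simplex direction d moves the mixed polynomial by d @ E; the rank
# of that map over the tangent space is 4, so 15 - 4 = 11 independent
# directions produce no motion in coefficient space.
r = np.linalg.matrix_rank(D @ E)
assert r == 4
assert 15 - r == 11                       # matches "11 of 15" above
```

This only verifies the counting argument, not the dynamical starvation claim, which additionally depends on the STE backward path.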

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper studies learnable logic gate networks by representing 2-input Boolean gates as multilinear polynomials, forming a 4-dimensional codebook that reduces parameters per neuron to 4. It identifies that the Soft-Mix baseline suffers from nullspace gradients under the straight-through estimator due to the rank-4 structure of the codebook, proves that no affine product reparameterization can resolve the resulting interaction-coefficient starvation, and proposes using the covariance Jacobian from soft vector quantization to couple the starved coefficients to the constant channel. Experiments on seven datasets demonstrate that 4-parameter methods are competitive with or better than Soft-Mix, with the CovJac method showing increasing advantage as interaction demand grows and maintaining performance at greater network depths where Soft-Mix collapses.

Significance. The work provides a theoretically motivated approach to improving gradient flow in logic gate networks through polynomial reparameterization and Jacobian-based selection. If the proof holds and the empirical findings are confirmed with additional controls, it could enable more efficient and deeper combinational circuit learning models, with clear benefits in parameter efficiency and depth scalability on tasks like image classification.

major comments (3)
  1. Abstract: the claim that 'we prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE' is load-bearing for the motivation of CovJac, yet the full derivation is absent from the provided text and must be supplied with all intermediate steps to allow verification.
  2. Empirical results: the reported consistent gains across seven datasets and the depth-dependent divergence (e.g., -37.3pp collapse for Soft-Mix on CIFAR-10) lack error bars, run counts, and ablations isolating the covariance Jacobian contribution, which are required to substantiate robustness beyond the tested interaction demands.
  3. Method description: the covariance Jacobian bypass mechanism (coupling starved coefficients to the constant channel) is asserted but requires explicit matrix derivation and Jacobian form to confirm it indeed supplies usable gradients where STE nullspace gradients vanish.
minor comments (2)
  1. Abstract: the statement that 'at least one 4-parameter method matches or exceeds Soft-Mix on every dataset' should name the specific methods and include the corresponding quantitative margins for the depth and interaction-demand claims.
  2. Notation: define STE, VQ, and CovJac at first use and clarify the exact mapping from the 4D polynomial coefficients to the 16 gate prototypes with an example equation.
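For reference, the mapping the referee asks for is the standard multilinear (Möbius) interpolation of the truth table; the notation below follows the paper's appendix, while the worked examples are ours:

```latex
g_j(a,b) = c_{0,j} + c_{a,j}\,a + c_{b,j}\,b + c_{ab,j}\,ab,
\qquad
\begin{aligned}
c_{0,j}  &= g_j(0,0), &
c_{a,j}  &= g_j(1,0) - g_j(0,0),\\
c_{b,j}  &= g_j(0,1) - g_j(0,0), &
c_{ab,j} &= g_j(1,1) - g_j(1,0) - g_j(0,1) + g_j(0,0).
\end{aligned}
```

For example, AND maps to $(0,0,0,1)$, NAND to $(1,0,0,-1)$, and XOR to $(0,1,1,-2)$.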

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each of the major comments below and commit to revising the manuscript to incorporate the requested clarifications and additions.

Point-by-point responses
  1. Referee: Abstract: the claim that 'we prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE' is load-bearing for the motivation of CovJac, yet the full derivation is absent from the provided text and must be supplied with all intermediate steps to allow verification.

    Authors: We recognize that the full derivation of the proof was not provided with sufficient detail in the manuscript text. We will supply the complete proof, including all intermediate steps, in the revised version to allow full verification of the claim. revision: yes

  2. Referee: Empirical results: the reported consistent gains across seven datasets and the depth-dependent divergence (e.g., -37.3pp collapse for Soft-Mix on CIFAR-10) lack error bars, run counts, and ablations isolating the covariance Jacobian contribution, which are required to substantiate robustness beyond the tested interaction demands.

    Authors: We agree that error bars, run counts, and isolating ablations are important for substantiating the results. We will add these to the revised manuscript, including standard deviations from multiple runs and ablations comparing the covariance Jacobian contribution. revision: yes

  3. Referee: Method description: the covariance Jacobian bypass mechanism (coupling starved coefficients to the constant channel) is asserted but requires explicit matrix derivation and Jacobian form to confirm it indeed supplies usable gradients where STE nullspace gradients vanish.

    Authors: We agree that an explicit derivation is needed. In the revision, we will include the detailed matrix derivation of the covariance Jacobian and explain its form to show how it supplies gradients to the starved coefficients. revision: yes

Circularity Check

0 steps flagged

No significant circularity: mathematical derivations and reparameterizations are self-contained

Full rationale

The paper's core chain consists of (1) representing 2-input gates as unique multilinear polynomials (a standard algebraic fact, not derived from data), (2) observing the rank-4 structure of the resulting 16-point codebook (direct linear-algebra consequence of the 4-coefficient basis), (3) a stated proof that no affine product reparameterization resolves STE nullspace gradients, and (4) derivation of the covariance Jacobian from the soft-VQ selection rule itself, which couples coefficients to the constant term by construction of the polynomial representation. None of these steps reduces a claimed prediction or advantage to a fitted parameter or to a self-citation whose content is unverified. The experimental results on seven datasets are presented as empirical validation of the bypass mechanism rather than as the source of the mechanism. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the method appears to rest on the algebraic structure of multilinear polynomials over Boolean inputs and the rank deficiency of the 16-gate codebook.

pith-pipeline@v0.9.0 · 5523 in / 1064 out tokens · 55538 ms · 2026-05-12T01:55:50.400593+00:00 · methodology


    Hence commitment proceeds greedily from the output layer toward the input layer. At sufficient depth, early layers may never commit within a finite training budget. Assumptions used.(a) No residual/skip connections; (b) near-uniform initialization of gate logits; (c) the analysis is dynamical, not a static worst-case bound; (d) the gradient path of intere...