Fitting Multilinear Polynomials for Logic Gate Networks
Pith reviewed 2026-05-12 01:55 UTC · model grok-4.3
The pith
Logic gate networks learn by fitting four-coefficient multilinear polynomials per neuron instead of selecting from a sixteen-gate codebook.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Working directly in the four-dimensional space of multilinear polynomials that contains all sixteen possible 2-input Boolean gates reduces each logic neuron to four parameters. The covariance Jacobian of soft-VQ selection supplies gradient to the interaction coefficient by coupling it to the constant term, bypassing the starvation that occurs under STE with softmax or affine reparameterizations. On seven datasets, at least one four-parameter variant matches or exceeds the sixteen-parameter baseline; the advantage over STE widens monotonically with interaction demand, and accuracy stays essentially flat at twelve layers while the baseline drops sharply.
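To make the four-coefficient representation concrete, here is a minimal sketch (not the paper's code; the gate indexing convention and names are assumptions) that solves a 4×4 linear system per gate to get each gate's multilinear coefficients and checks that the resulting 16×4 codebook has rank 4.

```python
# Minimal sketch (not the paper's code): each 2-input gate g(a, b) has a unique
# multilinear form c0 + ca*a + cb*b + cab*a*b; solve for the 4 coefficients of
# all 16 gates and check the rank of the resulting 16 x 4 codebook.
import numpy as np
from itertools import product

inputs = np.array(list(product([0, 1], repeat=2)))           # (0,0),(0,1),(1,0),(1,1)
M = np.stack([np.ones(4), inputs[:, 0], inputs[:, 1],
              inputs[:, 0] * inputs[:, 1]], axis=1)          # columns [1, a, b, ab]

# Gate j is indexed by its 4-bit truth table (bit ordering is an assumed convention).
truth_tables = np.array([[(j >> k) & 1 for k in range(4)] for j in range(16)])
codebook = np.linalg.solve(M, truth_tables.T).T              # 16 x 4 coefficient matrix

print(np.linalg.matrix_rank(codebook))   # -> 4
print(codebook[0b1000])                  # AND under this ordering: [0. 0. 0. 1.]
```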
What carries the argument
The covariance Jacobian of soft-VQ selection in the four-dimensional multilinear polynomial space, which couples interaction coefficients to the constant channel to restore gradients.
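The excerpt does not spell out the Jacobian's form, so the following is a reconstruction under an assumed soft-VQ rule (a distance softmax over the 16 prototype coefficient vectors), not the paper's stated derivation: differentiating the selected mixture with respect to the 4-dimensional query yields a prototype covariance, which mixes all four coefficient channels.

```latex
% Reconstruction under assumed soft-VQ weights p_j(c) \propto e^{-\lVert c - c_j\rVert^2/\tau}
% over the 16 prototype coefficient vectors c_j \in \mathbb{R}^4:
\hat{c}(c) = \sum_{j=0}^{15} p_j(c)\, c_j,
\qquad
\frac{\partial \hat{c}}{\partial c}
  = \frac{2}{\tau}\Bigl(\sum_j p_j\, c_j c_j^{\top} - \hat{c}\,\hat{c}^{\top}\Bigr)
  = \frac{2}{\tau}\,\operatorname{Cov}_{p}\!\bigl[c_j\bigr].
```

In this form the backward pass multiplies the incoming gradient by a 4×4 prototype covariance whose (constant, ab) entry is nonzero even at uniform weights (it works out to 1/4 in standard truth-table coordinates), so a loss signal arriving on the always-active constant channel also moves the interaction coefficient; this is one way the described coupling can arise.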
If this is right
- Each neuron requires only four learnable parameters rather than sixteen.
- The performance advantage over softmax selection increases with the interaction demand of the task.
- Accuracy remains nearly constant when depth rises from shallow to twelve layers on CIFAR-10 and MNIST.
- No affine product reparameterization can restore the missing interaction-coefficient gradients under straight-through estimation.
- At least one four-parameter variant matches or exceeds the baseline across all seven evaluated datasets.
Where Pith is reading between the lines
- The same low-rank codebook structure may appear in other discrete selection problems inside neural networks, suggesting the covariance-Jacobian fix could transfer.
- Explicit four-parameter gate representations could enable direct mapping to hardware logic circuits without additional discretization steps.
- Testing on networks deeper than twelve layers or on tasks with still higher interaction complexity would reveal whether the gradient coupling scales without bound.
Load-bearing premise
The observed rank-four structure of the gate codebook together with the coupling supplied by the covariance Jacobian will keep providing usable gradients for arbitrary interaction demands and network depths.
What would settle it
A dataset or depth at which the covariance-Jacobian method produces vanishing gradients or collapses in accuracy while the interaction structure remains unchanged.
original abstract
We study learnable logic gate networks that stack layers of 2-input Boolean gates to build combinational circuits. Every 2-input gate has a unique multilinear polynomial with 4 coefficients, so the 16 Boolean gates form a codebook of prototypes in a 4-dimensional space, reducing training to a vector-quantization problem. The baseline method, Soft-Mix, learns a 16-dimensional softmax over gate identities, but the codebook has rank 4: 11 of 15 simplex directions carry nullspace gradient, and at uniform initialization the backward signal vanishes exactly. We prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE, and show that the covariance Jacobian of soft-VQ selection bypasses it by coupling the starved coefficient to the always-active constant channel. Working in the 4-dimensional polynomial space reduces each neuron from 16 to 4 parameters. On seven datasets, at least one 4-parameter method matches or exceeds Soft-Mix on every dataset; the CovJac advantage over STE grows monotonically with interaction demand across all seven datasets. At depth, Soft-Mix collapses (-37.3pp on CIFAR-10 at 12 layers) while CovJac holds (-0.5pp on CIFAR-10, stable on MNIST).
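The abstract's "11 of 15 simplex directions carry nullspace gradient" count can be checked numerically. The sketch below is my own construction (with an assumed truth-table indexing, not taken from the paper): if Soft-Mix produces effective coefficients c(pi) = C^T pi for simplex weights pi over the 16 gates, then the map from simplex tangent directions to coefficients has rank 4, leaving 11 of the 15 tangent directions with no effect on the coefficients.

```python
# Numeric check (my sketch, not the paper's code) of the nullspace count:
# simplex tangent directions in the kernel of C^T move no coefficient at all.
import numpy as np
from itertools import product

inputs = np.array(list(product([0, 1], repeat=2)))
M = np.stack([np.ones(4), inputs[:, 0], inputs[:, 1],
              inputs[:, 0] * inputs[:, 1]], axis=1)
truth_tables = np.array([[(j >> k) & 1 for k in range(4)] for j in range(16)])
C = np.linalg.solve(M, truth_tables.T).T                      # 16 x 4 codebook

tangent = (np.eye(16)[1:] - np.eye(16)[0]).T                  # 16 x 15 simplex tangent basis
rank = np.linalg.matrix_rank(C.T @ tangent)
print(rank, 15 - rank)                                        # -> 4 11
```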
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper studies learnable logic gate networks by representing 2-input Boolean gates as multilinear polynomials, forming a 4-dimensional codebook that reduces parameters per neuron to 4. It identifies that the Soft-Mix baseline suffers from nullspace gradients under the straight-through estimator due to the rank-4 structure of the codebook, proves that no affine product reparameterization can resolve the resulting interaction-coefficient starvation, and proposes using the covariance Jacobian from soft vector quantization to couple the starved coefficients to the constant channel. Experiments on seven datasets demonstrate that 4-parameter methods are competitive with or better than Soft-Mix, with the CovJac method showing increasing advantage as interaction demand grows and maintaining performance at greater network depths where Soft-Mix collapses.
Significance. The work provides a theoretically motivated approach to improving gradient flow in logic gate networks through polynomial reparameterization and Jacobian-based selection. If the proof holds and the empirical findings are confirmed with additional controls, it could enable more efficient and deeper combinational circuit learning models, with clear benefits in parameter efficiency and depth scalability on tasks like image classification.
major comments (3)
- Abstract: the claim that 'we prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE' is load-bearing for the motivation of CovJac, yet the full derivation is absent from the provided text and must be supplied with all intermediate steps to allow verification.
- Empirical results: the reported consistent gains across seven datasets and the depth-dependent divergence (e.g., -37.3pp collapse for Soft-Mix on CIFAR-10) lack error bars, run counts, and ablations isolating the covariance Jacobian contribution, which are required to substantiate robustness beyond the tested interaction demands.
- Method description: the covariance Jacobian bypass mechanism (coupling starved coefficients to the constant channel) is asserted but requires explicit matrix derivation and Jacobian form to confirm it indeed supplies usable gradients where STE nullspace gradients vanish.
minor comments (2)
- Abstract: the statement that 'at least one 4-parameter method matches or exceeds Soft-Mix on every dataset' should name the specific methods and include the corresponding quantitative margins for the depth and interaction-demand claims.
- Notation: define STE, VQ, and CovJac at first use and clarify the exact mapping from the 4D polynomial coefficients to the 16 gate prototypes with an example equation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each of the major comments below and commit to revising the manuscript to incorporate the requested clarifications and additions.
point-by-point responses
-
Referee: Abstract: the claim that 'we prove that no affine product reparameterization fixes the resulting interaction-coefficient starvation under STE' is load-bearing for the motivation of CovJac, yet the full derivation is absent from the provided text and must be supplied with all intermediate steps to allow verification.
Authors: We recognize that the full derivation of the proof was not provided with sufficient detail in the manuscript text. We will supply the complete proof, including all intermediate steps, in the revised version to allow full verification of the claim. revision: yes
-
Referee: Empirical results: the reported consistent gains across seven datasets and the depth-dependent divergence (e.g., -37.3pp collapse for Soft-Mix on CIFAR-10) lack error bars, run counts, and ablations isolating the covariance Jacobian contribution, which are required to substantiate robustness beyond the tested interaction demands.
Authors: We agree that error bars, run counts, and isolating ablations are important for substantiating the results. We will add these to the revised manuscript, including standard deviations from multiple runs and ablations comparing the covariance Jacobian contribution. revision: yes
-
Referee: Method description: the covariance Jacobian bypass mechanism (coupling starved coefficients to the constant channel) is asserted but requires explicit matrix derivation and Jacobian form to confirm it indeed supplies usable gradients where STE nullspace gradients vanish.
Authors: We agree that an explicit derivation is needed. In the revision, we will include the detailed matrix derivation of the covariance Jacobian and explain its form to show how it supplies gradients to the starved coefficients. revision: yes
Circularity Check
No significant circularity: mathematical derivations and reparameterizations are self-contained
full rationale
The paper's core chain consists of (1) representing 2-input gates as unique multilinear polynomials (a standard algebraic fact, not derived from data), (2) observing the rank-4 structure of the resulting 16-point codebook (direct linear-algebra consequence of the 4-coefficient basis), (3) a stated proof that no affine product reparameterization resolves STE nullspace gradients, and (4) derivation of the covariance Jacobian from the soft-VQ selection rule itself, which couples coefficients to the constant term by construction of the polynomial representation. None of these steps reduces a claimed prediction or advantage to a fitted parameter or to a self-citation whose content is unverified. The experimental results on seven datasets are presented as empirical validation of the bypass mechanism rather than as the source of the mechanism. No load-bearing step matches any of the enumerated circularity patterns.
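For step (1), a few textbook identities make the four-coefficient representation concrete (standard facts about multilinear extensions of Boolean functions, not quoted from the paper):

```latex
% Unique multilinear forms g(a,b) = c_0 + c_a a + c_b b + c_{ab}\, ab on \{0,1\}^2:
\mathrm{AND}(a,b) = ab, \qquad
\mathrm{OR}(a,b) = a + b - ab, \qquad
\mathrm{XOR}(a,b) = a + b - 2ab, \qquad
\mathrm{NAND}(a,b) = 1 - ab .
```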
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Felix Petersen, Christian Borgelt, Hilde Kuehne, and Oliver Deussen. Deep differentiable logic gate networks. In Advances in Neural Information Processing Systems, volume 35, 2022.
- [2] Felix Petersen, Hilde Kuehne, Christian Borgelt, Julian Welzel, and Stefano Ermon. Convolutional differentiable logic gate networks. In Advances in Neural Information Processing Systems, 2024. arXiv:2411.04732.
- [3] Marta Andronic and George A. Constantinides. PolyLUT: Learning piecewise polynomials for ultra-low latency FPGA LUT-based inference. In International Conference on Field Programmable Technology (ICFPT), pages 60–68, 2023.
- [4] Alan T. L. Bacellar, Zachary Susskind, Mauricio Breternitz Jr, Eugene John, Lizy Kurian John, Priscila Machado Vieira Lima, and Felipe M. G. França. Differentiable weightless neural networks. In International Conference on Machine Learning (ICML), 2024. Differentiable training of k-input weightless lookup tables on FPGA-targeted classifiers.
- [5] Stasys Jukna. Boolean Function Complexity: Advances and Frontiers, volume 27 of Algorithms and Combinatorics. Springer, 2012. doi: 10.1007/978-3-642-24508-4.
- [6] Peter L. Hammer and Sergiu Rudeanu. Boolean Methods in Operations Research and Related Areas. Springer, Berlin, Heidelberg, 1968. doi: 10.1007/978-3-642-85823-9.
- [7] Guillermo Owen. Multilinear extensions of games. Management Science, 18(5-Part-2):P64–P79. doi: 10.1287/mnsc.18.5.64.
- [9] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
- [10] Youngsung Kim. Align forward, adapt backward: Closing the discretization gap in logic gate networks. arXiv preprint arXiv:2603.14157, 2026.
- [11] Youngsung Kim. Towards narrowing the generalization gap in deep boolean networks. arXiv preprint arXiv:2409.05905, 2024.
- [12] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
- [13] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
- [14] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013.
- [15] Youngsung Kim. Deep stochastic logic gate networks. IEEE Access, 11:122488–122501, 2023. doi: 10.1109/ACCESS.2023.3328622.
- [16] Shakir Yousefi, Andreas Plesner, Till Aczel, and Roger Wattenhofer. Mind the gap: Removing the discretization gap in differentiable logic gate networks. In Advances in Neural Information Processing Systems, 2025. arXiv:2506.07500.
- [17] Lukas Rüttgers, Till Aczel, Andreas Plesner, and Roger Wattenhofer. Light differentiable logic gate networks. In International Conference on Learning Representations, 2026. arXiv:2510.03250.
- [18] Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, Cambridge, UK. doi: 10.1017/CBO9781139814782.
- [20] Ilya M. Sobol’. Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments, 1(4):407–414, 1993.
- [21] Art B. Owen. Variance components and generalized Sobol’ indices. SIAM/ASA Journal on Uncertainty Quantification, 1(1):19–41, 2013. doi: 10.1137/120876782.
- [22] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [23] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [24] Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks. In International Conference on Machine Learning, 2023.
- [25] Yves Crama and Peter L. Hammer. Boolean Functions: Theory, Algorithms, and Applications, volume 142 of Encyclopedia of Mathematics and its Applications. Cambridge University Press. doi: 10.1017/CBO9780511852008.
- [27] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- [28] Dheeru Dua and Casey Graff. UCI machine learning repository, 2019.
- [29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [30] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- [31] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [32] Sebastian B. Thrun et al. The MONK’s problems: A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, 1991.
- [33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- [34] Erich Peter Klement, Radko Mesiar, and Endre Pap. Triangular Norms, volume 8 of Trends in Logic. Kluwer Academic Publishers, Dordrecht, 2000. doi: 10.1007/978-94-015-9540-7.
- [35] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing differentiable fuzzy logic operators. Artificial Intelligence, 302:103602, 2022. doi: 10.1016/j.artint.2021.103602.
- [36] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2462–2467, 2007.
- [37] YooJung Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic models. 2020.
- [38] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, volume 29, 2016.
- [39] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 525–542, 2016.
- [40] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS231N Course Report, Stanford University, 2015.
From the paper's Appendix A (Full Proofs):
Notation. For each of the 16 Boolean functions $g_j : \{0,1\}^2 \to \{0,1\}$ (indexed $j = 0, \ldots, 15$ by the truth-table bit pattern), the paper writes its unique multilinear representation $g_j(a, b) = c_{0,j} + c_{a,j}\, a + c_{b,j}\, b + c_{ab,j}\, ab$, with coefficients …
At initialization, every neuron has $\pi^{(l)}$ near uniform, so by the near-uniform bound after Proposition 2 its input Jacobian $\partial z^{(l)} / \partial(a^{(l)}, b^{(l)})$ has small norm. A layer $l$ can commit (move its $\pi^{(l)}$ away from uniform) only if it receives a non-trivial $\pi$-gradient, which requires $\partial L / \partial z^{(l)}$ to be non-trivial, which in turn requires layers $l+1, \ldots, L$ to already have non-trivial input Jacobians … all intermediate layers are simultaneously near-uniform throughout training … Hence commitment proceeds greedily from the output layer toward the input layer. At sufficient depth, early layers may never commit within a finite training budget. Assumptions used: (a) no residual/skip connections; (b) near-uniform initialization of gate logits; (c) the analysis is dynamical, not a static worst-case bound; (d) the gradient path of interest …