On Symmetry and Initialization for Neural Networks
Pith reviewed 2026-05-25 11:45 UTC · model grok-4.3
The pith
Symmetric initial conditions let one-hidden-layer networks learn symmetric functions efficiently with SGD.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the target is symmetric, symmetric initialization makes the two layers interact so that SGD produces generalization guarantees efficiently; the same does not occur under random initialization.
What carries the argument
The interaction between the hidden layer and the output layer once weights are initialized symmetrically.
If this is right
- Standard SGD converges efficiently and yields generalization guarantees under the symmetric initialization.
- Random initialization does not produce the same guarantees for the same symmetric targets.
- The convergence proof follows from the specific interaction between the two layers.
- Symmetry should be incorporated into the design of neural networks.
Where Pith is reading between the lines
- The same symmetry-aware initialization idea could be tested on deeper networks by propagating the symmetry through additional layers.
- Tasks with other forms of structure, such as invariance under permutations, might admit analogous initialization schemes.
- Detecting symmetry in a data set before training could become a practical preprocessing step.
Load-bearing premise
The target functions are symmetric and the network has exactly one hidden layer.
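To make the premise concrete, the following minimal sketch checks symmetry by brute force. It assumes that "symmetric" means invariant under every permutation of the input coordinates and that inputs are Boolean vectors; the page does not state the precise function class, so both are assumptions. The names `is_symmetric`, `parity`, and `first_bit` are illustrative only.

```python
from itertools import permutations, product

def is_symmetric(f, n):
    """Return True if f on {0,1}^n gives the same value under every input permutation."""
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for perm in permutations(range(n)):
            if f(tuple(x[i] for i in perm)) != fx:
                return False
    return True

def parity(x):
    # Depends only on the number of ones, so it is symmetric.
    return sum(x) % 2

def first_bit(x):
    # Depends on one specific coordinate, so it is not symmetric.
    return x[0]
```

The check enumerates all 2^n inputs and all n! permutations, so it is only practical for small n. Under the assumed definition, a symmetric Boolean function is fully determined by the number of ones in its input.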
What would settle it
Run SGD on a one-hidden-layer network with symmetric initialization for a known symmetric target and check whether the test error meets the predicted generalization bound; the same experiment with random initialization should fail to meet it.
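The experiment could be set up along the following lines. This is a sketch under stated assumptions, not the paper's construction: the symmetric scheme here (all-ones hidden weights with threshold biases), the hinge-loss update, the parity target, and every function name are hypothetical choices made for illustration.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def init_symmetric(n):
    # Assumed scheme (not taken from the paper): all-ones hidden weights, so each
    # neuron sees only the number of ones |x|; neuron k is active when |x| >= k.
    W = [[1.0] * n for _ in range(n + 1)]
    b = [0.5 - k for k in range(n + 1)]
    return W, b, [0.0] * (n + 1), 0.0

def init_random(n, rng):
    # Baseline: Gaussian hidden weights and biases, zero output layer.
    W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n + 1)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return W, b, [0.0] * (n + 1), 0.0

def hidden(W, b, x):
    return [relu(sum(wi * xi for wi, xi in zip(w, x)) + bj) for w, bj in zip(W, b)]

def predict(params, x):
    W, b, v, c = params
    return sum(vj * hj for vj, hj in zip(v, hidden(W, b, x))) + c

def sgd_train(params, target, n, steps, lr, rng):
    # Plain SGD on the hinge loss, one uniformly random Boolean sample per step.
    W, b, v, c = params
    for _ in range(steps):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = target(x)
        h = hidden(W, b, x)
        out = sum(vj * hj for vj, hj in zip(v, h)) + c
        if y * out < 1.0:  # the hinge loss is active, so take a gradient step
            for j in range(len(v)):
                if h[j] > 0.0:  # ReLU passes gradient only when active
                    g = lr * y * v[j]
                    W[j] = [wi + g * xi for wi, xi in zip(W[j], x)]
                    b[j] += g
                v[j] += lr * y * h[j]
            c += lr * y
    return W, b, v, c

def accuracy(params, target, n, samples, rng):
    hits = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        hits += (predict(params, x) > 0.0) == (target(x) > 0)
    return hits / samples

def parity_pm(x):
    # Parity as a +1/-1 label; a symmetric target.
    return 1 if sum(x) % 2 == 1 else -1

if __name__ == "__main__":
    rng = random.Random(0)
    n = 8
    sym = sgd_train(init_symmetric(n), parity_pm, n, 5000, 0.1, rng)
    rnd = sgd_train(init_random(n, rng), parity_pm, n, 5000, 0.1, rng)
    print("symmetric init:", accuracy(sym, parity_pm, n, 1000, rng))
    print("random init:   ", accuracy(rnd, parity_pm, n, 1000, rng))
```

Comparing the two printed accuracies gives a rough version of the check described above. A faithful replication would need the paper's actual initialization, network width, and step count, none of which are given on this page.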
Original abstract
This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper considers one-hidden-layer neural networks and claims that, when the target functions are symmetric, specific initial conditions can be chosen so that standard SGD training efficiently yields generalization guarantees. This is supported by empirical verification (contrasted against random initialization) and a proof that analyzes the interaction between the hidden and output layers under the symmetric initialization.
Significance. If the result holds within its stated scope, it provides a concrete illustration of how symmetry can be exploited at initialization to obtain training dynamics that produce generalization bounds, highlighting the role of layer interactions. The combination of a convergence proof and empirical checks on symmetric targets is a strength; the work is scoped explicitly to one hidden layer and exactly symmetric functions rather than claiming broader applicability.
minor comments (1)
- The abstract and introduction would benefit from an explicit statement of the precise class of symmetric functions considered and the form of the generalization bound obtained.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of the paper's scope and the value of combining the convergence analysis with empirical verification on symmetric targets.
Circularity Check
No significant circularity; derivation self-contained
The paper presents a theoretical result for one-hidden-layer networks on symmetric targets, with initialization chosen to exploit symmetry for SGD convergence and generalization bounds via explicit layer interaction analysis. The abstract and scope description indicate the proof relies on the stated symmetry assumptions and network architecture rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or steps reduce by construction to inputs; the contrast with random initialization and empirical verification further support independent content. This is a standard non-finding for scoped theoretical work without the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Theorem 1: ... network with one hidden layer, cn neurons ... poly(n) SGD updates ... generalization guarantees for symmetric functions S_n
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Lemma 3 ... marg(Y) ≥ Ω(1/n) ... initialization ... ReLU ... hidden layer embedding
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Lemma 4 ... hidden layer ... moves at most O(R²_X h² R t^{3/2}) ... output neuron reaches good state
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.