arxiv: 1907.00560 · v1 · pith:ADRRBSTDnew · submitted 2019-07-01 · 💻 cs.LG · stat.ML

On Symmetry and Initialization for Neural Networks

Pith reviewed 2026-05-25 11:45 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords neural networks · symmetric functions · initialization · stochastic gradient descent · generalization bounds · one hidden layer · symmetry

The pith

Symmetric initial conditions let one-hidden-layer networks learn symmetric functions efficiently with SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that, when the target function is symmetric, choosing initial weights that respect the same symmetry lets standard stochastic gradient descent converge efficiently and attain generalization guarantees on a one-hidden-layer network. A reader would care because this shows that initialization can be matched to the structure of the problem rather than left to chance, turning a hard optimization task into one with provable guarantees. The argument rests on tracking how the hidden layer and output layer interact once symmetry is built into the starting point. For random initialization, the paper reports empirically that the same guarantees do not hold.

Core claim

When the target is symmetric, symmetric initialization makes the two layers interact so that SGD produces generalization guarantees efficiently; the same does not occur under random initialization.

What carries the argument

The interaction between the hidden layer and the output layer once weights are initialized symmetrically.

If this is right

  • Standard SGD converges efficiently and yields generalization guarantees under the symmetric initialization.
  • Random initialization does not produce the same guarantees for the same symmetric targets.
  • The convergence proof follows from the specific interaction between the two layers.
  • Symmetry should be incorporated into the design of neural networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same symmetry-aware initialization idea could be tested on deeper networks by propagating the symmetry through additional layers.
  • Tasks with other forms of structure, such as invariance under permutations, might admit analogous initialization schemes.
  • Detecting symmetry in a data set before training could become a practical preprocessing step.

Load-bearing premise

The target functions are symmetric and the network has exactly one hidden layer.
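
To make the premise concrete, here is a minimal sketch under stated assumptions. This page does not spell out the class of symmetric functions, so the sketch assumes one common reading: a Boolean function that depends on its input only through the Hamming weight, with parity as the example. The function `symmetric_hidden_layer` and its all-ones weights with staggered biases are hypothetical illustrations of a symmetry-respecting start, not the initialization used in the paper.

```python
from itertools import permutations


def parity(x):
    """Assumed symmetric target: depends on x only through its Hamming weight."""
    return sum(x) % 2


def relu(z):
    return z if z > 0 else 0.0


def forward(x, hidden_w, hidden_b, out_w):
    """One hidden ReLU layer followed by a linear output unit."""
    hidden = [
        relu(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(hidden_w, hidden_b)
    ]
    return sum(a * h for a, h in zip(out_w, hidden))


def symmetric_hidden_layer(n):
    """Hypothetical symmetric start (an assumption, not the paper's scheme).

    Every hidden weight vector is all-ones, so each neuron sees only the
    Hamming weight. Neuron k is active once that weight is at least k.
    """
    hidden_w = [[1.0] * n for _ in range(n + 1)]
    hidden_b = [0.5 - k for k in range(n + 1)]
    return hidden_w, hidden_b


n = 4
hidden_w, hidden_b = symmetric_hidden_layer(n)
out_w = [0.3 * (-1) ** k for k in range(n + 1)]  # arbitrary output weights

# Reordering the input coordinates leaves the network output unchanged.
x = (1, 0, 1, 0)
outputs = {
    round(forward(p, hidden_w, hidden_b, out_w), 9) for p in permutations(x)
}
```

With this choice the set `outputs` contains a single value, because every hidden neuron computes a function of the sum of the inputs. A hidden layer with unequal weights per coordinate would not have this property in general.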

What would settle it

Run SGD on a one-hidden-layer network with symmetric initialization for a known symmetric target and check whether the held-out error stays within the predicted generalization bound; the same experiment with random initialization should fail to meet that bound.

Original abstract

This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper considers one-hidden-layer neural networks and claims that, when the target functions are symmetric, specific initial conditions can be chosen so that standard SGD training efficiently yields generalization guarantees. This is supported by empirical verification (contrasted against random initialization) and a proof that analyzes the interaction between the hidden and output layers under the symmetric initialization.

Significance. If the result holds within its stated scope, it provides a concrete illustration of how symmetry can be exploited at initialization to obtain training dynamics that produce generalization bounds, highlighting the role of layer interactions. The combination of a convergence proof and empirical checks on symmetric targets is a strength; the work is scoped explicitly to one hidden layer and exactly symmetric functions rather than claiming broader applicability.

minor comments (1)
  1. The abstract and introduction would benefit from an explicit statement of the precise class of symmetric functions considered and the form of the generalization bound obtained.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of the paper's scope and the value of combining the convergence analysis with empirical verification on symmetric targets.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper presents a theoretical result for one-hidden-layer networks on symmetric targets, with initialization chosen to exploit symmetry for SGD convergence and generalization bounds via explicit layer interaction analysis. The abstract and scope description indicate the proof relies on the stated symmetry assumptions and network architecture rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or steps reduce by construction to inputs; the contrast with random initialization and empirical verification further support independent content. This is a standard non-finding for scoped theoretical work without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no free parameters, axioms, or invented entities are explicitly introduced or fitted; the central claim rests on the domain assumption of symmetric target functions and one-hidden-layer architecture.

pith-pipeline@v0.9.0 · 5593 in / 987 out tokens · 18973 ms · 2026-05-25T11:45:36.010461+00:00 · methodology


