On Symmetry and Initialization for Neural Networks

Amir Yehudayoff; Ido Nachum

arxiv: 1907.00560 · v1 · pith:ADRRBSTDnew · submitted 2019-07-01 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 11:45 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords: neural networks · symmetric functions · initialization · stochastic gradient descent · generalization bounds · one hidden layer · symmetry

The pith

Symmetric initial conditions let one-hidden-layer networks learn symmetric functions efficiently with SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that, when the target function is symmetric, choosing initial weights that respect the same symmetry lets standard stochastic gradient descent (SGD) train a one-hidden-layer network efficiently and with generalization guarantees. This matters because it shows that initialization can be matched to the structure of the problem rather than left to chance, turning a hard optimization task into one with provable guarantees. The proof tracks how the hidden layer and the output layer interact once symmetry is built into the starting point. For random initialization, the paper reports empirically that the same guarantees do not hold.

Core claim

When the target is symmetric, symmetric initialization makes the two layers interact so that SGD produces generalization guarantees efficiently; the same does not occur under random initialization.

What carries the argument

The interaction between the hidden layer and the output layer once weights are initialized symmetrically.

If this is right

Standard SGD converges efficiently and yields generalization guarantees under the symmetric initialization.
Random initialization does not produce the same guarantees for the same symmetric targets.
The convergence proof follows from the specific interaction between the two layers.
Symmetry should be incorporated into the design of neural networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same symmetry-aware initialization idea could be tested on deeper networks by propagating the symmetry through additional layers.
Tasks with other forms of invariance, such as invariance under translations or rotations of the input, might admit analogous initialization schemes.
Detecting symmetry in a data set before training could become a practical preprocessing step.
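
As a rough illustration of the last point, a symmetry check on Boolean data could look like the sketch below. It assumes that "symmetric" means the label depends only on the number of ones in the input (the Hamming weight); the helper name `is_consistent_with_symmetric` is hypothetical and does not come from the paper.

```python
def is_consistent_with_symmetric(samples):
    """Return True if no two samples with the same Hamming weight carry different labels.

    `samples` is an iterable of (x, y) pairs, where x is a sequence of 0/1
    values and y is the label. A True result only means that the data does
    not contradict symmetry; it does not prove the target is symmetric.
    """
    label_by_weight = {}
    for x, y in samples:
        weight = sum(x)  # number of ones in the input
        # setdefault stores y the first time a weight is seen and otherwise
        # returns the label recorded earlier for that weight.
        if label_by_weight.setdefault(weight, y) != y:
            return False
    return True
```

On a sample labelled by parity the check passes, while a sample labelled by the first coordinate fails as soon as two inputs of equal weight disagree.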

Load-bearing premise

The target functions are symmetric and the network has exactly one hidden layer.
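
For concreteness, the premise can be written out as follows. This is a sketch under assumptions rather than the paper's own notation: it assumes Boolean inputs, the usual meaning of "symmetric" (invariance under permutations of the input coordinates), labels in {-1, +1}, and a ReLU activation. The symbols f, g, N, w_j, b_j, v_j and m are introduced here for illustration.

```latex
% Symmetric target: permuting the input coordinates never changes the value.
f(x_{\pi(1)}, \dots, x_{\pi(n)}) = f(x_1, \dots, x_n)
\quad \text{for all } x \in \{0,1\}^n \text{ and all permutations } \pi \text{ of } \{1, \dots, n\}.

% Equivalently, f depends only on the Hamming weight |x| = \sum_{i=1}^{n} x_i:
f(x) = g(|x|) \quad \text{for some } g : \{0, 1, \dots, n\} \to \{-1, +1\}.

% One-hidden-layer network with m hidden units:
N(x) = \sum_{j=1}^{m} v_j \, \sigma\big(\langle w_j, x \rangle + b_j\big),
\qquad \sigma(z) = \max(0, z).
```

Parity, which returns +1 exactly when |x| is odd, is a standard example of such a target.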

What would settle it

Run SGD on a one-hidden-layer network with symmetric initialization for a known symmetric target and check whether the predicted generalization bound appears; the same experiment with random initialization should not produce the bound.
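
The experiment described above could be set up roughly as in the following sketch. Everything specific in it is an assumption made for illustration, not the paper's construction: the target (parity), the hidden width (n + 1), the hinge loss, the hyperparameters, and in particular the symmetric initialization, which here gives every hidden neuron the all-ones weight vector and a different integer bias.

```python
import itertools
import random

def relu(z):
    return z if z > 0.0 else 0.0

def parity(x):
    # A symmetric target: the label depends only on the number of ones in x.
    return 1.0 if sum(x) % 2 == 1 else -1.0

def symmetric_init(n):
    # ASSUMPTION (hypothetical scheme): neuron j starts with all-ones weights
    # and bias 1 - j, so it computes relu(|x| - j + 1) at initialization.
    m = n + 1
    return [[1.0] * n for _ in range(m)], [1.0 - j for j in range(m)], [0.0] * m

def random_init(n, rng):
    # Baseline: Gaussian hidden weights, zero biases, zero output weights.
    m = n + 1
    W = [[rng.gauss(0.0, n ** -0.5) for _ in range(n)] for _ in range(m)]
    return W, [0.0] * m, [0.0] * m

def hidden(W, b, x):
    return [relu(sum(wi * xi for wi, xi in zip(w, x)) + bj) for w, bj in zip(W, b)]

def sgd(params, data, steps, lr, rng):
    # Plain SGD on the hinge loss max(0, 1 - y * N(x)); both layers are
    # updated in place.
    W, b, v = params
    for _ in range(steps):
        x, y = rng.choice(data)
        h = hidden(W, b, x)
        out = sum(vj * hj for vj, hj in zip(v, h))
        if y * out >= 1.0:
            continue  # zero loss, zero gradient
        for j in range(len(v)):
            if h[j] > 0.0:  # only active ReLU units receive a gradient
                g = -y * v[j]  # d loss / d pre-activation of unit j
                for i in range(len(x)):
                    W[j][i] -= lr * g * x[i]
                b[j] -= lr * g
                v[j] += lr * y * h[j]
    return W, b, v

def accuracy(params, data):
    W, b, v = params
    correct = 0
    for x, y in data:
        out = sum(vj * hj for vj, hj in zip(v, hidden(W, b, x)))
        correct += (out > 0.0) == (y > 0.0)
    return correct / len(data)

def run(n=6, steps=5000, lr=0.01, seed=0):
    # Train on half of the Boolean cube and report accuracy on the other half.
    rng = random.Random(seed)
    points = [(list(x), parity(x)) for x in itertools.product([0, 1], repeat=n)]
    rng.shuffle(points)
    half = len(points) // 2
    train, test = points[:half], points[half:]
    results = {}
    for name, params in [("symmetric", symmetric_init(n)),
                         ("random", random_init(n, rng))]:
        sgd(params, train, steps, lr, rng)
        results[name] = accuracy(params, test)
    return results
```

Calling `run()` returns held-out accuracies for both initializations. The numbers depend on the seed and hyperparameters, so the sketch shows the shape of the experiment rather than reproducing the paper's results or its bound.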

Original abstract

This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper considers one-hidden-layer neural networks and claims that, when the target functions are symmetric, specific initial conditions can be chosen so that standard SGD training efficiently yields generalization guarantees. This is supported by empirical verification (contrasted against random initialization) and a proof that analyzes the interaction between the hidden and output layers under the symmetric initialization.

Significance. If the result holds within its stated scope, it provides a concrete illustration of how symmetry can be exploited at initialization to obtain training dynamics that produce generalization bounds, highlighting the role of layer interactions. The combination of a convergence proof and empirical checks on symmetric targets is a strength; the work is scoped explicitly to one hidden layer and exactly symmetric functions rather than claiming broader applicability.

minor comments (1)

The abstract and introduction would benefit from an explicit statement of the precise class of symmetric functions considered and the form of the generalization bound obtained.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of the paper's scope and the value of combining the convergence analysis with empirical verification on symmetric targets.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper presents a theoretical result for one-hidden-layer networks on symmetric targets, with initialization chosen to exploit symmetry for SGD convergence and generalization bounds via explicit layer interaction analysis. The abstract and scope description indicate the proof relies on the stated symmetry assumptions and network architecture rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or steps reduce by construction to inputs; the contrast with random initialization and empirical verification further support independent content. This is a standard non-finding for scoped theoretical work without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no free parameters, axioms, or invented entities are explicitly introduced or fitted; the central claim rests on the domain assumption of symmetric target functions and one-hidden-layer architecture.

pith-pipeline@v0.9.0 · 5593 in / 987 out tokens · 18973 ms · 2026-05-25T11:45:36.010461+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 1: ... network with one hidden layer, cn neurons ... poly(n) SGD updates ... generalization guarantees for symmetric functions S_n

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Lemma 3 ... marg(Y) ≥ Ω(1/n) ... initialization ... ReLU ... hidden layer embedding

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Lemma 4 ... hidden layer ... moves at most O(R²_X h² R t^{3/2}) ... output neuron reaches good state

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

