Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability

Daniel Herbst; Stefanie Jegelka; Vincent Bürgin; Ya-Wei Eileen Lin

arxiv: 2606.04754 · v1 · pith:3IYJAYMAnew · submitted 2026-06-03 · 💻 cs.LG

Pith reviewed 2026-06-28 07:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords: linear mode connectivity · neuron identifiability · parameter symmetries · effective function classes · neural network loss landscape · representation merging · asymmetric models

The pith

Neuron identifiability enables representation merging without alignment and yields linear low-loss paths even in asymmetric networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework of effective function classes to capture what functions each neuron can realize on its inputs, along with the cost of doing so. It uses this to formalize neuron identifiability across independent training runs as a form of effective symmetry breaking. The central results are that networks can admit large families of approximately equivalent solutions even in structurally asymmetric models, and that neuron identifiability permits direct merging of representations, with a characterization of when such merging admits a linear low-loss path. A sympathetic reader cares because the argument moves beyond fixed structural symmetries to explain more of the observed connectivity in the loss landscape of trained networks.

Core claim

Our analysis shows that neural networks can admit large families of approximately equivalent solutions even in structurally asymmetric models. We further show that neuron identifiability enables representation merging without prior alignment, and characterize when such merging admits a linear low-loss path. These findings highlight the role of effective function classes in affecting the loss landscape.

What carries the argument

Effective function classes: the set of functions a neuron can realize on its input support, together with the norm cost of realizing them. The paper uses them to formalize effective symmetry breaking via neuron identifiability.
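
As a hedged illustration only (the paper's own definition is not reproduced on this page), consider the simplest case of a single linear pre-activation with weight vector w in R^d, with n inputs stacked as the rows of a matrix X. All symbols below (X, w, the class F(X), and the cost c_X) are assumptions introduced for this sketch, not the paper's notation:

```latex
% Toy instance: linear neuron on a finite input set. Notation is hypothetical.
\[
  \mathcal{F}(X) = \{\, Xw : w \in \mathbb{R}^{d} \,\} \subseteq \mathbb{R}^{n},
  \qquad
  c_X(f) = \min_{w \,:\, Xw = f} \lVert w \rVert_2^2
         = f^{\top} (X X^{\top})^{+} f
  \quad \text{for } f \in \mathcal{F}(X).
\]
```

In this toy case the set of realizable functions is the column space of X, and the cost is a quadratic form in the pseudoinverse of the Gram matrix X Xᵀ, so output patterns that the data supports only weakly are expensive to realize. Whether this coincides with the paper's construction is an assumption.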

If this is right

Structurally asymmetric models still contain large families of approximately equivalent solutions.
Representation merging becomes possible without any prior alignment step when neurons are identifiable.
Linear low-loss paths between merged solutions exist under conditions tied to the effective function classes.
The loss landscape connectivity is shaped by effective function classes beyond fixed architectural symmetries.
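
The linear low-loss path in the points above is usually quantified by a loss barrier along the straight line between two solutions. The following is a minimal sketch in plain Python, assuming the common barrier definition from the linear mode connectivity literature (the paper's exact metric may differ). The names `interpolate`, `loss_barrier`, `double_well`, and `convex` are hypothetical and exist only for this illustration:

```python
from typing import Callable, List, Sequence


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], t: float) -> List[float]:
    """Return (1 - t) * theta_a + t * theta_b, a point on the straight line."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    loss: Callable[[Sequence[float]], float],
    theta_a: Sequence[float],
    theta_b: Sequence[float],
    steps: int = 11,
) -> float:
    """Largest excess of the path loss over the linear blend of endpoint losses.

    Assumes steps >= 2 so that both endpoints are included.
    """
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    barrier = 0.0
    for i in range(steps):
        t = i / (steps - 1)
        path_loss = loss(interpolate(theta_a, theta_b, t))
        baseline = (1.0 - t) * loss_a + t * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def double_well(theta: Sequence[float]) -> float:
    """Toy loss with two minima at +1 and -1 separated by a bump at 0."""
    return (theta[0] ** 2 - 1.0) ** 2


def convex(theta: Sequence[float]) -> float:
    """Toy convex loss: no barrier can appear on any straight line."""
    return theta[0] ** 2


print(loss_barrier(double_well, [1.0], [-1.0]))  # 1.0
print(loss_barrier(convex, [1.0], [-1.0]))       # 0.0
```

A barrier near zero means the two solutions are linearly mode connected under this definition. In practice `loss` would evaluate a network on held-out or training data, and the parameter vectors would be the flattened weights of the two runs.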

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training runs may converge to solutions that differ mainly in how identifiable neurons are assigned rather than in the functions they compute.
Techniques that rely on post-training alignment could be simplified or replaced when identifiability already holds.
The same framework might be applied to understand connectivity after pruning or in continual learning settings where representations are reused.

Load-bearing premise

The formalization of effective function classes accurately captures the relevant interplay between parameters, data, and representations that determines practical symmetries and merging behavior.

What would settle it

An experiment or counter-example in which identifiable neurons across runs do not permit merging without alignment or in which the merged path fails to remain low-loss.
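
One way to probe such an experiment, in the spirit of the activation matching comparison described in the Figure 3 caption, is to compare how well neurons of two runs agree under the identity, a random, and the best permutation. The sketch below is an assumption-laden illustration: it uses mean Pearson correlation as the matching objective (the paper's objective may differ), brute-forces the optimum so it only works for tiny widths, and all function names are hypothetical:

```python
import itertools
import math
import random
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two activation traces; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0.0


def matching_objective(
    acts_a: List[Sequence[float]], acts_b: List[Sequence[float]], perm: Sequence[int]
) -> float:
    """Mean correlation between neuron i of run A and neuron perm[i] of run B."""
    return sum(pearson(acts_a[i], acts_b[perm[i]]) for i in range(len(perm))) / len(perm)


def compare_permutations(
    acts_a: List[Sequence[float]], acts_b: List[Sequence[float]], seed: int = 0
) -> Dict[str, float]:
    """Objective under the optimal, identity, and one random permutation.

    acts_x[i] holds the activations of neuron i over a shared batch of inputs.
    The optimum is found by enumerating all m! permutations (tiny m only).
    """
    m = len(acts_a)
    identity = list(range(m))
    shuffled = identity[:]
    random.Random(seed).shuffle(shuffled)
    best = max(
        matching_objective(acts_a, acts_b, p) for p in itertools.permutations(range(m))
    )
    return {
        "optimal": best,
        "identity": matching_objective(acts_a, acts_b, identity),
        "random": matching_objective(acts_a, acts_b, shuffled),
    }
```

If neurons are identifiable across runs, the identity objective should sit close to the optimal one and well above the random one; if the identity is no better than random, alignment is still doing the work. For realistic widths the brute-force search would be replaced by a linear assignment solver such as `scipy.optimize.linear_sum_assignment`.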

Figures

Figures reproduced from arXiv: 2606.04754 by Daniel Herbst, Stefanie Jegelka, Vincent Bürgin, Ya-Wei Eileen Lin.

**Figure 1.** Illustration of neuron identifiability. (Left) Structural parameter symmetry broken but functions remain indistinguishable on data X. (Right) Neurons identifiable, effective symmetry breaking enables merging representations without alignment. Surrounding text: Formally, common architectures admit large parameter symmetry groups (Hecht-Nielsen, 1990; Zhao et al., 2026), shapin…

**Figure 2.** Effective function classes and learned features (run A/run B) with varying levels of symmetry. Col. 1: In a fully symmetric MLP, each neuron can implement the same functions on X. Col. 2: Pruning weights affects S_i and can introduce anisotropy. Col. 3: Fixed weights via F introduce functional biases.

**Figure 3.** Activation matching objectives of optimal, identity, and random permutations for networks trained with different values of the fixed weight scale σ_F. We average the objectives of different layers and use post-norm, post-activation function values. Surrounding text (6. Experimental Results): In this section, we empirically investigate the effects that our theory predicts. We check which variables influence the effectiveness of…

**Figure 4.** Aligned and unaligned training accuracy LMC interpolation for standard models and asymmetric models (average of 8 model pairs each). Alignment using activation matching. Surrounding text: Q1: How do existing symmetry breaking architectures perform in terms of permutation-aligned and unaligned LMC? We train standard and asymmetric models on MNIST and CIFAR-10: MLPs, W-MLPs (Lim et al., 2024b), and syre-MLPs (Ziyin et al., …

**Figure 6.** LMC dependence on coherence ν(U) in the F = 0 setting: LMC barriers on a synthetic dataset with varying ν(U). Surrounding text: Q4: How does the input subspace coherence ν(U) control the effectiveness of symmetry breaking? Thm. 4.4 suggests that on highly coherent data (i.e., when the data's principal directions are aligned well with the standard basis axes and hence ν(U) ≈ 1), the Gram matrix S_i can become more anisotropic. I…

**Figure 7.** Surrounding text: Figure 7 reveals that, in fact, the identity does not produce a higher objective than random permutations, and therefore, similar to W-asym. networks with σ_F = 0, symmetries are not broken effectively. Panel columns: All layers, After layer 1, After layer 2, After layer 3. Panel rows: MLP, W-MLP (σ_F = 0), …

**Figure 8.** Activation matching objectives per layer (W-MLP on MNIST), sweeping over fixed weight scale σ_F. Legend: Optimal, Identity, Random. Extracted panel labels: (a) BG 1 residual stream, (b) Layer 1.1 inner, (c) Layer 1.2…

**Figure 9.** Activation matching objectives separated by layer/activation matching point (W-ResNet on CIFAR-10). In our W-ResNets, activation matching points lie within each of the three block groups' (BG) residual streams (used to estimate a global permutation for the residual stream that multiple layers write into), and between the two linear layers of the three inner two-layer MLPs within each block group.

**Figure 10.** Activation matching objectives per layer (W-asymmetric MLPs on MNIST), sweeping over sparsity parameter (proportion of fixed weights) for both σ_F = 0 (standard sparse training) and σ_F = 1.

**Figure 11.** Minimum pairwise projected center distance γ_out (solid, mean ± std) vs. predicted asymptotic rate c_k m^(−2/k) (dotted). Surrounding text: We also examine how the minimum distance between projected neuron centers changes with layer width m. For randomly sampled F we observe the expected scaling behavior from Thm. 4.6. To this end, we sweep the hidden dimension m and intrinsic dimension k ∈ {2, 8, 32} on the Gaussian mixture d…

**Figure 12.** Distributions of neuron swap costs by architecture on MNIST. Plotted are signed square-root transformed ∆out (ij) for disjoint consecutive pairs (i, j) of neurons, estimated via Mahalanobis distance (10) and ridge regression (Def. E.3, β = 0.01). Panels: Mahalanobis estimate, ridge regression estimate.

**Figure 13.** Neuron swap costs for W-MLPs with σ_F ∈ {0, 1} on Gaussian mixture data. Plotted are signed square-root transformed ∆out (ij) for disjoint consecutive pairs (i, j) of neurons, estimated via Mahalanobis distance (10) and ridge regression (Def. E.3, β = 0.01).

**Figure 14.** Neuron swap costs ∆out (ij) (signed square-root transformed) estimated via Mahalanobis distance on Gaussian mixture data, for varying intrinsic dimension k and fixed weight scale σ_F (same as …

**Figure 15.** Neuron swap costs ∆out (ij) (signed square-root transformed) estimated via ridge regression, β = 0.01 (see Def. E.3) on Gaussian mixture data, for varying intrinsic dimension k and fixed weight scale σ_F.

**Figure 16.** Neuron swap costs ∆out (ij) (signed square-root transformed) per layer for standard MLP trained on MNIST, estimated via Mahalanobis distance (10) vs. ridge regression (Def. E.3, β = 0.01). Layers 1–3 have hidden dimension 512; shown are costs for the first 128 × 128 neuron pairs. Panels: Mahalanobis estimate and ridge regression estimate, one column per layer (Layer 1, Layer 2, …); axes: neuron i, neuron j.

**Figure 17.** Neuron swap costs ∆out (ij) (signed square-root transformed) per layer for W-asymmetric MLP (σ_F = 0) trained on MNIST, estimated via Mahalanobis distance (10) and ridge regression (Def. E.3, β = 0.01). Layers 1–3 have hidden dimension 512; shown are costs for the first 128 × 128 neuron pairs.

**Figure 18.** Neuron swap costs ∆out (ij) (signed square-root transformed) per layer for W-asymmetric MLP (σ_F = 1) trained on MNIST, estimated via Mahalanobis distance (10) and ridge regression (Def. E.3, β = 0.01). Layers 1–3 have hidden dimension 512; shown are costs for the first 128 × 128 neuron pairs. Panels: Mahalanobis estimate and ridge regression estimate, one column per layer (Layer 1, …); axes: neuron i, neuron j.

**Figure 19.** Neuron swap costs ∆out (ij) (signed square-root transformed) per layer for syre-MLP (σ_F = 1) trained on MNIST, estimated via Mahalanobis distance (10) and ridge regression (Def. E.3, β = 0.01). Layers 1–3 have hidden dimension 512; shown are costs for the first 128 × 128 neuron pairs.

**Figure 20.** LMC of Transformer and WTransformer on CIFAR-10, measured by training accuracy along the interpolation path.

**Figure 21.** Subspace coherences ν(U) computed for model inputs, outputs, and hidden representations, on MNIST data, for standard and asymmetric MLP variants over training. Surrounding text: … without effective symmetry breaking (MLP and W-asymmetric MLP with zero fixed weights) exhibit low subspace coherence, in particular in later layers, while W-asymmetric MLPs and syre-MLPs with large fixed weights both tend to exhibit higher subspac…

Original abstract

Many striking phenomena in deep learning, such as linear mode connectivity and the structured behavior of training dynamics, are closely tied to parameter symmetries: transformations that leave the realized function unchanged. Despite growing attention to parameter symmetries, the exact interplay between parameters, data, and representations remains underexplored. To investigate this, we develop a theoretical framework of effective function classes, i.e., the set of functions a neuron can realize on its input support, and the norm cost of realizing them. We then formalize effective symmetry breaking via neuron identifiability across independent training runs. Our analysis shows that neural networks can admit large families of approximately equivalent solutions even in structurally asymmetric models. We further show that neuron identifiability enables representation merging without prior alignment, and characterize when such merging admits a linear low-loss path. These findings highlight the role of effective function classes in affecting the loss landscape.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's new angle is using effective function classes to formalize neuron identifiability and symmetry breaking beyond structural symmetries, but the claims hinge on an untested definition that may miss neuron interactions.

The core idea here is a framework built around effective function classes for individual neurons (the set of functions they can realize on their input support, plus the norm cost) to formalize how identifiability breaks symmetries across independent runs. This leads to their claim that networks have large families of roughly equivalent solutions even in structurally asymmetric models, and that identifiability allows representation merging without alignment, with a linear low-loss path under conditions the paper characterizes.

That moves past the usual permutation and scaling symmetries in the mode connectivity literature by trying to tie things more directly to data and representations. The attempt to make symmetry discussions more data-dependent is a reasonable step.

The soft spot is the load-bearing assumption that the per-neuron effective function class definition accurately determines practical symmetries and merging behavior. If it overlooks interactions across neurons or fails to handle effects from ReLUs and realistic data distributions, the symmetry-breaking argument and the characterization of linear paths would not follow. The abstract gives no derivations or checks on this point, so it is difficult to judge whether the math closes the gap.

Citations look standard for the area. No obvious circularity shows up.

This is for readers working on loss landscapes and model merging who want a more representation-oriented take on symmetries. It could be worth sending to a serious referee to see if the formalization and any experiments hold up, even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The paper develops a theoretical framework centered on effective function classes—the set of functions realizable by a neuron on its input support together with the associated norm cost—to formalize neuron identifiability across independent training runs. It claims this framework reveals large families of approximately equivalent solutions even in structurally asymmetric networks, enables representation merging without prior alignment, and characterizes conditions under which such merging admits a linear low-loss path, thereby highlighting the role of effective function classes in shaping the loss landscape beyond structural symmetries.

Significance. If the central claims hold, the work provides a data- and representation-dependent lens on symmetries that could explain linear mode connectivity and model merging in settings where structural symmetries are absent. The introduction of effective function classes as an analytical tool is a conceptual contribution, though the manuscript does not appear to deliver machine-checked proofs, reproducible code, or explicit falsifiable predictions that would strengthen its impact.

major comments (2)

[theoretical framework section] The central argument that neuron identifiability yields large families of approximately equivalent solutions and enables unaligned linear merging rests on the claim that effective function classes accurately capture the interplay between parameters, data, and representations. No derivation or counter-example verification is supplied showing that this formalization remains valid under ReLU nonlinearities or realistic data distributions that may induce higher-order interactions across neurons (see the definition and subsequent analysis of effective function classes).
[analysis of linear paths] The characterization of when merging admits a linear low-loss path is presented as following from neuron identifiability, yet the manuscript provides no explicit test or bound demonstrating that the effective-function-class construction is independent of the specific data support or norm costs in a way that would survive perturbations to the input distribution.

minor comments (2)

[abstract and introduction] The abstract and introduction use the term 'approximately equivalent solutions' without a precise quantitative definition (e.g., in terms of function distance or loss difference) that is later tied back to the effective-function-class norm cost.
[theoretical framework] Notation for the effective function class and its norm cost should be introduced with an explicit equation or set notation to avoid ambiguity when the framework is applied to merging.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our theoretical framework. We address each major comment below and indicate the corresponding revisions.

Referee: [theoretical framework section] The central argument that neuron identifiability yields large families of approximately equivalent solutions and enables unaligned linear merging rests on the claim that effective function classes accurately capture the interplay between parameters, data, and representations. No derivation or counter-example verification is supplied showing that this formalization remains valid under ReLU nonlinearities or realistic data distributions that may induce higher-order interactions across neurons (see the definition and subsequent analysis of effective function classes).

Authors: We agree that explicit verification strengthens the framework. The current manuscript defines effective function classes in a manner intended to be general, but does not include a dedicated derivation for ReLU or counter-example checks against higher-order neuron interactions. In the revision we will add a subsection deriving the effective function class for ReLU neurons on finite support and include a simple counter-example illustrating robustness (or breakdown) under cross-neuron interactions. revision: yes
Referee: [analysis of linear paths] The characterization of when merging admits a linear low-loss path is presented as following from neuron identifiability, yet the manuscript provides no explicit test or bound demonstrating that the effective-function-class construction is independent of the specific data support or norm costs in a way that would survive perturbations to the input distribution.

Authors: The manuscript characterizes linear paths via neuron identifiability but does not supply perturbation bounds on the effective-function-class construction. We will add an explicit stability bound (in the form of a Lipschitz-style estimate) showing how changes in input distribution affect the norm cost and the resulting linear-path guarantee, together with a brief numerical illustration on a synthetic distribution shift. revision: yes

Circularity Check

0 steps flagged

No circularity: new framework definitions do not reduce claims to inputs by construction

The provided abstract introduces 'effective function classes' as a novel theoretical construct to formalize neuron identifiability and symmetry breaking. No equations, fitted parameters, or self-citations are present that would make any 'prediction' equivalent to the inputs by definition. The claims about large families of equivalent solutions and linear merging follow from the stated framework rather than assuming the result. Absent any quotable reduction (self-definitional, fitted-input, or load-bearing self-citation), the derivation is treated as self-contained, with a circularity score of 0. Consulting the full text does not alter this, as no circular steps can be exhibited from the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the new definitions of effective function classes and neuron identifiability; no free parameters are mentioned, but the framework assumes standard neural network symmetries exist and that norm costs meaningfully distinguish realizable functions.

axioms (1)

domain assumption · Parameter symmetries exist that leave the realized function unchanged.
Stated in the opening sentence as the starting point for the phenomena studied.

invented entities (2)

effective function classes · no independent evidence
purpose: The set of functions a neuron can realize on its input support together with the norm cost of realizing them.
Newly introduced to investigate the interplay between parameters, data, and representations.

neuron identifiability · no independent evidence
purpose: Formalization of effective symmetry breaking across independent training runs.
New concept used to enable merging without prior alignment.

Reference graph

Works this paper leans on

134 extracted references · 20 canonical work pages · 4 internal anchors

[1] Git Re-Basin: Merging Models modulo Permutation Symmetries. The Eleventh International Conference on Learning Representations.
[2] Zipit! merging models from different tasks without training. arXiv preprint arXiv:2305.03053.
[3] Small Ball Probabilities for Linear Images of High-Dimensional Distributions. International Mathematics Research Notices.
[4] On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems.
[5] Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, and Behnam Neyshabur. …
[6] Re-basin via implicit … Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Federated learning with matched averaging. arXiv preprint arXiv:2002.06440.
[8] Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
[9] Symmetry teleportation for accelerated optimization. Advances in Neural Information Processing Systems.
[10] Roman Vershynin. High-…
[11] Improving Convergence and Generalization Using Parameter Symmetries. The Twelfth International Conference on Learning Representations.
[12] The empirical impact of neural parameter symmetries, or lack thereof. Advances in Neural Information Processing Systems.
[13] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks.
[14] How do infinite width bounded norm networks look in function space? The Thirty Second Annual Conference on Learning Theory.
[15] Exact mean square linear stability analysis for SGD. The Thirty Seventh Annual Conference on Learning Theory, 2024.
[16] Rank diminishing in deep neural networks. Advances in Neural Information Processing Systems.
[17] Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 2020.
[18] The Low-Rank Simplicity Bias in Deep Networks. Transactions on Machine Learning Research.
[19] Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems.
[20] The Intrinsic Dimension of Images and Its Impact on Learning. International Conference on Learning Representations.
[21] All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research.
[22] Parameter identifiability of a deep feedforward ReLU neural network. Machine Learning, 2023.
[23] Predicting is not understanding: Recognizing and addressing underspecification in machine learning. European Conference on Computer Vision, 2022.
[24] Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research.
[25] Remove Symmetries to Control Model Expressivity and Improve Optimization. The Thirteenth International Conference on Learning Representations.
[26] Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911.
[27] Sharp minima can generalize for deep nets. International Conference on Machine Learning, 2017.
[28] Deep networks on toroids: removing symmetries reveals the structure of flat regions in the landscape geometry. International Conference on Machine Learning, 2022.
[29] Similarity of neural network representations revisited. International Conference on Machine Learning, 2019.
[30] Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454.
[31] Improving the identifiability of neural networks for Bayesian inference. NIPS Workshop on Bayesian Deep Learning.
[32] Hidden symmetries of ReLU networks. International Conference on Machine Learning, 2023.
[33] Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. Advances in Neural Information Processing Systems.
[34] The implicit bias of gradient descent on separable data. Journal of Machine Learning Research.
[35] Network morphism. International Conference on Machine Learning, 2016.
[36] Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics. International Conference on Learning Representations.
[37] Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015.
[38] A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors. The Twelfth International Conference on Learning Representations.
[39] Liu Ziyin, Yizhou Xu, Tomaso Poggio, and Isaac Chuang. Parameter… doi:10.48550/arXiv.2502.05300.
[40] Minhak Song, Kwangjun Ahn, and Chulhee Yun. Does…
[41] Understanding Mode Connectivity via Parameter Space Symmetry. Forty-second International Conference on Machine Learning.
[42] Yunis, K… Approaching deep learning through the spectral dynamics of weights. arXiv preprint arXiv:2408.11804.
[43] On convexity and linear mode connectivity in neural networks. OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).
[44] On the spectral bias of neural networks. International Conference on Machine Learning, 2019.
[45] Theo Putterman, Derek Lim, Yoav Gelberg, Michael M Bronstein, Stefanie Jegelka, and Haggai Maron. …
[46] Predicting neural network accuracy from weights. arXiv preprint arXiv:2002.11448.
[47] Classifying the classifier: dissecting the weight space of neural networks. arXiv preprint arXiv:2002.05688.
[48] Hyper-representations as generative models: Sampling unseen neural network weights. Advances in Neural Information Processing Systems.
[49] LoRA: Low-rank adaptation of large language models. ICLR.
[50] Improved Generalization of Weight Space Networks via Augmentations. Forty-first International Conference on Machine Learning.
[51]

Equivariant Deep Weight Space Alignment. Forty-first International Conference on Machine Learning.
[52]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations.
[53]

Bo Zhao, Robin Walters, and Rose Yu (June). Symmetry in. doi:10.48550/arXiv.2506.13018
[54]

Guy Zamir, Aryan Dokania, Bo Zhao, and Rose Yu (April). Improving. doi:10.48550/arXiv.2504.15399
[55]

Lucas Laird, Bo Zhao, Rose Yu, and Robin Walters (June). Data-
[56]

Understanding Mode Connectivity via Parameter Space Symmetry.
[57]

Symmetries, Flat Minima, and the Conserved Quantities of Gradient Flow. The Eleventh International Conference on Learning Representations.
[58]

Functions of bounded variation and free discontinuity problems. 2000.
[59]

Lawrence Craig Evans and Ronald F. Gariepy (April). Measure
[60]

Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, and Nicolas Flammarion (June). A. doi:10.48550/arXiv.2302.07011
[61]

Symmetry breaking in neural network optimization: insights from input dimension expansion. npj Artificial Intelligence, 2025. doi:10.1038/s44387-025-00010-0
[62]

Tommy Rochussen (May). Structured
[63]

Decoupled Weight Decay Regularization. International Conference on Learning Representations.
[64]

Ali Rahimi and Benjamin Recht. Weighted. Advances in
[65]

The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. International Conference on Learning Representations.
[66]

Linear Mode Connectivity and the Lottery Ticket Hypothesis. International Conference on Machine Learning.
[67]

Norm-based capacity control in neural networks. Conference on Learning Theory, 2015.
[68]

Representation transfer by optimal transport. arXiv preprint arXiv:2007.06737.
[69]

Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems.
[70]

What is being transferred in transfer learning? Advances in Neural Information Processing Systems.
[71]

A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models. Advances in Neural Information Processing Systems.
[72]

Model fusion via optimal transport. Advances in Neural Information Processing Systems.
[73]

Graph Metanetworks for Processing Diverse Neural Architectures. The Twelfth International Conference on Learning Representations.
[74]

Explaining landscape connectivity of low-cost solutions for multilayer nets. Advances in Neural Information Processing Systems.
[75]

L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350.
[76]

Generalized Linear Mode Connectivity for Transformers. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[77]

Neural functional transformers. Advances in Neural Information Processing Systems.
[78]

The expressive power of low-rank adaptation. arXiv preprint arXiv:2310.17513.
[79]

On the algebraic structure of feedforward network weight spaces. Advanced Neural Computers, 1990.
[80]

Deep Linear Probe Generators for Weight Space Learning. The Thirteenth International Conference on Learning Representations.

Showing first 80 references.
