Most ReLU Networks Admit Identifiable Parameters

Guido Montúfar; Moritz Grillo

arXiv: 2605.03601 · v2 · pith:6WJTXNHUnew · submitted 2026-05-05 · 💻 cs.LG · cs.DM · math.CO

Pith reviewed 2026-05-21 08:26 UTC · model grok-4.3

keywords: ReLU networks · parameter identifiability · functional dimension · weighted polyhedral complexes · realization map · network symmetries · depth hierarchy

The pith

For ReLU networks whose input and hidden layers have width at least two, there is an open set of parameters that are identifiable from the realized function up to scaling and permutation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in deep ReLU networks whose input and hidden layers all have width at least two, there is an open set of parameters on which the input-output function determines the weights uniquely once scaling and reordering of hidden units are accounted for. A sympathetic reader would care because this fixes the exact dimension of the space of representable functions as the total number of parameters minus the number of hidden neurons. The authors reach the result by modeling the network as a weighted polyhedral complex whose combinatorial structure detects extra redundancies, and by proving that these redundancies vanish on an open set. They further establish that minimal realizations can retain some non-standard symmetries and that, on an open set of parameters, deeper networks realize functions that shallower networks cannot generically represent.
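The two standard symmetries mentioned above can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper: the helper names (`forward`, `rescale_neuron`, `swap_neurons`, `random_layers`) and the architecture (2, 3, 2, 1) are assumptions chosen for illustration. It rescales one hidden neuron by a positive factor, swaps two neurons in another layer, and confirms that the realized function is unchanged on random inputs.

```python
import random

def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, b, x):
    # W is a list of rows; returns W x + b.
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(layers, x):
    # layers is a list of (W, b); ReLU follows every layer except the last.
    for W, b in layers[:-1]:
        x = relu(affine(W, b, x))
    W, b = layers[-1]
    return affine(W, b, x)

def copy_layers(layers):
    return [([row[:] for row in W], b[:]) for W, b in layers]

def rescale_neuron(layers, layer, i, c):
    # Multiply incoming weights and bias of hidden neuron i by c > 0,
    # and divide its outgoing weights by c.
    assert c > 0 and 0 <= layer < len(layers) - 1
    new = copy_layers(layers)
    W, b = new[layer]
    W[i] = [c * w for w in W[i]]
    b[i] = c * b[i]
    for row in new[layer + 1][0]:
        row[i] = row[i] / c
    return new

def swap_neurons(layers, layer, i, j):
    # Exchange hidden neurons i and j: swap rows here, columns in the next layer.
    assert 0 <= layer < len(layers) - 1
    new = copy_layers(layers)
    W, b = new[layer]
    W[i], W[j] = W[j], W[i]
    b[i], b[j] = b[j], b[i]
    for row in new[layer + 1][0]:
        row[i], row[j] = row[j], row[i]
    return new

def random_layers(widths, rng):
    return [([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
             [rng.uniform(-1, 1) for _ in range(n_out)])
            for n_in, n_out in zip(widths, widths[1:])]

rng = random.Random(0)
theta = random_layers([2, 3, 2, 1], rng)  # hypothetical architecture
eta = swap_neurons(rescale_neuron(theta, 0, 1, 2.5), 1, 0, 1)
points = [[rng.uniform(-3, 3), rng.uniform(-3, 3)] for _ in range(200)]
max_gap = max(abs(forward(theta, x)[0] - forward(eta, x)[0]) for x in points)
print(max_gap)  # expected to be at floating-point rounding level
```

Identifiability in the sense used here means that, for parameters in the open set, these two operations are the only ways to change the parameters without changing the function.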

Core claim

The realization map of a ReLU network is injective up to scaling and permutation of hidden neurons on a nonempty open set of parameters whenever the input layer and every hidden layer have width at least two. Consequently the functional dimension equals the total number of parameters minus the total number of hidden neurons, and for parameters in this open set the only other parameters realizing the same function are those obtained by the standard symmetries.
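The dimension formula is simple enough to evaluate by hand, and a small helper makes the bookkeeping explicit. This is a sketch under the assumption that the parameters are the weight matrices and bias vectors of every layer, including the output layer; the function names are hypothetical.

```python
def parameter_count(widths):
    # widths = (n_0, n_1, ..., n_L, n_{L+1}): input, hidden layers, output.
    # Each layer contributes a weight matrix and a bias vector.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

def hidden_neuron_count(widths):
    return sum(widths[1:-1])

def functional_dimension(widths):
    # Formula from the core claim; it is only asserted when the input
    # and every hidden layer have width at least two.
    if any(n < 2 for n in widths[:-1]):
        raise ValueError("formula is only claimed for input and hidden widths >= 2")
    return parameter_count(widths) - hidden_neuron_count(widths)

print(functional_dimension((2, 3, 2, 1)))  # 20 parameters - 5 hidden neurons = 15
```

For the architecture (2, 3, 2, 1) the layers contribute 9, 8, and 3 parameters, so the count is 20, and subtracting the 5 hidden neurons gives 15.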

What carries the argument

Weighted polyhedral complexes that encode the arrangement of linear regions, their bounding hyperplanes, and the linear coefficients on each region, thereby exposing hidden parameter redundancies beyond scaling and permutation.
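The raw data such a complex organizes, namely which neurons are active at a point and which affine map the network computes on the surrounding linear region, can be computed directly. The sketch below is not the paper's construction of the weighted complex; it only recovers the activation pattern and the local affine piece, and the function name `local_affine_piece` is an assumption for illustration. Points lying exactly on a breakpoint hyperplane are treated as inactive.

```python
def matmul(A, B):
    # Product of two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def local_affine_piece(layers, x):
    """Return (pattern, A, c) with f(y) = A y + c for all y sharing x's activation pattern."""
    n = len(x)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    c = [0.0] * n
    pattern = []
    for k, (W, b) in enumerate(layers):
        # Compose the current affine map with the next layer.
        c = [s + bi for s, bi in zip(matvec(W, c), b)]
        A = matmul(W, A)
        if k < len(layers) - 1:
            # Evaluate pre-activations at x and zero out inactive neurons.
            pre = [s + ci for s, ci in zip(matvec(A, x), c)]
            mask = [1 if p > 0 else 0 for p in pre]
            pattern.append(tuple(mask))
            A = [[m * a for a in row] for m, row in zip(mask, A)]
            c = [m * ci for m, ci in zip(mask, c)]
    return pattern, A, c

# Hypothetical example: f(x) = relu(x1) - relu(x2).
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]), ([[1.0, -1.0]], [0.0])]
pattern, A, c = local_affine_piece(layers, [1.0, -2.0])
print(pattern, A, c)
```

At the point (1, -2) only the first hidden neuron is active, so the local piece is the linear map sending x to x1.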

If this is right

The functional dimension of every such architecture equals the number of parameters minus the number of hidden neurons.
Even a minimal functional representation can still possess non-trivial parameter redundancies.
For an open set of parameters, the realized function cannot be represented generically by any shallower network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training dynamics may converge to isolated points in parameter space once scaling and permutation are quotiented out.
The polyhedral-complex description could be used to count the number of distinct linear regions realized by a generic network.
The same counting argument might apply to other piecewise-linear activations that induce polyhedral partitions.

Load-bearing premise

The weighted polyhedral complex framework captures every possible hidden redundancy that could make parameters non-identifiable beyond scaling and permutation.

What would settle it

Exhibit a concrete ReLU network with all widths at least two together with a positive-measure open set of parameters inside which two distinct (non-scaling, non-permutation) parameter vectors realize identical functions.
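Checking a candidate counterexample requires deciding whether two parameter vectors differ only by the standard symmetries. The sketch below does this for the one-hidden-layer case only, which is an assumption made to keep the example short; deeper networks need a layer-by-layer version. The helper names are hypothetical. Each hidden neuron is normalized so that its incoming weights and bias have unit norm, the norm is pushed into the outgoing weights, and the neurons are then sorted. Values are rounded before comparison, so the test is a numerical heuristic rather than an exact decision procedure.

```python
import math

def canonical_form(W, b, V, c, decimals=9):
    # W: hidden x input weights, b: hidden biases,
    # V: output x hidden weights, c: output biases.
    neurons = []
    for i, (row, bi) in enumerate(zip(W, b)):
        norm = math.sqrt(sum(w * w for w in row) + bi * bi)
        if norm == 0.0:
            raise ValueError("neuron with zero incoming weights and bias")
        incoming = tuple(round(w / norm, decimals) for w in row) + (round(bi / norm, decimals),)
        outgoing = tuple(round(v_row[i] * norm, decimals) for v_row in V)
        neurons.append((incoming, outgoing))
    # Sorting removes the dependence on the order of hidden neurons.
    return sorted(neurons), tuple(round(ci, decimals) for ci in c)

def same_up_to_symmetry(p, q):
    return canonical_form(*p) == canonical_form(*q)

theta = ([[1.0, 2.0], [-1.0, 0.5]], [0.5, -1.0], [[2.0, -3.0]], [0.1])
# theta with neuron 0 rescaled by 4 and the two neurons swapped.
eta = ([[-1.0, 0.5], [4.0, 8.0]], [-1.0, 2.0], [[-3.0, 0.5]], [0.1])
# theta with one weight perturbed.
zeta = ([[1.0, 2.0], [-1.0, 0.6]], [0.5, -1.0], [[2.0, -3.0]], [0.1])

print(same_up_to_symmetry(theta, eta))
print(same_up_to_symmetry(theta, zeta))
```

A counterexample of the kind described above would consist of two parameter vectors that fail this kind of test yet agree as functions on the whole input space.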

Figures

Figures reproduced from arXiv: 2605.03601 by Guido Montúfar, Moritz Grillo.

**Figure 1.** Illustration of weighted polyhedral complexes, the canonical polyhedral complex and bent hyper…

**Figure 2.** Illustration of how transparency implies LRA, and how supertransversality can fail.

**Figure 3.** Illustration of the inductive construction in Theorem 4.10 that satisfies TPIC and LRA.

**Figure 4.** Illustration of bending and non-bending ridges.

**Figure 5.** Illustration of Lemma 5.21.

**Figure 6.** Canonical complexes for Example 5.26. Dashed lines indicate breakpoints that are not visible from…

**Figure 7.** Illustration of the inductive construction in Grigsby et al. (2023) for the architecture (2…

Original abstract

We study the realization map of deep ReLU networks, focusing on when a function determines its parameters up to scaling and permutation. To analyze hidden redundancies beyond these standard symmetries, we introduce a framework based on weighted polyhedral complexes. Our main result shows that for every architecture whose input and hidden layers have width at least two, there exists an open set of identifiable parameters. This implies that the functional dimension of every such architecture is exactly the number of parameters minus the number of hidden neurons. We further show that minimal functional representations can still have non-trivial parameter redundancies. Finally, we establish a generic depth hierarchy, whereby for an open set of parameters the realized function cannot be represented generically by any shallower network.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper studies the realization map of deep ReLU networks, introducing a weighted polyhedral complex framework to detect hidden redundancies beyond per-neuron scaling and discrete permutation symmetries. The central claim is that for every architecture with input dimension and all hidden-layer widths at least 2, there exists a nonempty open set of identifiable parameters; consequently the functional dimension equals the total parameter count minus the number of hidden neurons. Additional results establish that minimal functional representations may retain non-trivial redundancies and that, generically, the realized function cannot be expressed by any shallower network.

Significance. If the main existence result holds, the work supplies a sharp geometric characterization of the identifiability locus for ReLU networks and a precise formula for functional dimension. The weighted-polyhedral-complex construction is a novel tool that could be useful for analyzing other piecewise-linear architectures; the generic depth-hierarchy statement also strengthens the literature on expressivity across depths.

major comments (1)

[Main-result section] Proof of the open-set claim: the argument that the weighted polyhedral complex exhausts all continuous redundancies is load-bearing. A concrete verification is needed that no additional continuous equivalences (for example, layer-wise weight redistributions that preserve the piecewise-linear map on a positive-measure set of activation patterns) can occur when all widths are at least 2; without this, the claimed open set of identifiable parameters could be empty.

minor comments (1)

[Abstract] The abstract states that minimal functional representations 'can still have non-trivial parameter redundancies' but gives no concrete example; adding a low-dimensional illustration would clarify the distinction between functional and parametric minimality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive report. The single major comment raises a valid point about the completeness of our argument that the weighted polyhedral complex captures all continuous redundancies. We address it below and indicate the revision we will make.

Point-by-point responses

Referee: [Main-result section] Proof of the open-set claim: the argument that the weighted polyhedral complex exhausts all continuous redundancies is load-bearing. A concrete verification is needed that no additional continuous equivalences (for example, layer-wise weight redistributions that preserve the piecewise-linear map on a positive-measure set of activation patterns) can occur when all widths are at least 2; without this, the claimed open set of identifiable parameters could be empty.

Authors: We appreciate the referee highlighting the need for explicit verification that the weighted polyhedral complex rules out additional continuous equivalences. In the proof of the main theorem, the complex is constructed from the full collection of activation patterns and the associated affine maps on each polyhedral region; any continuous redundancy must map this complex to itself. Layer-wise redistributions that preserve the overall piecewise-linear function on a positive-measure set would have to preserve both the hyperplane arrangement and the linear coefficients on each chamber. When every hidden width is at least 2, the only transformations that achieve this are the standard per-neuron scalings (which are already quotiented out) and discrete permutations. We agree that this implication is not spelled out as explicitly as it could be and will add a short clarifying paragraph (or small lemma) in the revised main-result section that directly rules out non-trivial layer-wise redistributions for widths of at least 2, thereby confirming that the open set of identifiable parameters is nonempty.

revision: partial

Circularity Check

0 steps flagged

No circularity: existence proof via new geometric framework

full rationale

The paper's derivation is a self-contained mathematical existence proof. It introduces the weighted polyhedral complex framework to characterize redundancies in the realization map of ReLU networks beyond scaling and permutation symmetries, then proves that for architectures with input and hidden widths at least 2 there exists a nonempty open set of parameters identifiable up to those symmetries. This directly yields the functional-dimension formula as #parameters minus #hidden neurons. No step reduces a claimed prediction or theorem to a fitted quantity, a self-referential definition, or a load-bearing self-citation whose validity depends on the present result; the framework and its properties are developed and applied within the paper itself as independent mathematical content.
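The step from identifiability to the dimension formula can be summarized as a parameter count. The notation below (widths $n_0, \dots, n_{L+1}$, parameter count $P$, hidden-neuron count $N$, and $\dim_{\mathrm{fun}}$ for the functional dimension) is introduced here for illustration and is an assumption about conventions, not a quotation from the paper. The upper bound uses only the positive scaling symmetry; the matching lower bound is what the existence of an open set of identifiable parameters is reported to supply.

```latex
\begin{align*}
P &= \sum_{\ell=1}^{L+1} \bigl( n_{\ell-1}\, n_\ell + n_\ell \bigr),
\qquad
N = \sum_{\ell=1}^{L} n_\ell, \\
\dim_{\mathrm{fun}} &\le P - N
  && \text{(each hidden neuron has one positive scaling direction that leaves } f_\theta \text{ unchanged)}, \\
\dim_{\mathrm{fun}} &= P - N
  && \text{(claimed whenever } n_0, n_1, \dots, n_L \ge 2 \text{)}.
\end{align*}
```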

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proof relies on standard mathematical background in polyhedral geometry and neural network realization maps. No free parameters are introduced. The weighted polyhedral complex is a new modeling tool rather than an invented physical entity.

axioms (1)

domain assumption: The realization map of a ReLU network can be faithfully represented by a weighted polyhedral complex that encodes all linear regions and their weights.
Invoked to analyze redundancies beyond scaling and permutation.

invented entities (1)

weighted polyhedral complex (no independent evidence)
purpose: Framework to track hidden redundancies in ReLU network parameters
New modeling device introduced in the paper; no independent empirical evidence provided beyond the mathematical construction.

pith-pipeline@v0.9.0 · 5643 in / 1307 out tokens · 28285 ms · 2026-05-21T08:26:28.486377+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We introduce a framework based on weighted polyhedral complexes... tropical weight $c_f(\sigma) := (A_P - A_Q)\, e_{P/\sigma}$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alphaCoordinateFixationCert · unclear

Paper passage: functional dimension ... exactly the number of parameters minus the number of hidden neurons

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
