Can an MLP Absorb Its Own Skip Connection?
Pith reviewed 2026-05-08 06:34 UTC · model grok-4.3
The pith
For generic weights, a skip connection around a single-hidden-layer MLP cannot be absorbed into a residual-free network of the same width.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set S of size at least d such that W_down[:,S] W_up[S,:] = -I_d. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. For homogeneous activations of degree k ≠ 1 and for gated activations with g(0)=0, absorption is unconditionally impossible, and these results extend to arbitrary depth.
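To pin down the notation (which the referee below also asks to see defined at first use), here is the single-block setup as it can be reconstructed from the abstract; the paper's exact conventions may differ slightly. With $x \in \mathbb{R}^d$, $W_{\mathrm{up}}, W'_{\mathrm{up}} \in \mathbb{R}^{m \times d}$, and $W_{\mathrm{down}}, W'_{\mathrm{down}} \in \mathbb{R}^{d \times m}$ for a shared hidden width $m$, absorption asks whether

$$x + W_{\mathrm{down}}\,\sigma(W_{\mathrm{up}} x) \;=\; W'_{\mathrm{down}}\,\sigma(W'_{\mathrm{up}} x) \quad \text{for all } x \in \mathbb{R}^d$$

can hold for some choice of $W'_{\mathrm{up}}, W'_{\mathrm{down}}$.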
What carries the argument
For ReLU and GELU, the existence of an index set S with W_down[:,S] W_up[S,:] = -I_d, which decides single-block absorption; for other activations, the homogeneity degree or the linearization of the gate at zero.
If this is right
- A composition of L residual blocks using homogeneous activations of degree k ≠ 1 cannot be replicated by any composition of L residual-free blocks of the same width (a numeric sketch of the underlying degree obstruction follows this list).
- For gated activations whose gate is differentiable at the origin with g(0)=0, the same depth-L impossibility holds by linearization.
- At the single-block level for ReLU or GELU, absorption occurs only on the measure-zero set of weights obeying the index-set condition.
- The question of whether disjointness persists for deep stacks of ReLU or GELU blocks is left unresolved by the paper.
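As a concrete illustration of the degree obstruction from the first bullet, the following sketch (not from the paper; the dimensions, weights, and the ReLU² choice are arbitrary) checks numerically that a single residual block with a degree-2 homogeneous activation breaks the scaling law g(λx) = λ²g(x) that every residual-free block with the same activation must obey:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 8  # input dim and hidden width (arbitrary choices)

W_up = rng.standard_normal((m, d))
W_down = rng.standard_normal((d, m))

def relu2(z):
    # ReLU^2 is positively homogeneous of degree k = 2.
    return np.maximum(z, 0.0) ** 2

def residual_block(x):
    return x + W_down @ relu2(W_up @ x)

x = rng.standard_normal(d)
lam, k = 2.0, 2

# Any residual-free block g(x) = W'_down relu2(W'_up x) satisfies
# g(lam * x) = lam**k * g(x) for lam > 0.  The residual block does not:
lhs = residual_block(lam * x)
rhs = lam**k * residual_block(x)

# The mismatch is exactly (lam - lam**k) * x, contributed by the skip branch,
# which scales linearly while the MLP branch scales with degree k.
print(np.allclose(lhs - rhs, (lam - lam**k) * x))  # True
print(np.allclose(lhs, rhs))                       # False for x != 0
```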
Where Pith is reading between the lines
- If the single-block disjointness for generic ReLU and GELU extends to depth, then residual connections supply an irreducible form of expressivity that cannot be recovered merely by increasing width.
- Architectural searches that replace explicit skips with learned internal transformations may therefore lose coverage of certain function families.
- The open deep-composition question could be probed by constructing explicit low-dimensional counterexamples that match a skip block but cannot be matched by any skip-free stack of equal width.
Load-bearing premise
The weight matrices are generic, drawn from a continuous distribution so that the specific index-set equality fails with probability one.
What would settle it
Sample the weights of a skip-connected MLP from a continuous distribution and check whether any index set S of size at least d satisfies the exact matrix equality W_down[:,S] W_up[S,:] = -I_d; if no such S exists for almost all draws, the two classes are generically disjoint.
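A minimal sketch of that experiment, assuming nothing beyond NumPy; dimensions are kept small so that enumerating every index set of size at least d stays cheap, and "exact equality" is read as agreement up to floating-point tolerance:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 6          # small dims so subset enumeration stays tractable
trials = 1000
tol = 1e-9           # 'exact' equality up to floating-point noise

hits = 0
for _ in range(trials):
    # Continuous (Gaussian) draws stand in for 'generic' weights.
    W_up = rng.standard_normal((m, d))
    W_down = rng.standard_normal((d, m))
    found = any(
        np.allclose(W_down[:, S] @ W_up[S, :], -np.eye(d), atol=tol)
        for size in range(d, m + 1)
        for S in map(list, itertools.combinations(range(m), size))
    )
    hits += found

# Under continuous distributions the condition should essentially never
# hold, matching the paper's probability-one failure claim.
print(f"condition satisfied in {hits}/{trials} random draws")  # expect 0
```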
Original abstract
We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree $k \neq 1$, such as ReLU$^2$ and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with $g(0) = 0$, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of $L$ residual blocks using such activations cannot be replicated by any composition of $L$ residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set $S$ of size at least $d$ such that $W_{\mathrm{down}}[:,S]\,W_{\mathrm{up}}[S,:] = -I_d$. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. For homogeneous activations of degree k ≠ 1 (e.g., ReLU²) and gated activations with g(0)=0 (e.g., SwiGLU), it proves unconditional impossibility via degree and linearization arguments, with extensions to arbitrary depth. For ReLU and GELU, it gives an if-and-only-if algebraic condition: absorption holds precisely when there exists an index set S of size at least d such that W_down[:,S] W_up[S,:] = -I_d. This condition fails with probability one for generic (continuous-distribution) weights, implying that skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. The deep-composition case for ReLU/GELU is left explicitly open.
Significance. If the results hold, the work supplies a clean algebraic and analytic separation between residual and non-residual function classes, which bears directly on representational capacity questions in deep learning theory. The probability-one generic-disjointness statement, the unconditional impossibility results for other activations, and the explicit open-problem flag for deep ReLU/GELU stacks are all strengths; the derivations rest on standard linear algebra and activation properties rather than fitted quantities.
minor comments (3)
- The abstract and introduction should define the notation W_down and W_up at first use rather than assuming familiarity with the single-block diagram.
- In the ReLU/GELU section, the statement that the index-set condition 'fails with probability one' would benefit from a one-sentence reminder of the measure-theoretic setting (continuous distributions on the weight space) to avoid ambiguity for readers less versed in measure theory.
- The extension of the impossibility results to arbitrary depth is stated clearly, but a brief remark on whether the same width constraint is preserved across all L blocks would remove a minor source of potential confusion.
Simulated Author's Rebuttal
We thank the referee for the careful and positive evaluation of the manuscript, including the clear summary of our results and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's central results follow from direct algebraic manipulations and analytic arguments on the MLP equations. The reduction of general invertible skips to the identity case uses only matrix invertibility. Impossibility for homogeneous activations of degree k≠1 is obtained by comparing polynomial degrees on both sides. The linearization argument for gated activations with g(0)=0 likewise compares the first-order behavior at the origin. For ReLU/GELU the iff characterization is obtained by solving the functional equation W_down σ(W_up x) + x = W'_down σ(W'_up x) and isolating the necessary index-set condition on the weights; this condition is then shown to have measure zero under continuous distributions by standard facts of linear algebra over R. No step equates a derived quantity to a fitted input, renames a known result, or relies on a self-citation whose content is itself unverified. The open question for deep ReLU/GELU stacks is stated explicitly rather than assumed away.
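To make the sufficiency direction of the iff tangible, here is a sketch that manufactures weights satisfying the index-set condition and then exhibits a same-width residual-free ReLU block matching the skip block. The construction (flip the sign of the S-rows of W_up, reusing the identity x − ReLU(x) = −ReLU(−x)) is mine, reconstructed from the stated condition; the paper's own proof may proceed differently:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 6
S = list(range(d))  # index set of size exactly d; rows d..m-1 are its complement

# Force the index-set condition: W_down[:,S] @ W_up[S,:] = -I_d.
A = rng.standard_normal((d, d))  # generic, hence invertible w.p. 1
W_up = np.vstack([A, rng.standard_normal((m - d, d))])
W_down = np.hstack([-np.linalg.inv(A), rng.standard_normal((d, m - d))])

def relu(z):
    return np.maximum(z, 0.0)

def skip_block(x):
    return x + W_down @ relu(W_up @ x)

# Candidate residual-free block of the same width: flip the sign of the
# S-rows of W_up and keep W_down.  This works because
#   x - A^{-1} relu(A x) = -A^{-1} relu(-A x),  from x - relu(x) = -relu(-x),
# while the complementary hidden units are untouched.
W_up2 = W_up.copy()
W_up2[S, :] *= -1.0

def free_block(x):
    return W_down @ relu(W_up2 @ x)

x = rng.standard_normal(d)
print(np.allclose(skip_block(x), free_block(x)))  # True: skip absorbed
```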
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption: Activation is homogeneous of degree k ≠ 1
- domain assumption: Gate function is differentiable at the origin with g(0) = 0
- domain assumption: Weight matrices are drawn from a continuous distribution
Reference graph
Works this paper leans on
- [1] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- [2] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [3] Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
- [4] Nils Graef. KV-weights are all you need for skipless transformers. arXiv preprint arXiv:2404.12362.
- [5] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:2310.06825.
- [6] Marko Karbevski and Antonij Mijoski. Key and value weights are probably all you need: On the necessity of the query, key, value weight triplet in transformers. arXiv preprint arXiv:2510.23912.
- [7] NVIDIA. Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848.
- [8] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [9] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- [10] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- [11] Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, and Wenfeng Liang. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.
- [12] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections. arXiv preprint arXiv:2409.19606.