Can an MLP Absorb Its Own Skip Connection?
Pith reviewed 2026-05-08 06:34 UTC · model grok-4.3
The pith
For generic weights, a skip connection around a single-hidden-layer MLP cannot be absorbed into a residual-free network of the same width.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set S of size at least d such that W_down[:,S] W_up[S,:] = -I_d. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. For homogeneous activations of degree k ≠ 1 and for gated activations with g(0)=0, absorption is unconditionally impossible, and these results extend to arbitrary depth.
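To pin down the notation (which the referee below also asks to see defined at first use), here is the single-block setup as it can be reconstructed from the abstract; the paper's exact conventions may differ slightly. With $x \in \mathbb{R}^d$, $W_{\mathrm{up}}, W'_{\mathrm{up}} \in \mathbb{R}^{m \times d}$, and $W_{\mathrm{down}}, W'_{\mathrm{down}} \in \mathbb{R}^{d \times m}$ for a shared hidden width $m$, absorption asks whether

$$x + W_{\mathrm{down}}\,\sigma(W_{\mathrm{up}} x) \;=\; W'_{\mathrm{down}}\,\sigma(W'_{\mathrm{up}} x) \quad \text{for all } x \in \mathbb{R}^d$$

can hold for some choice of $W'_{\mathrm{up}}, W'_{\mathrm{down}}$.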
What carries the argument
For ReLU and GELU, the existence of an index set S with W_down[:,S] W_up[S,:] = -I_d, which decides single-block absorption; for other activations, the homogeneity degree or the linearization of the gate at zero.
If this is right
- A composition of L residual blocks using homogeneous activations of degree k ≠ 1 cannot be replicated by any composition of L residual-free blocks of the same width (a numeric sketch of the underlying degree obstruction follows this list).
- For gated activations whose gate is differentiable at the origin with g(0)=0, the same depth-L impossibility holds by linearization.
- At the single-block level for ReLU or GELU, absorption occurs only on the measure-zero set of weights obeying the index-set condition.
- The question of whether disjointness persists for deep stacks of ReLU or GELU blocks is left unresolved by the paper.
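As a concrete illustration of the degree obstruction from the first bullet, the following sketch (not from the paper; the dimensions, weights, and the ReLU² choice are arbitrary) checks numerically that a single residual block with a degree-2 homogeneous activation breaks the scaling law g(λx) = λ²g(x) that every residual-free block with the same activation must obey:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 8  # input dim and hidden width (arbitrary choices)

W_up = rng.standard_normal((m, d))
W_down = rng.standard_normal((d, m))

def relu2(z):
    # ReLU^2 is positively homogeneous of degree k = 2.
    return np.maximum(z, 0.0) ** 2

def residual_block(x):
    return x + W_down @ relu2(W_up @ x)

x = rng.standard_normal(d)
lam, k = 2.0, 2

# Any residual-free block g(x) = W'_down relu2(W'_up x) satisfies
# g(lam * x) = lam**k * g(x) for lam > 0.  The residual block does not:
lhs = residual_block(lam * x)
rhs = lam**k * residual_block(x)

# The mismatch is exactly (lam - lam**k) * x, contributed by the skip branch,
# which scales linearly while the MLP branch scales with degree k.
print(np.allclose(lhs - rhs, (lam - lam**k) * x))  # True
print(np.allclose(lhs, rhs))                       # False for x != 0
```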
Where Pith is reading between the lines
- If the single-block disjointness for generic ReLU and GELU extends to depth, then residual connections supply an irreducible form of expressivity that cannot be recovered merely by increasing width.
- Architectural searches that replace explicit skips with learned internal transformations may therefore lose coverage of certain function families.
- The open deep-composition question could be probed by constructing explicit low-dimensional counterexamples that match a skip block but cannot be matched by any skip-free stack of equal width.
Load-bearing premise
The weight matrices are generic, drawn from a continuous distribution so that the specific index-set equality fails with probability one.
What would settle it
Sample the weights of a skip-connected MLP from a continuous distribution and check whether any index set S of size at least d satisfies the exact matrix equality W_down[:,S] W_up[S,:] = -I_d; if no such S exists for almost all draws, the two classes are generically disjoint.
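A minimal sketch of that experiment, assuming nothing beyond NumPy; dimensions are kept small so that enumerating every index set of size at least d stays cheap, and "exact equality" is read as agreement up to floating-point tolerance:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 6          # small dims so subset enumeration stays tractable
trials = 1000
tol = 1e-9           # 'exact' equality up to floating-point noise

hits = 0
for _ in range(trials):
    # Continuous (Gaussian) draws stand in for 'generic' weights.
    W_up = rng.standard_normal((m, d))
    W_down = rng.standard_normal((d, m))
    found = any(
        np.allclose(W_down[:, S] @ W_up[S, :], -np.eye(d), atol=tol)
        for size in range(d, m + 1)
        for S in map(list, itertools.combinations(range(m), size))
    )
    hits += found

# Under continuous distributions the condition should essentially never
# hold, matching the paper's probability-one failure claim.
print(f"condition satisfied in {hits}/{trials} random draws")  # expect 0
```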
Original abstract
We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree $k \neq 1$, such as ReLU$^2$ and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with $g(0) = 0$, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of $L$ residual blocks using such activations cannot be replicated by any composition of $L$ residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set $S$ of size at least $d$ such that $W_{\mathrm{down}}[:,S]\,W_{\mathrm{up}}[S,:] = -I_d$. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. For homogeneous activations of degree k ≠ 1 (e.g., ReLU²) and gated activations with g(0)=0 (e.g., SwiGLU), it proves unconditional impossibility via degree and linearization arguments, with extensions to arbitrary depth. For ReLU and GELU, it gives an if-and-only-if algebraic condition: absorption holds precisely when there exists an index set S of size at least d such that W_down[:,S] W_up[S,:] = -I_d. This condition fails with probability one for generic (continuous-distribution) weights, implying that skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. The deep-composition case for ReLU/GELU is left explicitly open.
Significance. If the results hold, the work supplies a clean algebraic and analytic separation between residual and non-residual function classes, which bears directly on representational capacity questions in deep learning theory. The probability-one generic-disjointness statement, the unconditional impossibility results for other activations, and the explicit open-problem flag for deep ReLU/GELU stacks are all strengths; the derivations rest on standard linear algebra and activation properties rather than fitted quantities.
minor comments (3)
- The abstract and introduction should define the notation W_down and W_up at first use rather than assuming familiarity with the single-block diagram.
- In the ReLU/GELU section, the statement that the index-set condition 'fails with probability one' would benefit from a one-sentence reminder of the measure-theoretic setting (continuous distributions on the weight space) to avoid ambiguity for readers less versed in measure theory.
- The extension of the impossibility results to arbitrary depth is stated clearly, but a brief remark on whether the same width constraint is preserved across all L blocks would remove a minor source of potential confusion.
Simulated Author's Rebuttal
We thank the referee for the careful and positive evaluation of the manuscript, including the clear summary of our results and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's central results follow from direct algebraic manipulations and analytic arguments on the MLP equations. The reduction of general invertible skips to the identity case uses only matrix invertibility. Impossibility for homogeneous activations of degree k≠1 is obtained by comparing polynomial degrees on both sides. The linearization argument for gated activations with g(0)=0 likewise compares the first-order behavior at the origin. For ReLU/GELU the iff characterization is obtained by solving the functional equation W_down σ(W_up x) + x = W'_down σ(W'_up x) and isolating the necessary index-set condition on the weights; this condition is then shown to have measure zero under continuous distributions by standard facts of linear algebra over R. No step equates a derived quantity to a fitted input, renames a known result, or relies on a self-citation whose content is itself unverified. The open question for deep ReLU/GELU stacks is stated explicitly rather than assumed away.
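To make the sufficiency direction of the iff tangible, here is a sketch that manufactures weights satisfying the index-set condition and then exhibits a same-width residual-free ReLU block matching the skip block. The construction (flip the sign of the S-rows of W_up, reusing the identity x − ReLU(x) = −ReLU(−x)) is mine, reconstructed from the stated condition; the paper's own proof may proceed differently:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 6
S = list(range(d))  # index set of size exactly d; rows d..m-1 are its complement

# Force the index-set condition: W_down[:,S] @ W_up[S,:] = -I_d.
A = rng.standard_normal((d, d))  # generic, hence invertible w.p. 1
W_up = np.vstack([A, rng.standard_normal((m - d, d))])
W_down = np.hstack([-np.linalg.inv(A), rng.standard_normal((d, m - d))])

def relu(z):
    return np.maximum(z, 0.0)

def skip_block(x):
    return x + W_down @ relu(W_up @ x)

# Candidate residual-free block of the same width: flip the sign of the
# S-rows of W_up and keep W_down.  This works because
#   x - A^{-1} relu(A x) = -A^{-1} relu(-A x),  from x - relu(x) = -relu(-x),
# while the complementary hidden units are untouched.
W_up2 = W_up.copy()
W_up2[S, :] *= -1.0

def free_block(x):
    return W_down @ relu(W_up2 @ x)

x = rng.standard_normal(d)
print(np.allclose(skip_block(x), free_block(x)))  # True: skip absorbed
```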
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption: Activation is homogeneous of degree k ≠ 1
- domain assumption: Gate function is differentiable at the origin with g(0) = 0
- domain assumption: Weight matrices are drawn from a continuous distribution
Reference graph
Works this paper leans on
- [1] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- [2] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [3] Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
- [4] Nils Graef. KV-weights are all you need for skipless transformers. arXiv preprint arXiv:2404.12362.
- [5] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:2310.06825.
- [6] Marko Karbevski and Antonij Mijoski. Key and value weights are probably all you need: On the necessity of the query, key, value weight triplet in transformers. arXiv preprint arXiv:2510.23912.
- [7] NVIDIA. Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848.
- [8] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [9] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- [10] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- [11] Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, and Wenfeng Liang. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.
- [12] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections. arXiv preprint arXiv:2409.19606.