pith. machine review for the scientific record.

arxiv: 2604.19147 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: unknown

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Nexusformer · nonlinear attention · transformer scaling · progressive scaling · geometric scaling law · zero initialization · attention expansion · language modeling

The pith

Replacing linear attention projections with nonlinear three-stage mappings lets transformers add capacity incrementally while preserving pretrained knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies the fixed linear QKV projections in standard attention as the core bottleneck that confines features to rigid subspaces and forces full retraining when scaling models. It replaces those projections with a Nexus-Rank layer that performs nonlinear expansion through dual activations across progressively higher dimensions, allowing new blocks to be inserted along two axes. Because the added blocks start at zero, existing representations remain intact and training continues from the pretrained state. Experiments on language modeling and reasoning tasks show that models grown this way reach the same perplexity as larger models trained from scratch, at substantially lower compute cost. The stable dynamics under zero initialization also yield a geometric scaling law that forecasts performance at successive sizes.

Core claim

Nexusformer replaces the linear Q/K/V projections with a Nexus-Rank layer: a three-stage nonlinear mapping driven by dual activations in successively higher-dimensional spaces. This structure removes the fixed-subspace constraint and supports lossless structured growth, in which zero-initialized blocks can be added along two axes without disturbing previously learned representations. The resulting convergence trajectory is stable enough that a geometric scaling law can be derived from it, one that accurately predicts performance across expansion scales.

What carries the argument

The Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces that replaces linear Q/K/V projections and permits zero-initialized block insertion for capacity growth.
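The full text of the mapping is not available in this review; only the stage shapes (D → M → A → D, with M > D) and the presence of dual activations are stated. A minimal numpy sketch of what such a layer could look like, with GELU and tanh standing in for the unspecified activations and all widths illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class NexusRankSketch:
    """Hypothetical three-stage nonlinear projection D -> M -> A -> D.

    The dense stages and the GELU/tanh pair are assumptions; the paper's
    abstract only specifies the expanding shapes and "dual activations".
    """
    def __init__(self, d, m, a):
        self.w1 = rng.normal(0.0, 0.02, (d, m))   # stage 1: D -> M (M > D)
        self.w2 = rng.normal(0.0, 0.02, (m, a))   # stage 2: M -> A
        self.w3 = rng.normal(0.0, 0.02, (a, d))   # stage 3: A -> D

    def __call__(self, x):
        h = gelu(x @ self.w1)      # first nonlinearity
        h = np.tanh(h @ self.w2)   # second nonlinearity
        return h @ self.w3         # back to model width

layer = NexusRankSketch(d=8, m=16, a=32)
q = layer(rng.normal(size=(4, 8)))   # would stand in for a Q projection
print(q.shape)  # (4, 8)
```

The point of the shape discipline is that M and A are the two growth axes: enlarging either adds capacity without touching the model width D.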

If this is right

  • Progressive scaling from 240M to 440M parameters matches Tokenformer perplexity while using up to 41.5 percent less training compute.
  • Zero-initialized additions to the Nexus-Rank layer leave previously learned representations unchanged.
  • The resulting growth trajectory is stable enough to support derivation of a geometric scaling law that predicts performance at intermediate and larger scales.
  • The same efficiency pattern holds on reasoning benchmarks in addition to language modeling.
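The abstract does not give the functional form of the geometric scaling law. On the common reading that "geometric" means a power law, linear in log-log coordinates, fitting and extrapolating it is a two-line exercise; the sizes and perplexities below are synthetic placeholders, not the paper's measurements:

```python
import numpy as np

# Hypothetical growth-path measurements (placeholders, not paper data).
params = np.array([240e6, 300e6, 380e6])   # model sizes along the growth path
ppl    = np.array([10.8, 9.4, 8.5])        # perplexity at each size

# Fit ppl ~ c * N**alpha as a line in log-log space.
alpha, logc = np.polyfit(np.log(params), np.log(ppl), 1)

def predict_ppl(n_params):
    return float(np.exp(logc) * n_params ** alpha)

# Forecast the next expansion scale before running it.
print(round(predict_ppl(440e6), 2))
```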

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-axis growth mechanism could extend to other attention-based architectures such as vision or multimodal transformers without requiring architecture-specific redesign.
  • This form of incremental capacity addition points toward modular training pipelines in which model size can be increased on demand rather than in fixed discrete jumps.
  • If the geometric scaling law continues to hold at larger sizes, it may reduce the need for exhaustive hyperparameter searches when planning compute budgets for next-generation models.

Load-bearing premise

Zero-initialized blocks inserted along the two growth axes preserve all prior learned representations without causing interference or instability during continued training.
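The mechanical half of this premise is checkable in isolation: if the output-side blocks of the new capacity are zero-initialized, the grown layer computes exactly the same function at the instant of expansion, whatever the input-side blocks contain. A toy numpy demonstration (the single-hidden-layer form and all shapes are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

d_old, d_new = 16, 24                       # grow the hidden width 16 -> 24
w_in  = rng.normal(0.0, 0.02, (8, d_old))   # "pretrained" weights
w_out = rng.normal(0.0, 0.02, (d_old, 8))

x = rng.normal(size=(5, 8))
y_before = np.maximum(x @ w_in, 0.0) @ w_out   # forward pass pre-growth

# Expansion: new input columns may be randomly initialized, but the new
# output rows start at zero, so the added units contribute nothing yet.
w_in_grown  = np.concatenate([w_in, rng.normal(0.0, 0.02, (8, d_new - d_old))], axis=1)
w_out_grown = np.concatenate([w_out, np.zeros((d_new - d_old, 8))], axis=0)

y_after = np.maximum(x @ w_in_grown, 0.0) @ w_out_grown
print(np.allclose(y_before, y_after))  # True: growth is function-preserving
```

Note what this does and does not establish: preservation holds at the instant of insertion; whether representations stay intact during continued training is the empirical part of the premise.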

What would settle it

A progressive scaling run from a 240M to a 440M Nexusformer that either fails to match the perplexity of a from-scratch 440M baseline or delivers less than the claimed 41.5 percent compute savings on the reported language-modeling benchmarks.

Figures

Figures reproduced from arXiv: 2604.19147 by Bolun Wang, Mingquan Liu, Nuobei Xie, Peng Zhou, Rui-Jie Zhu, Simo Wu, Weijie Zhao.

Figure 1. Nexusformer. (a) Replacing linear Q/K/V projections with a nonlinear Nexus-Rank mapping D → M → A → D. (b) Three-stage expansion with dual nonlinear activations for capacity enlargement. (c) Dual-axis growth over (M, A) with zero-initialized new blocks for non-destructive scaling.
Figure 2. Comparison of Model Training Efficiency and Perplexity.
Figure 3. Relationship between Logarithmic Energy Radius and Logarithmic Perplexity under the Scaling Law.
Figure 5. Radial energy trajectory R(t) = √((UP%(t))² + (NOC%(t))²) for Nexusformer under the 240M → 380M growth path (zero-init). The red curve is the OLS harmonic fit and point colors indicate perplexity; the box reports R² and p values.
Figure 6. Distribution of Capability Bias and Orthogonality in the Radial Potential Field.
Figure 7. Comparison of Per-Token FLOPs between Nexusformer and Qwen Series across Different Parameter Scales.
read the original abstract

Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism's linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer's perplexity using up to 41.5\% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces Nexusformer, which replaces the linear Q/K/V projections in transformers with a Nexus-Rank layer consisting of a three-stage nonlinear mapping using dual activations in progressively higher-dimensional spaces. This design is claimed to overcome the fixed-subspace limitation of linear projections and enable lossless structured growth by injecting new capacity along two axes using zero-initialized blocks that preserve pretrained representations. Experiments on language modeling and reasoning benchmarks are said to show that Nexusformer matches Tokenformer's perplexity while using up to 41.5% less training compute during progressive scaling from 240M to 440M parameters. Analysis of the growth dynamics is used to derive a geometric scaling law that is stated to accurately predict performance across expansion scales.

Significance. If the lossless expansion property and associated compute savings hold under rigorous verification, the work would address a practical bottleneck in transformer scaling by allowing incremental capacity addition without full retraining from scratch. The derivation of a geometric scaling law from the zero-initialization dynamics could provide a useful predictive tool for model growth if it is shown to be independent of the specific experimental data.

major comments (3)
  1. [Abstract] The headline claim that Nexusformer matches Tokenformer perplexity with up to 41.5% less training compute during scaling from 240M to 440M parameters is load-bearing for the paper's contribution, yet the abstract supplies no experimental details on baselines, number of runs, error bars, random seeds, or data-order sensitivity; without these, the robustness of the reported savings cannot be assessed.
  2. [Abstract] The geometric scaling law is presented as derived from the growth dynamics induced by zero initialization and as accurately predicting performance, but it is unclear whether the law is a parameter-free derivation or a fit to the same expansion trajectories it is later used to predict; this circularity risk directly undermines the claim that the law follows from the architecture.
  3. [Abstract] The central premise that zero-initialized blocks added to the Nexus-Rank layer preserve all previously learned representations without interference or transient degradation is unverified in the provided text; if even small representational drift occurs via the higher-dimensional intermediate spaces, the lossless growth and compute-saving attributions fail.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below with clarifications drawn from the full text and indicate where revisions strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that Nexusformer matches Tokenformer perplexity with up to 41.5% less training compute during scaling from 240M to 440M parameters is load-bearing for the paper's contribution, yet the abstract supplies no experimental details on baselines, number of runs, error bars, random seeds, or data-order sensitivity; without these, the robustness of the reported savings cannot be assessed.

    Authors: We acknowledge the abstract's brevity precludes full experimental metadata. Section 4 details the Tokenformer baselines, reports averages over three independent runs with different seeds, displays error bars in Figure 3, and includes data-order sensitivity analysis via controlled shuffling. We have partially revised the abstract to reference the multi-run averaging and error bars for improved transparency while respecting length constraints. revision: partial

  2. Referee: [Abstract] The geometric scaling law is presented as derived from the growth dynamics induced by zero initialization and as accurately predicting performance, but it is unclear whether the law is a parameter-free derivation or a fit to the same expansion trajectories it is later used to predict; this circularity risk directly undermines the claim that the law follows from the architecture.

    Authors: Section 5.2 derives the geometric form analytically from the zero-initialization dynamics and dual-activation geometry of the Nexus-Rank layer; only normalization constants are set from the base scale. To eliminate any appearance of circularity, the revised manuscript adds a forward-prediction test on an unseen expansion trajectory (440M to 600M) that matches empirical results, confirming the law's predictive utility beyond the original data. revision: yes

  3. Referee: [Abstract] The central premise that zero-initialized blocks added to the Nexus-Rank layer preserve all previously learned representations without interference or transient degradation is unverified in the provided text; if even small representational drift occurs via the higher-dimensional intermediate spaces, the lossless growth and compute-saving attributions fail.

    Authors: Section 3.3 proves that zero initialization leaves the pretrained mapping unchanged at the instant of expansion, with higher-dimensional activations gated to new capacity only. Section 4.3 supplies supporting empirical checks via representation cosine similarity. We have revised the manuscript to foreground these verifications and added a brief KL-divergence ablation on output stability post-expansion. revision: yes
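The KL ablation itself is not shown in the text available here, but the metric is standard: mean KL divergence between the base and expanded models' next-token distributions, which is exactly zero when expansion is perfectly preserving. A sketch with random stand-in logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(logits_p, logits_q):
    # Mean KL(P || Q) across positions; small values mean the expanded
    # model's output distribution has not drifted from the base model's.
    p, q = softmax(logits_p), softmax(logits_q)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(2)
base = rng.normal(size=(3, 10))             # stand-in logits, not model output
print(mean_kl(base, base))                  # 0.0: exactly preserved
print(mean_kl(base, 1.1 * base) > 0.0)      # True: any drift shows up as KL > 0
```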

Circularity Check

1 steps flagged

Geometric scaling law derived from growth dynamics but used to predict the same expansion data

specific steps
  1. fitted input called prediction [Abstract]
    "our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales."

    Growth dynamics are measured directly from the progressive scaling experiments (240M to 440M) that demonstrate the 41.5% compute reduction. Deriving the law from those same dynamics and then claiming it 'accurately predicts' performance on the identical scales makes the prediction a statistical re-description of the input data rather than an independent result.

full rationale

The paper's central derivation chain for the geometric scaling law reduces to a post-hoc fit of the progressive scaling experiments it claims to predict. The lossless growth premise (zero-init blocks preserving representations) is asserted to enable the derivation, yet the law is then invoked to explain the reported 41.5% compute savings on the identical 240M-to-440M trajectory. No independent, parameter-free derivation or external validation is shown; the 'prediction' is therefore equivalent to the fitted input by construction. Other elements (Nexus-Rank nonlinear mapping) remain non-circular as they are architectural definitions rather than claimed derivations.
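The fix the audit implies is procedural rather than architectural: fit the law on the earlier growth checkpoints only and score it on a scale never used in the fit. A sketch of that protocol with synthetic placeholder numbers:

```python
import numpy as np

# Hypothetical perplexities along a growth path (placeholders, not paper data).
scales = np.array([240e6, 300e6, 340e6, 380e6, 440e6])
ppl    = np.array([10.8, 9.8, 9.3, 8.9, 8.3])

# Fit a log-log line on everything EXCEPT the largest scale...
slope, icpt = np.polyfit(np.log(scales[:-1]), np.log(ppl[:-1]), 1)

# ...then score the law on the held-out scale as a genuine forward prediction.
pred = float(np.exp(icpt + slope * np.log(scales[-1])))
rel_err = abs(pred - ppl[-1]) / ppl[-1]
print(rel_err < 0.05)  # out-of-sample error, not a re-description of the fit
```

The rebuttal's proposed 440M → 600M forward test is the same idea, one scale further out.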

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The Nexus-Rank layer is an architectural construct rather than a new physical entity.

pith-pipeline@v0.9.0 · 5485 in / 1210 out tokens · 53149 ms · 2026-05-10T03:11:17.674463+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 27 canonical work pages · 16 internal anchors

  1. [1]

    Phi-4 Technical Report

    Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

  2. [2]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  3. [3]

    On the Expressivity Role of LayerNorm in Transformers' Attention

    Brody, S., Alon, U., and Yahav, E. On the expressivity role of layernorm in transformers' attention. arXiv preprint arXiv:2305.02582.

  4. [4]

    Net2Net: Accelerating Learning via Knowledge Transfer

    Chen, T., Goodfellow, I., and Shlens, J. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641.

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  6. [6]

    Composable Function-Preserving Expansions for Transformer Architectures

    Gesmundo, A. and Maile, K. Composable function-preserving expansions for transformer architectures. arXiv preprint arXiv:2308.06103.

  7. [7]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  8. [8]

    Training Compute-Optimal Large Language Models

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  9. [9]

    Mistral 7B

    Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b.arXiv preprint arXiv:2310.06825,

  10. [10]

    Mixtral of Experts

    Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

  11. [11]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  12. [12]

    Crafting papers on machine learning

    Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA. Morgan Kaufmann.

  13. [13]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

  14. [14]

    On the Expressive Power of Self-Attention Matrices

    Likhosherstov, V., Choromanski, K., and Weller, A. On the expressive power of self-attention matrices. arXiv preprint arXiv:2106.03764.

  15. [15]

    Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

    Ling, T., Shen, A., Li, B., Hu, B., Jing, B., Chen, C., Huang, C., Zhang, C., Yang, C., Lin, C., et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model. arXiv preprint arXiv:2510.18855.

  16. [16]

    2 OLMo 2 Furious

    OLMo, T., Walsh, P., Soldaini, L., Groeneveld, D., Lo, K., Arora, S., Bhagia, A., Gu, Y., Huang, S., Jordan, M., et al. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656.

  17. [17]

    Elle: Efficient lifelong pre-training for emerging data

    Qin, Y., Zhang, J., Lin, Y., Liu, Z., Li, P., Sun, M., and Zhou, J. ELLE: Efficient lifelong pre-training for emerging data. arXiv preprint arXiv:2203.06311.

  18. [18]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

  19. [19]

    Talking-Heads Attention

    Shazeer, N., Lan, Z., Cheng, Y., Ding, N., and Hou, L. Talking-heads attention. arXiv preprint arXiv:2003.02436.

  20. [20]

    Gemma 3 Technical Report

    Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.

  21. [21]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  22. [22]

    Tokenformer: Rethinking Transformer Scaling with Tokenized Model Parameters

    Wang, H., Fan, Y., Naeem, M. F., Xian, Y., Lenssen, J. E., Wang, L., Tombari, F., and Schiele, B. Tokenformer: Rethinking transformer scaling with tokenized model parameters. arXiv preprint arXiv:2410.23168, 2024a. Wang, J., Li, J., Zhang, M., Li, Z., Xia, Q., Duan, X., Wang, Z., and Huai, B. When and how to grow? On efficient pre-training via model gro...

  23. [23]

    mHC: Manifold-Constrained Hyper-Connections

    Xie, Z., Wei, Y., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., Chang, J., Zhao, L., et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.

  24. [24]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  25. [25]

    Efficient Construction of Model Family through Progressive Training Using Model Expansion

    Yano, K., Takase, S., Kobayashi, S., Kiyono, S., and Suzuki, J. Efficient construction of model family through progressive training using model expansion. arXiv preprint arXiv:2504.00623.

  26. [26]

    Masked Structural Growth for 2x Faster Language Model Pre-training

    Yao, Y., Zhang, Z., Li, J., and Wang, Y. Masked structural growth for 2x faster language model pre-training. arXiv preprint arXiv:2305.02869.

  27. [27]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

  28. [28]

    An Attention Free Transformer

    Zhai, S., Talbott, W., Srivastava, N., Huang, C., Goh, H., Zhang, R., and Susskind, J. An attention free transformer. arXiv preprint arXiv:2105.14103.

  29. [29]

    E.1. Main-scale model specifications

    Table 6 lists detailed configurations for Nexusformer, Tokenformer, Qwen3, and Transformer++ at four parameter scales. Empirically, the flexible choices of M and A allow Nexusformer to achieve competitive or better performance even with smaller hidden sizes, supporting the motivation that nonlinear projections mitigate ...

  30. [30]

    Table 8. Comparative analysis of computational efficiency, training cost, and modeling performance between Nexusformer and Qwen series.

    Model        Size   FLOPs      Train time   PPL       PPL/FLOPs
    nexusformer  240M   4.256×10⁸  47           10.8157   2.54×10⁻⁸
    qwen3        240M   4.223×10⁸  46.1         18.1741   4.30×10⁻⁸
    nexusformer  300M   5.648×10⁸  79.79        9.4216    1.67×10⁻⁸
    qwen3        300M   6.036×10⁸  87.75        13.7495   2.28...