SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Dongchen Han; Erchao Zhao; Gao Huang; Guanjun Jiang; Haofeng Huang; Mengyu Zhou; Ming Chen; Tianyu Li; Xiaoxi Jiang; Zixuan Cao

arxiv: 2602.08064 · v2 · pith:ZGLDJR3Nnew · submitted 2026-02-08 · 💻 cs.LG · cs.AI · cs.CL

Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

Pith reviewed 2026-05-22 11:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords SiameseNorm, Pre-Norm, Post-Norm, Transformer, normalization, training stability, residual blocks, architecture design

The pith

SiameseNorm uses a two-stream design with shared residual blocks to combine Pre-Norm stability and Post-Norm capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers face a persistent trade-off where Pre-Norm ensures stable training through identity gradient paths but restricts how much the residual can be transformed, while Post-Norm allows stronger transformations at the risk of unstable gradients. Single-stream attempts to blend them have not held up across different training conditions. SiameseNorm introduces a two-stream architecture in which a Pre-Norm-like stream and a Post-Norm-like stream share the same residual blocks, so each block gets training signals from both styles at the same time. This design adds almost no extra cost and works with existing Pre-Norm training methods. Tests on language models, mixture-of-experts systems, vision transformers, and diffusion models show better results without losing stability, suggesting the approach could help build more capable models more reliably.
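
The trade-off described above can be made concrete with the two standard update rules, Pre-Norm x + F(LN(x)) and Post-Norm LN(x + F(x)). The following is a minimal sketch in plain Python, not the paper's code: `toy_block` is a hypothetical stand-in for an attention or FFN sublayer, and the layer norm has no learned scale or shift. It only illustrates that the Pre-Norm residual stream is left unnormalized, while the Post-Norm main path is renormalized at every layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_block(x):
    """Hypothetical stand-in for an attention or FFN sublayer."""
    return [math.tanh(0.5 * v) + 0.1 for v in x]

def pre_norm_step(x, block=toy_block):
    # Pre-Norm: x_next = x + F(LN(x)); the residual x passes through unchanged.
    f = block(layer_norm(x))
    return [a + b for a, b in zip(x, f)]

def post_norm_step(x, block=toy_block):
    # Post-Norm: x_next = LN(x + F(x)); the main path itself is normalized.
    f = block(x)
    return layer_norm([a + b for a, b in zip(x, f)])

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

x0 = [1.0, -2.0, 0.5, 3.0]
x_pre, x_post = list(x0), list(x0)
for _ in range(8):
    x_pre = pre_norm_step(x_pre)
    x_post = post_norm_step(x_post)

print("Pre-Norm stream RMS: ", rms(x_pre))   # not constrained by any norm
print("Post-Norm stream RMS:", rms(x_post))  # held near 1 by the final LN
```

In this toy setting the Pre-Norm stream's magnitude drifts upward with depth because nothing rescales it, whereas the Post-Norm stream stays at unit scale. The sketch says nothing about gradients, which is where the stability argument actually lives.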

Core claim

The long-standing tension between Pre- and Post-Norm reflects a fundamental trade-off between training stability and representational capacity. Single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. SiameseNorm addresses this by proposing a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities.

What carries the argument

SiameseNorm's two-stream architecture with shared residual blocks that supplies optimization signals from both Pre-Norm-like and Post-Norm-like pathways.
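
One way to picture the coupling is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the page only says that both streams pass through the same residual blocks, so the way the two streams are merged at the block input (here `LN(x_pre) + x_post`), the `SharedBlock` class, and the function name `two_stream_step` are all hypothetical. The released code is the authority on the real topology.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

class SharedBlock:
    """Hypothetical shared sublayer; counts calls to show it runs once per layer."""
    def __init__(self, scale):
        self.scale = scale
        self.calls = 0

    def __call__(self, h):
        self.calls += 1
        return [math.tanh(self.scale * v) for v in h]

def two_stream_step(x_pre, x_post, block):
    # ASSUMPTION: the block input merges both streams as LN(x_pre) + x_post.
    h = add(layer_norm(x_pre), x_post)
    f = block(h)                           # one evaluation of the shared block
    new_pre = add(x_pre, f)                # Pre-Norm-like: identity residual path
    new_post = layer_norm(add(x_post, f))  # Post-Norm-like: normalized main path
    return new_pre, new_post

x = [1.0, -2.0, 0.5, 3.0]
x_pre, x_post = list(x), layer_norm(x)
blocks = [SharedBlock(0.3 + 0.1 * i) for i in range(4)]
for blk in blocks:
    x_pre, x_post = two_stream_step(x_pre, x_post, blk)

print(x_pre, x_post, [b.calls for b in blocks])
```

Because the block output `f` feeds both streams, a backward pass would send gradient into the block's weights along both the identity path and the normalized path. Each block is evaluated once per layer, so the second stream adds element-wise work rather than a second block evaluation.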

Load-bearing premise

That a two-stream design with shared residual blocks can deliver optimization signals from both Pre-Norm-like and Post-Norm-like pathways without introducing new instabilities or conflicts that offset the reported gains.

What would settle it

A training run on a 1.3B language model in which SiameseNorm shows no performance gain over standard Pre-Norm, or is less stable than it, would disprove the claim.

Original abstract

The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SiameseNorm's two-stream shared-residual design tries to split the pre/post-norm difference but leaves open whether the streams actually reinforce or just average each other out.

SiameseNorm runs two parallel streams through the same residual blocks, one behaving like pre-norm for stable identity gradients and the other like post-norm for normalized main-path updates. The claim is that this coupling gives both stability and capacity with almost no extra cost and works inside existing training setups.

The experiments span 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers, which is a reasonable spread for an architecture note. That breadth is the clearest strength; it shows the pattern is not limited to one domain or scale. The code release also helps anyone who wants to check the implementation directly.

The main soft spot is exactly the one the stress-test note flags. Because the blocks are shared, the pre-norm path tends to deliver large, un-attenuated gradients while the post-norm path normalizes and can shrink them. Nothing in the abstract or the reported results demonstrates that these two signals add constructively rather than partially cancel. Without gradient-flow measurements, layer-wise ablation, or a controlled comparison that holds optimizer and schedule fixed, the “consistent improvements and strong stability” could still be tied to the particular training recipe rather than the architecture itself.

The paper is aimed at people who train large transformers and are already tweaking residuals. A reader who needs a drop-in pattern to try on their own stack will get something concrete to test. A reader looking for a closed-form reconciliation or formal proof of gradient compatibility will not find it here.

I would send it to peer review. The empirical scope is wide enough that referees can usefully check the ablations and training details, even if the central mechanism still needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SiameseNorm, a two-stream Transformer architecture that couples Pre-Norm-like and Post-Norm-like streams via shared residual blocks. This design is intended to reconcile the stability of Pre-Norm's identity-gradient path with the representational benefits of Post-Norm's normalized main path. The authors report that the approach maintains compatibility with standard Pre-Norm training recipes and delivers consistent performance gains with strong stability on dense language models (400M and 1.3B), 15B MoE models, Vision Transformers, and Diffusion Transformers.

Significance. If the central claim holds, SiameseNorm would provide a practical, low-overhead architectural fix for a persistent tension in Transformer design, with potential impact on large-scale training across modalities. Strengths include the scale of experiments (up to 15B parameters), coverage of multiple architectures and modalities, and public code release. These elements support practical significance beyond incremental empirical tuning.

major comments (2)

[Section 3] Section 3 (Architecture description): The coupling of streams through identical residual blocks is presented as delivering compatible optimization signals, yet no analysis of gradient magnitudes, directions, or potential averaging effects is provided. This leaves open the possibility that the stable identity path and attenuated normalized path produce conflicting updates on shared weights, which is load-bearing for the reconciliation claim and the reported stability.
[Section 4] Section 4 (Experiments): Results claim consistent improvements and strong stability across scales and modalities, but lack ablations isolating the contribution of each stream or testing under varied optimizers and initializations. Without these, it remains possible that observed gains depend on the specific training recipe rather than the two-stream structure itself.
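
The gradient analysis requested in the first major comment could start from a diagnostic like the one sketched here. It is a hypothetical helper, not something taken from the paper: given the two gradient contributions that reach the same shared weights (one via each stream), it reports their cosine similarity, their relative magnitude, and how much of the combined magnitude survives when they are summed. The example vectors are made up purely for illustration, and the function assumes neither gradient is all zeros.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def gradient_agreement(g_pre, g_post):
    """Compare two gradient contributions to the same shared weights."""
    cosine = dot(g_pre, g_post) / (norm(g_pre) * norm(g_post))
    total = [u + v for u, v in zip(g_pre, g_post)]
    return {
        "cosine": cosine,                          # +1 aligned, 0 orthogonal, -1 opposed
        "norm_ratio": norm(g_post) / norm(g_pre),  # relative magnitude of the two signals
        # Fraction of the available magnitude surviving the sum (1.0 means no cancellation).
        "retained": norm(total) / (norm(g_pre) + norm(g_post)),
    }

# Hypothetical gradients, made up for illustration only.
aligned = gradient_agreement([3.0, 4.0], [0.3, 0.4])
opposed = gradient_agreement([3.0, 4.0], [-1.5, -2.0])
print(aligned)
print(opposed)
```

Tracked per layer over training, a cosine near zero or below would point to the averaging or cancellation the referee worries about, while a very small norm ratio would suggest one stream dominates the update.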

minor comments (2)

[Abstract] The abstract states 'negligible overhead' without a concrete comparison of parameter count or FLOPs relative to a standard single-stream baseline.
[Figures] Figure captions and method diagrams would benefit from explicit labels distinguishing the Pre-Norm-like and Post-Norm-like pathways to improve readability.
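
A back-of-the-envelope parameter comparison of the kind the first minor comment asks for might look like the sketch below. The hidden size 2048 and FFN size 8192 are the OLMo-1.3B settings quoted in the extracted Table 4 on this page. Everything else is an assumption of the sketch: the number of extra normalization layers per Transformer layer is a hypothetical placeholder, the FFN is counted with two weight matrices by default, and biases are ignored. It covers parameters only, not FLOPs or activation memory for the second stream.

```python
def block_params(d_model, d_ff, ffn_matrices=2):
    """Weight count of one Transformer layer, ignoring biases and norms."""
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    ffn = ffn_matrices * d_model * d_ff  # 2 for a plain FFN, 3 for a gated one
    return attention + ffn

def norm_params(d_model, has_shift=True):
    """A LayerNorm holds a scale vector and, optionally, a shift vector."""
    return d_model * (2 if has_shift else 1)

def relative_overhead(d_model, d_ff, extra_norms_per_layer, ffn_matrices=2):
    extra = extra_norms_per_layer * norm_params(d_model)
    return extra / block_params(d_model, d_ff, ffn_matrices)

# d_model and d_ff follow the OLMo-1.3B settings quoted on this page;
# extra_norms_per_layer=2 is a HYPOTHETICAL count, not taken from the paper.
ratio = relative_overhead(d_model=2048, d_ff=8192, extra_norms_per_layer=2)
print(f"extra parameters per layer: {ratio:.6%}")
```

Under these assumptions the added normalization parameters are a tiny fraction of a layer's weights, which is consistent with, but does not verify, the abstract's "negligible overhead" wording.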

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the analysis and experimental validation without altering the core claims.

Referee: [Section 3] Section 3 (Architecture description): The coupling of streams through identical residual blocks is presented as delivering compatible optimization signals, yet no analysis of gradient magnitudes, directions, or potential averaging effects is provided. This leaves open the possibility that the stable identity path and attenuated normalized path produce conflicting updates on shared weights, which is load-bearing for the reconciliation claim and the reported stability.

Authors: We agree that explicit analysis of gradient flow would better substantiate the compatibility of optimization signals. Although the consistent stability observed across 400M–15B models and multiple modalities provides indirect evidence against severe conflicts, we will add a dedicated subsection in the revised Section 3. This will include quantitative comparisons of gradient magnitudes and directional alignment for the shared residual blocks under both streams, along with a brief discussion of any averaging effects.
revision: yes

Referee: [Section 4] Section 4 (Experiments): Results claim consistent improvements and strong stability across scales and modalities, but lack ablations isolating the contribution of each stream or testing under varied optimizers and initializations. Without these, it remains possible that observed gains depend on the specific training recipe rather than the two-stream structure itself.

Authors: We acknowledge the value of these ablations for isolating the architectural contribution. In the revised manuscript we will expand Section 4 (and supplementary material) with (i) controlled ablations that disable one stream at a time while keeping the other fixed, and (ii) additional runs using alternative optimizers and varied initialization schemes. These results will be reported at the same scales to demonstrate that performance gains are attributable to the two-stream coupling rather than the specific training recipe.
revision: yes

Circularity Check

0 steps flagged

No circularity: new two-stream architecture validated empirically

The paper introduces SiameseNorm as a structural proposal: a two-stream design with shared residual blocks that supplies optimization signals from both Pre-Norm-like and Post-Norm-like pathways. The central argument rests on identifying a tension in single-stream architectures and then defining the new coupling mechanism, followed by direct experimental validation on 400M–15B models, ViTs, and diffusion models. No equations, fitted parameters, or self-citations are used to derive performance claims; the reported stability and gains are presented as outcomes of the architecture itself rather than reductions to prior inputs or definitions. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven premise that dual streams with shared blocks transmit complementary optimization signals without new failure modes; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Single-stream architectures cannot simultaneously achieve Pre-Norm gradient stability and Post-Norm main-path normalization.
Stated directly in the abstract as the structural tension motivating the two-stream design.

invented entities (1)

SiameseNorm two-stream architecture (no independent evidence)
purpose: To reconcile Pre-Norm and Post-Norm benefits through shared residual blocks
New architectural construct introduced by the paper; independent evidence is limited to the reported experiments.

pith-pipeline@v0.9.0 · 5752 in / 1203 out tokens · 41082 ms · 2026-05-22T11:07:58.955862+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Paper passage: SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Cross-Layer Information Routing in Diffusion Transformers
cs.CV 2026-05 conditional novelty 6.0

DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 5.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

Attention Residuals
cs.CL 2026-03 unverdicted novelty 5.0

Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter mo...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 3 Pith papers · 17 internal anchors

[1] Layer Normalization
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450.

[2] Longformer: The Long-Document Transformer
Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

[3] Language Models are Few-Shot Learners
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

[4] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585.

[5] Generating Long Sequences with Sparse Transformers
URL https://arxiv.org/abs/1904.10509.
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

[6] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[7] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[8] Transformer Feed-Forward Layers Are Key-Value Memories
Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495.

[9] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[10] Step by Step Network
Han, D., Ye, T., Xia, Z., Chen, K., Wang, Y., Chen, H., and Huang, G. Step by step network. arXiv preprint arXiv:2511.14329.

[11] Gaussian Error Linear Units (GELUs)
Hendrycks, D. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

[12] Query-Key Normalization for Transformers
Henry, A., Dachapally, P. R., Pawar, S. S., and Chen, Y. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253.

[13] Peri-LN: Revisiting Normalization Layer in the Transformer Architecture
Kim, J., Lee, B., Park, C., Oh, Y., Kim, B., Yoo, T., Shin, S., Han, D., Shin, J., and Yoo, K. M. Peri-LN: Revisiting normalization layer in the transformer architecture. arXiv preprint arXiv:2502.02732.

[14] Reformer: The Efficient Transformer
Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

[15] DeepSeek-V3 Technical Report
Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[16] Understanding the Difficulty of Training Transformers
Liu, L., Liu, X., Gao, J., Chen, W., and Han, J. Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763.

[17] Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391.

[18] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708.

[19] GLU Variants Improve Transformer
Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

[20] Highway Networks
Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387.

[21] The Curse of Depth in Large Language Models
Sun, W., Song, X., Li, P., Yin, L., Zheng, Y., and Liu, S. The curse of depth in large language models. arXiv preprint arXiv:2502.05795.

[22] Kimi Linear: An Expressive, Efficient Attention Architecture
Team, K., Zhang, Y., Lin, Z., Yao, X., Hu, J., Meng, F., Liu, C., Men, X., Yang, S., Li, Z., et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692.

[23] LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et a...

[24] ResiDual: Transformer with Dual Residual Connections
Xie, S., Zhang, H., Guo, J., Tan, X., Bian, J., Awadalla, H. H., Menezes, A., Qin, T., and Yan, R. ResiDual: Transformer with dual residual connections. arXiv preprint arXiv:2304.14802.

[25] mHC: Manifold-Constrained Hyper-Connections
Xie, Z., Wei, Y., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., Chang, J., Zhao, L., et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.

[26] Qwen3 Technical Report
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[27] Hyper-connections
Zhu, D., Huang, H., Huang, Z., Zeng, Y., Mao, Y., Wu, B., Min, Q., and Zhou, X. Hyper-connections. In The Thirteenth International Conference on Learning Representations, 2025a.
Zhu, D., Huang, H., Zhou, J., Huang, Z., Zeng, Y., Wu, B., Min, Q., and Zhou, X. Frac-connections: Fractional extension of hyper-connections. arXiv preprint arXiv:2503.14125, 20...

[28] Appendix A.1
A. Appendix. A.1. Comparison with Existing Multi-path Designs. [Figure 7: Architecture of ResiDual (Xie et al., 2023).] ResiDual (Xie et al., 2023). The work most structurally similar to ours is ResiDual (Xie et al., 2023), as illustrated in Fig

[29] However, a fundamental difference lies in the topology: in ResiDual, the Pre-Norm stream (Y-stream) is not connected to the input of the residual block. This implies that the Y-stream acts as a global shortcut that aggregates the output of each residual block directly toward the final output, rather than an active participant in the iterative transformation proce...

[30] It should be noted that the learning rate and the total number of training tokens vary across our different experimental setups.
Table 4 | Detailed Experimental Settings for OLMo-1.3B
Category: Model architecture
Number of Layers: 16
Hidden Size: 2048
Attention Heads: 16
Key-Value Heads: 16
FFN Intermediate Size: 8192
Activation Function: Swi...
