SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
Pith reviewed 2026-05-22 11:07 UTC · model grok-4.3
The pith
SiameseNorm uses a two-stream design with shared residual blocks to combine Pre-Norm stability and Post-Norm capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The long-standing tension between Pre- and Post-Norm reflects a fundamental trade-off between training stability and representational capacity. Single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. SiameseNorm addresses this by proposing a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities.
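For readers who want the two single-stream baselines side by side, the following is a minimal sketch of the standard Pre-Norm and Post-Norm residual updates on a toy vector. The `toy_block` function is a hypothetical stand-in for an attention or feed-forward sublayer, and the layer norm omits learned scale and bias; none of this is taken from the paper's code.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_block(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [math.tanh(0.5 * v) for v in x]

def pre_norm_step(x, f=toy_block):
    # x_{l+1} = x_l + f(LN(x_l)): the residual path is an untouched identity.
    h = f(layer_norm(x))
    return [a + b for a, b in zip(x, h)]

def post_norm_step(x, f=toy_block):
    # x_{l+1} = LN(x_l + f(x_l)): the residual path itself is normalized.
    h = f(x)
    return layer_norm([a + b for a, b in zip(x, h)])

def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

x = [1.0, -2.0, 0.5, 3.0]
pre, post = x, x
for _ in range(8):
    pre, post = pre_norm_step(pre), post_norm_step(post)

# In this toy setting the Pre-Norm stream's variance grows with depth,
# while the Post-Norm stream is pinned to roughly unit variance.
print(variance(pre), variance(post))
```

The identity path in `pre_norm_step` is what gives Pre-Norm its stable gradient route, while the outer `layer_norm` in `post_norm_step` is the main-path normalization that a single stream cannot have at the same time.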
What carries the argument
SiameseNorm's two-stream architecture with shared residual blocks that supplies optimization signals from both Pre-Norm-like and Post-Norm-like pathways.
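One way to picture the coupling is sketched below. This page does not give the paper's update equations, so the rule for combining the two streams at the block input is an assumption made purely for illustration, as are the names `shared_block` and `two_stream_step`. The only properties taken from the text are that there are two streams, that they share the same residual block, and that the block's output feeds both.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def shared_block(x, w=0.5):
    """One set of weights (here a single scalar w) serving both streams."""
    return [math.tanh(w * v) for v in x]

def two_stream_step(x_pre, x_post, f=shared_block):
    # ASSUMPTION (not from the paper): the block reads the normalized
    # Pre-Norm-like stream plus the already-normalized Post-Norm-like stream.
    block_in = [a + b for a, b in zip(layer_norm(x_pre), x_post)]
    h = f(block_in)  # the shared block is evaluated once per layer
    # Pre-Norm-like stream: identity residual path.
    new_pre = [a + b for a, b in zip(x_pre, h)]
    # Post-Norm-like stream: the residual sum is normalized.
    new_post = layer_norm([a + b for a, b in zip(x_post, h)])
    return new_pre, new_post

x0 = [1.0, -2.0, 0.5, 3.0]
x_pre, x_post = x0, layer_norm(x0)
for _ in range(4):
    x_pre, x_post = two_stream_step(x_pre, x_post)
```

Because `h` is computed once and added to both streams, a loss that depends on either stream sends gradient back through the same block weights, which is the sense in which each block could receive signals from both pathways.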
Load-bearing premise
That a two-stream design with shared residual blocks can deliver optimization signals from both Pre-Norm-like and Post-Norm-like pathways without introducing new instabilities or conflicts that offset the reported gains.
What would settle it
A training run on a 1.3B language model using SiameseNorm that shows no performance gain or reduced stability compared to standard Pre-Norm would disprove the claim.
Original abstract
The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SiameseNorm, a two-stream Transformer architecture that couples Pre-Norm-like and Post-Norm-like streams via shared residual blocks. This design is intended to reconcile the stability of Pre-Norm's identity-gradient path with the representational benefits of Post-Norm's normalized main path. The authors report that the approach maintains compatibility with standard Pre-Norm training recipes and delivers consistent performance gains with strong stability on dense language models (400M and 1.3B), 15B MoE models, Vision Transformers, and Diffusion Transformers.
Significance. If the central claim holds, SiameseNorm would provide a practical, low-overhead architectural fix for a persistent tension in Transformer design, with potential impact on large-scale training across modalities. Strengths include the scale of experiments (up to 15B parameters), coverage of multiple architectures and modalities, and public code release. These elements support practical significance beyond incremental empirical tuning.
major comments (2)
- [Section 3] Section 3 (Architecture description): The coupling of streams through identical residual blocks is presented as delivering compatible optimization signals, yet no analysis of gradient magnitudes, directions, or potential averaging effects is provided. This leaves open the possibility that the stable identity path and attenuated normalized path produce conflicting updates on shared weights, which is load-bearing for the reconciliation claim and the reported stability.
- [Section 4] Section 4 (Experiments): Results claim consistent improvements and strong stability across scales and modalities, but lack ablations isolating the contribution of each stream or testing under varied optimizers and initializations. Without these, it remains possible that observed gains depend on the specific training recipe rather than the two-stream structure itself.
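The gradient analysis requested in the first major comment could start from a diagnostic like the one sketched here: given the gradient that each pathway contributes to the same shared weights, report their cosine similarity and norm ratio. The function name and the example vectors are hypothetical; extracting per-pathway gradients from a real model is framework-specific and not shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def gradient_alignment(g_pre, g_post, eps=1e-12):
    """Return (cosine similarity, norm ratio) for two gradients of the same weights.

    cosine near +1: the pathways agree on the update direction.
    cosine near -1: the pathways push the shared weights in opposite directions.
    ratio = |g_post| / |g_pre|: how attenuated one signal is relative to the other.
    """
    n_pre, n_post = norm(g_pre), norm(g_post)
    cosine = dot(g_pre, g_post) / max(n_pre * n_post, eps)
    ratio = n_post / max(n_pre, eps)
    return cosine, ratio

# Made-up gradients for illustration only.
g_pre = [0.2, -0.1, 0.4]
g_aligned = [0.1, -0.05, 0.2]    # same direction, half the magnitude
g_conflict = [-0.2, 0.1, -0.4]   # opposite direction, same magnitude

print(gradient_alignment(g_pre, g_aligned))
print(gradient_alignment(g_pre, g_conflict))
```

Tracking these two numbers per layer over training would show whether the summed update is dominated by one pathway or partly cancelled by disagreement between them.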
minor comments (2)
- [Abstract] The abstract states 'negligible overhead' without a concrete comparison of parameter count or FLOPs relative to a standard single-stream baseline.
- [Figures] Figure captions and method diagrams would benefit from explicit labels distinguishing the Pre-Norm-like and Post-Norm-like pathways to improve readability.
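The overhead comparison asked for in the first minor comment can be framed with simple arithmetic. The sketch below uses the layer count, hidden size, and FFN size that this page lists for the OLMo-1.3B setting; everything else is an assumption made for illustration: four bias-free attention projections, a gated FFN with three weight matrices, one gain vector per norm, and one extra norm per sublayer for the second stream. The extra-norm count in particular is a placeholder, not a figure from the paper.

```python
# Values listed on this page for the OLMo-1.3B setting.
NUM_LAYERS = 16
HIDDEN = 2048
FFN = 8192

def layer_params(d, d_ff):
    attn = 4 * d * d      # ASSUMPTION: Q, K, V, O projections without biases
    ffn = 3 * d * d_ff    # ASSUMPTION: gated FFN with three weight matrices
    norms = 2 * d         # ASSUMPTION: one gain vector per sublayer norm
    return attn + ffn + norms

def extra_norm_params(d, extra_norms_per_layer=2):
    # HYPOTHETICAL: one additional norm per sublayer for the second stream.
    return extra_norms_per_layer * d

baseline = NUM_LAYERS * layer_params(HIDDEN, FFN)
extra = NUM_LAYERS * extra_norm_params(HIDDEN)
print(baseline, extra, extra / baseline)
```

Under these assumptions the added parameters are a tiny fraction of the transformer stack, but a real comparison would also need measured FLOPs and activation memory, since a second stream has to be stored alongside the first.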
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the analysis and experimental validation without altering the core claims.
Point-by-point responses
- Referee: [Section 3] Section 3 (Architecture description): The coupling of streams through identical residual blocks is presented as delivering compatible optimization signals, yet no analysis of gradient magnitudes, directions, or potential averaging effects is provided. This leaves open the possibility that the stable identity path and attenuated normalized path produce conflicting updates on shared weights, which is load-bearing for the reconciliation claim and the reported stability.
  Authors: We agree that explicit analysis of gradient flow would better substantiate the compatibility of optimization signals. Although the consistent stability observed across 400M–15B models and multiple modalities provides indirect evidence against severe conflicts, we will add a dedicated subsection in the revised Section 3. This will include quantitative comparisons of gradient magnitudes and directional alignment for the shared residual blocks under both streams, along with a brief discussion of any averaging effects.
  Revision: yes
- Referee: [Section 4] Section 4 (Experiments): Results claim consistent improvements and strong stability across scales and modalities, but lack ablations isolating the contribution of each stream or testing under varied optimizers and initializations. Without these, it remains possible that observed gains depend on the specific training recipe rather than the two-stream structure itself.
  Authors: We acknowledge the value of these ablations for isolating the architectural contribution. In the revised manuscript we will expand Section 4 (and supplementary material) with (i) controlled ablations that disable one stream at a time while keeping the other fixed, and (ii) additional runs using alternative optimizers and varied initialization schemes. These results will be reported at the same scales to demonstrate that performance gains are attributable to the two-stream coupling rather than the specific training recipe.
  Revision: yes
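The ablation plan in the second response amounts to a run matrix, sketched below. The stream settings follow the rebuttal (both streams, or one disabled at a time), and the two dense model scales come from the abstract; the optimizer and initialization names are hypothetical placeholders, since the rebuttal does not say which alternatives will be used.

```python
from itertools import product

STREAMS = ["both", "pre_only", "post_only"]   # disable one stream at a time
OPTIMIZERS = ["adamw", "alt_optimizer"]       # PLACEHOLDER names
INITS = ["default_init", "alt_init"]          # PLACEHOLDER names
SCALES = ["400M", "1.3B"]                     # dense LM scales from the abstract

def ablation_grid(streams=STREAMS, optimizers=OPTIMIZERS, inits=INITS, scales=SCALES):
    """Enumerate every combination as a list of run configurations."""
    return [
        {"stream": s, "optimizer": o, "init": i, "scale": m}
        for s, o, i, m in product(streams, optimizers, inits, scales)
    ]

runs = ablation_grid()
print(len(runs))
```

A full grid separates the effect of the two-stream structure from the effect of the recipe: if the "both" setting beats the single-stream settings under every optimizer and initialization, the gain is harder to attribute to one particular training configuration.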
Circularity Check
No circularity: new two-stream architecture validated empirically
Full rationale
The paper introduces SiameseNorm as a structural proposal: a two-stream design with shared residual blocks that supplies optimization signals from both Pre-Norm-like and Post-Norm-like pathways. The central argument rests on identifying a tension in single-stream architectures and then defining the new coupling mechanism, followed by direct experimental validation on 400M–15B models, ViTs, and diffusion models. No equations, fitted parameters, or self-citations are used to derive performance claims; the reported stability and gains are presented as outcomes of the architecture itself rather than reductions to prior inputs or definitions. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Single-stream architectures cannot simultaneously achieve Pre-Norm gradient stability and Post-Norm main-path normalization.
invented entities (1)
- SiameseNorm two-stream architecture (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Rethinking Cross-Layer Information Routing in Diffusion Transformers
  DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
  GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
  GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
  GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
- Attention Residuals
  Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter mo...
Reference graph
Works this paper leans on
- [1] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450.
- [2] Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- [3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- [4] Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585.
- [5] Generating Long Sequences with Sparse Transformers. URL https://arxiv.org/abs/1904.10509.
  Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
- [6] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- [7] Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [8] Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495.
- [9] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- [10] Han, D., Ye, T., Xia, Z., Chen, K., Wang, Y., Chen, H., and Huang, G. Step by step network. arXiv preprint arXiv:2511.14329.
- [11] Hendrycks, D. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- [12] Henry, A., Dachapally, P. R., Pawar, S. S., and Chen, Y. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253.
- [13]
- [14] Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
- [15] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- [16] Liu, L., Liu, X., Gao, J., Chen, W., and Han, J. Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763.
- [17] Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391.
- [18] Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708.
- [19] Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- [20] Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387.
- [21] Sun, W., Song, X., Li, P., Yin, L., Zheng, Y., and Liu, S. The curse of depth in large language models. arXiv preprint arXiv:2502.05795.
- [22] Team, K., Zhang, Y., Lin, Z., Yao, X., Hu, J., Meng, F., Liu, C., Men, X., Yang, S., Li, Z., et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692.
- [23] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et a...
- [24] Xie, S., Zhang, H., Guo, J., Tan, X., Bian, J., Awadalla, H. H., Menezes, A., Qin, T., and Yan, R. ResiDual: Transformer with dual residual connections. arXiv preprint arXiv:2304.14802.
- [25] Xie, Z., Wei, Y., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., Chang, J., Zhao, L., et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880.
- [26] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [27] Zhu, D., Huang, H., Huang, Z., Zeng, Y., Mao, Y., Wu, B., Min, Q., and Zhou, X. Hyper-connections. In The Thirteenth International Conference on Learning Representations, 2025a.
  Zhu, D., Huang, H., Zhou, J., Huang, Z., Zeng, Y., Wu, B., Min, Q., and Zhou, X. Frac-connections: Fractional extension of hyper-connections. arXiv preprint arXiv:2503.14125, 20...
- [28] A. Appendix
  A.1. Comparison with Existing Multi-path Designs
  Figure 7 | Architecture of ResiDual (Xie et al., 2023).
  ResiDual (Xie et al., 2023). The work most structurally similar to ours is ResiDual (Xie et al., 2023), as illustrated in Fig
- [29] However, a fundamental difference lies in the topology: in ResiDual, the Pre-Norm stream (Y-stream) is not connected to the input of the residual block. This implies that the Y-stream acts as a global shortcut that aggregates the output of each residual block directly toward the final output, rather than an active participant in the iterative transformation proce...
- [30] It should be noted that the learning rate and the total number of training tokens vary across our different experimental setups.
  Table 4 | Detailed Experimental Settings for OLMo-1.3B
  Category: Model architecture
    Number of Layers: 16
    Hidden Size: 2048
    Attention Heads: 16
    Key-Value Heads: 16
    FFN Intermediate Size: 8192
    Activation Function: Swi...