pith. sign in

arxiv: 2605.20798 · v1 · pith:ECXVOYFLnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

Pith reviewed 2026-05-21 06:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords transformer modificationsarchitecture evaluationscaling experimentsdownstream evaluationreproducibilitynoise floorCLIMB benchmarktransferability
0
0 comments X

The pith

Most post-2021 Transformer modifications do not transfer to 1-3B models under strict controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper revisits earlier work on Transformer changes by testing twenty modifications from 2021 onward at 1.2B and 3B parameter scales. It applies iso-data, iso-compute, and iso-recipe constraints plus a multi-seed noise floor, with CLIMB-12 downstream scores as the main measure. Only two modifications clear the Bonferroni-corrected significance bar at 1.2B, and one of those fails stable training at 3B. The study also shows that attention-output modifications can match baseline validation loss yet lose 6-16 points downstream. These results indicate that rigorous evaluation practices are needed to separate real gains from noise at current scales.

Core claim

Of the twenty modifications examined, only two reach statistical significance after multiple-comparison correction at 1.2B parameters under the shared training recipe; one of those two becomes unstable at 3B. The loss-to-downstream gap widens markedly for attention-output changes, with two near-baseline loss runs still dropping 6-16 CLIMB points. The work concludes that noise-floor reporting, downstream evaluation, and cross-scale stability checks have become necessary for credible architecture comparisons at 1-3B.

What carries the argument

Iso-data, iso-compute, and iso-recipe controls paired with multi-seed baseline noise floor and CLIMB-12 downstream evaluation.

If this is right

  • Architecture comparisons at 1-3B require noise-floor reporting to avoid false positives from single-run variance.
  • Downstream metrics must take precedence over validation loss, especially for attention-output changes.
  • Cross-scale stability testing between 1.2B and 3B is required before claiming transfer.
  • Most modifications will not meet the threshold when all factors are held equal.
  • The gap between loss and downstream performance can enlarge several-fold for certain modification classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Implementation details and optimizer interactions may explain more variance than the modification itself at these scales.
  • Allowing each modification its own tuned recipe could raise the transfer rate, offering a testable next experiment.
  • The results link to the broader question of whether data scaling or architectural novelty drives recent gains.
  • Repeating the protocol on a different data mixture would test whether the non-transfer finding is data-specific.

Load-bearing premise

Holding the training recipe, data, and compute fixed isolates each modification's effect without hidden biases from implementation details or stability differences at 3B.

What would settle it

A modification that fails Bonferroni correction here but shows reliable gains when the same authors rerun it with a different optimizer schedule or additional seeds at 3B.

Figures

Figures reproduced from arXiv: 2605.20798 by Bin Huang, Guhua Zhang, Jiahao Lu, Jie Zhou, Yang Zhao.

Figure 2
Figure 2. Figure 2: Validation-loss curves for six representative [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-task CLIMB deltas vs baseline at 1.2B. Rows are methods sorted by CLIMB-avg; columns are the 12 [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper updates Narang et al. (2021) by evaluating 20 post-2021 Transformer modifications at 1.2B and 3B scales under strict iso-data, iso-compute, and iso-recipe controls. It incorporates multi-seed baseline noise floors, Bonferroni correction, and CLIMB-12 downstream evaluation as the primary metric. The central finding is that most modifications do not transfer: only two clear Bonferroni correction at 1.2B, and one of those fails to train stably at 3B under the shared recipe. The work also reports that the loss-downstream performance gap enlarges several-fold for attention-output modifications.

Significance. If the controls hold, this provides a timely empirical update showing that rigorous statistical thresholds, downstream tasks, and cross-scale stability checks are now essential for credible architecture comparisons at 1-3B. The reproduction of the negative result with modern methods and the explicit noise-floor reporting are strengths that could help calibrate expectations in the field.

major comments (2)
  1. [Methods / Experimental Setup] Methods / Experimental Setup (iso-recipe control): The claim that modifications 'do not transfer' is load-bearing on the assumption that a single fixed hyperparameter recipe (LR schedule, optimizer, etc.) fairly isolates each change's effect. Architectural modifications can alter gradient variance, activation scales, or optimization curvature, so the baseline recipe may be mismatched; the reported stability failure at 3B for one of the two 1.2B-significant modifications is direct evidence of this risk. Without per-modification retuning or sensitivity analysis, apparent non-transfer could be a tuning artifact rather than an intrinsic result.
  2. [Results] Results (Bonferroni and stability reporting): The abstract states that only two of 20 modifications clear Bonferroni at 1.2B and one fails stability at 3B, but exact p-values, effect sizes, and full multi-seed statistics for all 20 (including the non-significant ones) are needed to verify the correction was applied uniformly and that the curated set does not introduce selection bias.
minor comments (2)
  1. [Abstract] Abstract: briefly define or cite what CLIMB-12 consists of (task composition, number of examples) to improve accessibility.
  2. [Figures] Figures: ensure error bars from the multi-seed runs are visible and labeled consistently across performance plots.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our update to Narang et al. (2021). We address each major point below, indicating where revisions will be made to improve clarity and completeness while preserving the core experimental design.

read point-by-point responses
  1. Referee: [Methods / Experimental Setup] Methods / Experimental Setup (iso-recipe control): The claim that modifications 'do not transfer' is load-bearing on the assumption that a single fixed hyperparameter recipe (LR schedule, optimizer, etc.) fairly isolates each change's effect. Architectural modifications can alter gradient variance, activation scales, or optimization curvature, so the baseline recipe may be mismatched; the reported stability failure at 3B for one of the two 1.2B-significant modifications is direct evidence of this risk. Without per-modification retuning or sensitivity analysis, apparent non-transfer could be a tuning artifact rather than an intrinsic result.

    Authors: We agree that architectural modifications can influence optimization dynamics and that the observed instability at 3B for one modification provides direct evidence of this. Our iso-recipe protocol is intentional and follows the design of Narang et al. (2021) to evaluate whether modifications transfer under a shared, fixed training recipe without per-modification hyperparameter retuning. This mirrors how many modifications are proposed and initially tested in the literature. A full per-modification retuning study would address a different question and require substantially more compute. We will expand the discussion section to explicitly note this limitation, discuss the implications of potential optimization mismatches, and clarify that the reported non-transfer results are conditioned on the fixed recipe. revision: yes

  2. Referee: [Results] Results (Bonferroni and stability reporting): The abstract states that only two of 20 modifications clear Bonferroni at 1.2B and one fails stability at 3B, but exact p-values, effect sizes, and full multi-seed statistics for all 20 (including the non-significant ones) are needed to verify the correction was applied uniformly and that the curated set does not introduce selection bias.

    Authors: We agree that detailed statistics aid verification. The manuscript already reports the multi-seed noise floor, Bonferroni correction, and stability observations in the main text and appendix. In revision we will add a table providing exact p-values, effect sizes, and complete per-seed performance for all 20 modifications. For the curated set, the 20 modifications were chosen to cover major post-2021 categories (attention, normalization, activation, and positional variants); we will add explicit selection criteria to the methods to address potential bias concerns. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical reproduction via new controlled experiments

full rationale

The paper reports fresh experimental results at 1.2B and 3B scales under iso-recipe, iso-data, iso-compute conditions, measuring downstream CLIMB-12 performance and stability for 20 post-2021 modifications. No derivation chain, equations, or fitted parameters are presented that reduce by construction to prior inputs or self-cited results; the central claim (most modifications fail to transfer) rests on direct multi-seed measurements and Bonferroni-corrected comparisons against a reported noise floor. The work cites Narang et al. (2021) only as the historical baseline being updated, not as a load-bearing uniqueness theorem or ansatz. This is a standard self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

This empirical study rests on experimental design choices and statistical assumptions rather than new theoretical axioms or invented entities.

free parameters (2)
  • 1.2B and 3B model sizes
    Chosen to represent the current typical working regime for practical models.
  • CLIMB-12 benchmark selection
    Adopted as the primary downstream evaluation metric in place of pretraining perplexity.
axioms (2)
  • domain assumption Training recipe, data, and compute can be held strictly identical across modifications
    Invoked to enable fair iso-recipe comparisons as stated in the abstract.
  • domain assumption Multi-seed runs provide a reliable baseline noise floor for statistical testing
    Used to apply Bonferroni correction and determine which modifications clear significance.

pith-pipeline@v0.9.0 · 5793 in / 1487 out tokens · 67763 ms · 2026-05-21T06:11:38.106951+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 8 internal anchors

  1. [1]

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA : Reasoning about physical commonsense in natural language. AAAI

  2. [2]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ : Exploring the surprising difficulty of natural yes/no questions. In NAACL

  3. [5]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, and 2 others. 2023. Scaling V ision T ransformers...

  4. [7]

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. 2021. CogView : Mastering text-to-image generation via transformers. In NeurIPS. Introduces Sandwich Normalization for stabilizing large Transformers

  5. [8]

    Bradley Efron. 1987. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171--185

  6. [10]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In ICLR

  7. [11]

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. 2020. Query-key normalization for transformers. EMNLP Findings

  8. [13]

    Andrej Karpathy. 2025. nanochat: The best ChatGPT that \ 100 can buy. https://github.com/karpathy/nanochat. Open-source small-LM reproduction stack; source of the 65 , 664-entry BPE tokenizer used in this work

  9. [14]

    Attention Residuals

    Kimi Team , Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, and 18 others. 2026. Attention residuals. arXiv:2603.15031. Submitted 16 Mar 2026

  10. [15]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE : Large-scale ReAding comprehension dataset from examinations. In EMNLP

  11. [17]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA : Measuring how models mimic human falsehoods. In ACL

  12. [18]

    Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. 2023. Same pre-training loss, better downstream: Implicit bias matters for language models. In ICML

  13. [19]

    Hu, and Kyunghyun Cho

    Nicholas Lourie, Michael Y. Hu, and Kyunghyun Cho. 2025. Scaling laws are unreliable for downstream tasks: A reality check. In Findings of EMNLP. ArXiv:2507.00885

  14. [20]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. EMNLP

  15. [22]

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. 2021. Do Transformer modifications transfer across implementations and applications? In EMNLP

  16. [24]

    Denis Paperno, Germ\' a n Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern\' a ndez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. ACL

  17. [25]

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In NeurIPS. ArXiv:2505.06708; per-head sigmoid gate on attention output

  18. [27]

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande : An adversarial Winograd schema challenge at scale. AAAI

  19. [28]

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. Social IQa : Commonsense reasoning about social interactions. In EMNLP

  20. [30]

    So, Wojciech Ma\' n ke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V

    David R. So, Wojciech Ma\' n ke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2022. Primer: Searching for efficient transformers for language modeling. In ICML. Introduces non-gated ReLU ^ 2 activation; adopted by PaLM-540B

  21. [31]

    Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. 2023. Scaling laws vs.\ model architectures: How does inductive bias influence scaling? In Findings of EMNLP. ArXiv:2207.10551

  22. [32]

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Herv \'e J \'e gou. 2021. Going deeper with image transformers. In ICCV

  23. [34]

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In ICML. Demonstrates Post-LN activation variance growth with depth; motivates Pre-LN

  24. [36]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag : Can a machine really finish your sentence? In ACL

  25. [37]

    Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Fares Obeid, and Zhenzhong Lan. 2025. Value residual learning for alleviating attention concentration in transformers. In ACL. ArXiv:2410.17897

  26. [39]

    Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, and Jinwen Ma. 2025. HybridNorm : Towards stable and efficient transformer training via hybrid normalization. NeurIPS. ArXiv:2503.04598; Attention block Pre-Norm, FFN block Post-Norm

  27. [41]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller ...

  28. [42]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann and Sebastian Borgeaud and Arthur Mensch and Elena Buchatskaya and Trevor Cai and Eliza Rutherford and Diego de Las Casas and Lisa Anne Hendricks and Johannes Welbl and Aidan Clark and Tom Hennigan and Eric Noland and Katie Millican and George van den Driessche and Bogdan Damoc and Aurelia Guy and Simon Osindero and Karen Simonyan and Eric...

  29. [43]

    EMNLP , year =

    Sharan Narang and Hyung Won Chung and Yi Tay and William Fedus and Thibault Fevry and Michael Matena and Karishma Malkan and Noah Fiedel and Noam Shazeer and Zhenzhong Lan and Yanqi Zhou and Wei Li and Nan Ding and Jake Marcus and Adam Roberts and Colin Raffel , title =. EMNLP , year =

  30. [44]

    Findings of EMNLP , year =

    Yi Tay and Mostafa Dehghani and Samira Abnar and Hyung Won Chung and Sharan Narang and Dani Yogatama and Ashish Vaswani and Donald Metzler , title =. Findings of EMNLP , year =

  31. [45]

    ICML , year =

    Hong Liu and Sang Michael Xie and Zhiyuan Li and Tengyu Ma , title =. ICML , year =

  32. [46]

    Hu and Kyunghyun Cho , title =

    Nicholas Lourie and Michael Y. Hu and Kyunghyun Cho , title =. Findings of EMNLP , year =

  33. [47]

    EMNLP Findings , year =

    Alex Henry and Prudhvi Raj Dachapally and Shubham Pawar and Yuxuan Chen , title =. EMNLP Findings , year =

  34. [48]

    ICML , year =

    Mostafa Dehghani and Josip Djolonga and Basil Mustafa and Piotr Padlewski and Jonathan Heek and Justin Gilmer and Andreas Steiner and Mathilde Caron and Robert Geirhos and Ibrahim Alabdulmohsin and Rodolphe Jenatton and Lucas Beyer and Michael Tschannen and Anurag Arnab and Xiao Wang and Carlos Riquelme and Matthias Minderer and Joan Puigcerver and Utku E...

  35. [49]

    arXiv:2410.02703 , year =

    Yaniv Leviathan and Matan Kalman and Yossi Matias , title =. arXiv:2410.02703 , year =

  36. [50]

    Differential transformer, 2024

    Tianzhu Ye and Li Dong and Yuqing Xia and Yutao Sun and Yi Zhu and Gao Huang and Furu Wei , title =. arXiv:2410.05258 , year =

  37. [51]

    ACL , year =

    Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Fares Obeid and Zhenzhong Lan , title =. ACL , year =

  38. [52]

    Zayd M. K. Zuhri and Erland Hilman Fuadi and Alham Fikri Aji , title =. arXiv:2504.20966 , year =

  39. [53]

    arXiv:2409.04431 , year =

    Jason Ramapuram and Federico Danieli and Eeshan Dhekane and Floris Weers and Dan Busbridge and Pierre Ablin and Tatiana Likhomanenko and Jagrit Digani and Zijin Gu and Amitis Shidani and Russ Webb , title =. arXiv:2409.04431 , year =

  40. [54]

    Nakanishi , title =

    Ken M. Nakanishi , title =. arXiv:2501.19399 , year =

  41. [55]

    arXiv:2402.02622 , year =

    Matteo Pagliardini and Amirkeivan Mohtashami and Francois Fleuret and Martin Jaggi , title =. arXiv:2402.02622 , year =

  42. [56]

    Hyper-Connections.arXiv preprint arXiv:2409.19606, 2024

    Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou , title =. arXiv:2409.19606 , year =

  43. [57]

    DeepSeek-V3 Technical Report

    DeepSeek-AI , title =. arXiv:2412.19437 , year =

  44. [58]

    Going Deeper with Image Transformers , booktitle =

    Hugo Touvron and Matthieu Cord and Alexandre Sablayrolles and Gabriel Synnaeve and Herv. Going Deeper with Image Transformers , booktitle =

  45. [59]

    2026 , note =

    Attention Residuals , journal =. 2026 , note =

  46. [60]

    GLU Variants Improve Transformer

    Noam Shazeer , title =. arXiv:2002.05202 , year =

  47. [61]

    AAAI , year =

    Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. AAAI , year =

  48. [62]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord , title =. arXiv:1803.05457 , year =

  49. [63]

    ACL , year =

    Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , title =. ACL , year =

  50. [64]

    AAAI , year =

    Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi , title =. AAAI , year =

  51. [65]

    EMNLP , year =

    Maarten Sap and Hannah Rashkin and Derek Chen and Ronan LeBras and Yejin Choi , title =. EMNLP , year =

  52. [66]

    ICLR , year =

    Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt , title =. ICLR , year =

  53. [67]

    EMNLP , year =

    Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal , title =. EMNLP , year =

  54. [68]

    NAACL , year =

    Christopher Clark and Kenton Lee and Ming-Wei Chang and Tom Kwiatkowski and Michael Collins and Kristina Toutanova , title =. NAACL , year =

  55. [69]

    EMNLP , year =

    Guokun Lai and Qizhe Xie and Hanxiao Liu and Yiming Yang and Eduard Hovy , title =. EMNLP , year =

  56. [70]

    Denis Paperno and Germ\'. The. ACL , year =

  57. [71]

    ACL , year =

    Stephanie Lin and Jacob Hilton and Owain Evans , title =. ACL , year =

  58. [72]

    2023 , publisher =

    Leo Gao and Jonathan Tow and Baber Abbasi and Stella Biderman and Sid Black and Anthony DiPofi and Charles Foster and Laurence Golding and Jeffrey Hsu and Alain Le Noac'h and Haonan Li and Kyle McDonell and Niklas Muennighoff and Chris Ociepa and Jason Phang and Laria Reynolds and Hailey Schoelkopf and Aviya Skowron and Lintang Sutawika and Eric Tang and ...

  59. [73]

    Journal of the American Statistical Association , volume =

    Bradley Efron , title =. Journal of the American Statistical Association , volume =

  60. [74]

    NeurIPS , year =

    Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang , title =. NeurIPS , year =

  61. [75]

    NeurIPS , year =

    Zhijian Zhuo and Yutao Zeng and Ya Wang and Sijun Zhang and Jian Yang and Xiaoqing Li and Xun Zhou and Jinwen Ma , title =. NeurIPS , year =

  62. [76]

    So and Wojciech Ma\'

    David R. So and Wojciech Ma\'. Primer: Searching for Efficient Transformers for Language Modeling , booktitle =. 2022 , note =

  63. [77]

    NeurIPS , year =

    Zihan Qiu and Zekun Wang and Bo Zheng and Zeyu Huang and Kaiyue Wen and Songlin Yang and Rui Men and Le Yu and Fei Huang and Suozhi Huang and Dayiheng Liu and Jingren Zhou and Junyang Lin , title =. NeurIPS , year =

  64. [78]

    ICML , year =

    Ruibin Xiong and Yunchang Yang and Di He and Kai Zheng and Shuxin Zheng and Chen Xing and Huishuai Zhang and Yanyan Lan and Liwei Wang and Tie-Yan Liu , title =. ICML , year =

  65. [79]

    Nemotron-climb: Clustering-based iterative data mixture bootstrap- ping for language model pre-training.arXiv preprint arXiv:2504.13161, 2025

    Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Celine Lin and Jan Kautz and Pavlo Molchanov , title =. arXiv:2504.13161 , year =

  66. [80]

    2025 , howpublished =

    Andrej Karpathy , title =. 2025 , howpublished =