Tapered Language Models
Pith reviewed 2026-06-26 09:09 UTC · model grok-4.3
The pith
Allocating more MLP capacity to earlier layers than to later ones lowers language model perplexity relative to uniform baselines under a fixed parameter budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a fixed parameter budget, tapering MLP widths monotonically from wider early layers to narrower late layers via a cosine schedule yields lower perplexity and better downstream performance than uniform-width models. The benefit is consistent across Transformer, Gated Attention, Hope-attention, and Titans architectures at three model scales.
What carries the argument
A cosine tapering schedule applied to MLP width enforces a smooth, monotonic decrease in capacity across depth while keeping the total parameter count constant.
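The budget constraint can be made concrete with a short worked equation. The source does not give the schedule's functional form, so the cosine expression below, the taper strength \alpha, and the symbols w_\ell, \bar{w}, and L are illustrative assumptions, not the paper's definitions.

```latex
% Assumed notation (hypothetical): L layers indexed \ell = 0, ..., L-1,
% uniform-baseline MLP width \bar{w}, taper strength 0 \le \alpha < 1.
w_\ell = \bar{w}\left(1 + \alpha \cos\frac{\pi \ell}{L-1}\right),
\qquad \ell = 0, \dots, L-1 .

% The cosine terms cancel in mirrored pairs, since
% \cos\frac{\pi (L-1-\ell)}{L-1} = -\cos\frac{\pi \ell}{L-1}, so
\sum_{\ell=0}^{L-1} \cos\frac{\pi \ell}{L-1} = 0
\quad\Longrightarrow\quad
\sum_{\ell=0}^{L-1} w_\ell = L\,\bar{w} .
```

Under this assumed form the width falls smoothly from (1+\alpha)\bar{w} in the first layer to (1-\alpha)\bar{w} in the last, while the summed width equals that of the uniform baseline. Because an MLP's weight count is linear in its hidden width at a fixed model dimension, equal summed width implies an equal MLP parameter count.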
If this is right
- Early-heavy capacity allocation outperforms uniform allocation on perplexity.
- The tapering benefit transfers to multiple distinct LM architectures without other changes.
- Downstream benchmark scores improve alongside perplexity under the same schedule.
- No increase in parameter count or training FLOPs is required to obtain the gains.
Where Pith is reading between the lines
- The same monotonic tapering principle could be tested on attention projection widths or other per-layer parameter groups.
- The cosine schedule's specific shape might be replaced by other monotonic functions to measure sensitivity.
- At larger scales the optimal taper ratio may shift, offering a new axis to explore alongside width and depth scaling laws.
- Reversing the taper direction should reliably degrade performance if the early-layer emphasis is the true driver.
Load-bearing premise
Layers contribute non-uniformly to the output, so capacity should be allocated more to early layers than to late layers.
What would settle it
A controlled run in which reverse tapering (narrow early layers, wide late layers) matches or beats the forward-tapered model on perplexity would falsify the directional claim.
Original abstract
Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, under a fixed parameter budget, monotonically tapering MLP widths across depth via a cosine schedule (allocating more capacity to earlier layers) yields consistent improvements in perplexity and downstream benchmarks over uniform-width baselines, while reverse tapering hurts performance. This holds across three scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans), establishing depth-aware capacity allocation as an architecture-agnostic design lever at no extra cost.
Significance. If the empirical results hold under fuller reporting, the work identifies a simple, zero-cost axis for LM design that directly tests non-uniform layer contributions via controlled ablations. The fixed-budget comparisons and cross-architecture consistency are strengths; the approach is reproducible in principle via the described schedule and could be adopted broadly if the gains prove robust.
major comments (2)
- [Experimental Results] The experimental protocol (methods and results sections) does not report run-to-run variance, number of seeds, or statistical tests for the reported perplexity and benchmark gains; without these, the claim of 'consistent' improvements across scales cannot be fully assessed for reliability.
- [Methods] The cosine tapering schedule is described as monotonic and parameter-preserving, but the methods do not specify the exact functional form (e.g., the cosine arguments or discretization to integer widths) or provide pseudocode; this detail is load-bearing for exact reproduction of the reported allocations.
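The variance reporting requested in the first major comment could look roughly like the sketch below. It is not taken from the paper: the function name `summarize_paired` and the choice of a paired, per-seed comparison are assumptions, and any perplexity values passed to it would have to come from real multi-seed runs.

```python
import math
import statistics


def summarize_paired(baseline_ppl, tapered_ppl):
    """Summarize per-seed perplexity differences (baseline minus tapered).

    Both lists must be ordered by seed, so index i in each list refers to
    runs that share the same seed. A positive difference means the tapered
    model reached lower perplexity on that seed.
    """
    if len(baseline_ppl) != len(tapered_ppl) or len(baseline_ppl) < 2:
        raise ValueError("need matched runs from at least two seeds")
    diffs = [b - t for b, t in zip(baseline_ppl, tapered_ppl)]
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return {
        "n": len(diffs),
        "mean_diff": statistics.mean(diffs),
        "std_diff": std_diff,
        "std_err": std_diff / math.sqrt(len(diffs)),
        "wins": sum(d > 0 for d in diffs),  # seeds where tapering helped
    }
```

Reporting the mean difference together with its standard error and the win count per configuration would let a reader judge whether a gain exceeds seed-to-seed noise. With a single seed per configuration, as the rebuttal describes, none of these quantities can be computed.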
minor comments (2)
- [Figures/Tables] Figure captions and table headers could more explicitly state that all comparisons use identical total parameter counts and matched compute.
- [Introduction] The abstract and introduction reference prior evidence on non-uniform layer contributions; adding 1-2 key citations would strengthen the motivation without altering the empirical focus.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.
Point-by-point responses
- Referee: [Experimental Results] The experimental protocol (methods and results sections) does not report run-to-run variance, number of seeds, or statistical tests for the reported perplexity and benchmark gains; without these, the claim of 'consistent' improvements across scales cannot be fully assessed for reliability.
  Authors: We acknowledge the value of reporting run-to-run variance for assessing reliability. All experiments used a single fixed seed per configuration for computational efficiency and reproducibility. The observed gains were replicated consistently across three scales and four distinct architectures, providing supporting evidence for the claims. In the revision we will add an explicit statement in the methods section detailing the seed usage and noting that variance was not computed due to resource constraints.
  Revision: yes
- Referee: [Methods] The cosine tapering schedule is described as monotonic and parameter-preserving, but the methods do not specify the exact functional form (e.g., the cosine arguments or discretization to integer widths) or provide pseudocode; this detail is load-bearing for exact reproduction of the reported allocations.
  Authors: We agree that the precise functional form and discretization details are necessary for reproduction. The revised manuscript will include the exact cosine formula (with arguments and normalization), the integer rounding procedure that preserves total parameter count, and pseudocode for the width allocation.
  Revision: yes
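Until the authors publish their own formula and pseudocode, the following minimal sketch shows one way a cosine taper could be discretized to integer widths without changing the total. Everything in it is an assumption: the function name `cosine_taper_widths`, the `alpha` taper strength, the specific cosine form, and the largest-remainder rounding are illustrative choices, not the paper's method.

```python
import math


def cosine_taper_widths(num_layers, uniform_width, alpha):
    """Return integer MLP widths, widest first, with a preserved total.

    Assumed (hypothetical) schedule: layer i gets the ideal width
    uniform_width * (1 + alpha * cos(pi * i / (num_layers - 1))).
    The cosine terms sum to zero, so the ideal widths sum to
    num_layers * uniform_width before rounding.
    """
    if num_layers < 2:
        raise ValueError("need at least two layers to taper")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")

    total = num_layers * uniform_width
    ideal = [
        uniform_width * (1.0 + alpha * math.cos(math.pi * i / (num_layers - 1)))
        for i in range(num_layers)
    ]

    # Round down first, then hand the leftover units to the layers with the
    # largest fractional parts (largest-remainder rounding), so the integer
    # widths sum exactly to the uniform baseline's total.
    widths = [math.floor(x) for x in ideal]
    shortfall = total - sum(widths)
    by_fraction = sorted(
        range(num_layers), key=lambda i: ideal[i] - widths[i], reverse=True
    )
    for i in by_fraction[:shortfall]:
        widths[i] += 1
    return widths
```

Calling `cosine_taper_widths(12, 2048, 0.5)` yields twelve non-increasing widths whose sum equals `12 * 2048`, and reversing the list with `widths[::-1]` gives a reverse-tapered allocation with the same total, which is the kind of matched control the directional claim depends on. Rounding widths to hardware-friendly multiples is deliberately left out of this sketch.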
Circularity Check
No significant circularity
The paper's core contribution consists of controlled empirical ablations that directly compare uniform-width MLPs against cosine-tapered and reverse-tapered allocations under identical total parameter budgets, measuring perplexity and downstream metrics across multiple scales and architectures. No derivation, equation, or fitted parameter is presented whose output reduces by construction to the input; the directional performance differences are the measured result rather than a renamed fit. The background premise on non-uniform layer contributions is referenced as prior evidence but is not load-bearing for the claim, which rests on the new experiments themselves.
Axiom & Free-Parameter Ledger
free parameters (1)
- cosine tapering schedule parameters
axioms (1)
- domain assumption: Layers contribute non-uniformly to the final output, with later layers refining rather than transforming the residual stream