From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
Pith reviewed 2026-06-28 14:45 UTC · model grok-4.3
The pith
SubFit compresses LLMs at the submodule level by replacing non-contiguous Attention and FeedForward components with individual fitted residual bypasses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SubFit moves replacement-based compression from full-layer granularity with contiguous selection to submodule granularity, where Attention and FeedForward submodules are chosen independently across depth and each receives its own fitted residual bypass. Across five base and five instruction-tuned LLMs and sparsity levels from 12.5 percent to 37.5 percent, this yields the best overall perplexity-accuracy trade-off, with larger margins at higher sparsity; at 25 percent sparsity it retains 84.6 percent of dense downstream accuracy at 2.42 times perplexity degradation versus 81.6 percent and 4.34 times for the strongest baselines, plus measurable inference speedup and KV-cache savings.
What carries the argument
Submodule-level Fitted residual replacement (SubFit), which independently selects and approximates non-contiguous Attention and FeedForward submodules with per-submodule lightweight fitted residuals.
If this is right
- SubFit improves the perplexity-accuracy trade-off at every evaluated sparsity level from 12.5 percent to 37.5 percent.
- The gains widen under more aggressive compression.
- The method applies equally to base and instruction-tuned models.
- It delivers inference speedup and KV-cache reduction as direct by-products of the compression.
Where Pith is reading between the lines
- Optimal replacement strategies likely differ between Attention and FeedForward submodules, suggesting type-specific fitting could be explored further.
- The non-contiguous selection pattern may transfer to other compression families such as structured pruning or quantization.
- Hybrid pipelines could combine submodule replacement for some layers with full-layer methods for others.
Load-bearing premise
Redundancy in pretrained transformers is not confined to contiguous regions and does not evenly distribute between Attention and FeedForward outputs.
What would settle it
If a contiguous full-layer replacement method achieved equal or superior perplexity and downstream accuracy to SubFit on the same ten models at the same sparsity levels, the claimed advantage of submodule granularity would be falsified.
Figures
read the original abstract
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SubFit, a post-training replacement-based compression technique for LLMs that selects and replaces individual Attention and FeedForward submodules (rather than full layers) in a non-contiguous manner, each with its own lightweight fitted residual bypass. It evaluates the method on ten LLMs across five sparsity levels (12.5%–37.5%) against four layer-granularity baselines, claiming superior aggregate perplexity-accuracy trade-offs, with specific gains at 25% sparsity (84.6% downstream accuracy retention and 2.42× perplexity degradation vs. 81.6% and 4.34× for the strongest baseline), plus inference and KV-cache benefits.
Significance. If the empirical comparisons hold under matched compression ratios, the result would demonstrate that submodule-level granularity can better exploit non-uniform redundancy in transformers, improving the efficiency frontier for replacement-based compression without requiring retraining. The availability of code supports reproducibility.
major comments (1)
- [Experimental setup / sparsity definition (likely §3.2 or §4)] The central empirical claim (abstract and §4) compares SubFit and layer-granularity baselines at identical reported sparsity percentages (e.g., 25%). However, each transformer layer contains two submodules of unequal parameter count; removing 25% of submodules therefore does not produce the same total parameter reduction, FLOPs, or functional impact as removing 25% of layers. The manuscript must define sparsity explicitly (parameter count, FLOPs, or submodule count) and demonstrate that the reported levels are matched on actual compression ratio before the perplexity-accuracy numbers can be directly compared.
minor comments (2)
- [Abstract] The abstract states 'measurable inference speedup and KV-cache savings' without numbers; add quantitative values or a reference to the relevant table/figure.
- [Method description] Notation for the fitted residual bypass (e.g., its parameter count relative to the replaced submodule) should be introduced once and used consistently in equations and text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the experimental setup. We address the major comment below.
read point-by-point responses
-
Referee: [Experimental setup / sparsity definition (likely §3.2 or §4)] The central empirical claim (abstract and §4) compares SubFit and layer-granularity baselines at identical reported sparsity percentages (e.g., 25%). However, each transformer layer contains two submodules of unequal parameter count; removing 25% of submodules therefore does not produce the same total parameter reduction, FLOPs, or functional impact as removing 25% of layers. The manuscript must define sparsity explicitly (parameter count, FLOPs, or submodule count) and demonstrate that the reported levels are matched on actual compression ratio before the perplexity-accuracy numbers can be directly compared.
Authors: We agree that the manuscript requires an explicit definition of sparsity and verification that comparisons occur at matched compression ratios. Sparsity in the current version is defined as the fraction of submodules (SubFit) or layers (baselines) replaced, which does not guarantee identical parameter or FLOP reduction given the unequal sizes of Attention and FeedForward submodules. In the revision we will (1) state this definition clearly in §3, (2) add a table reporting the actual parameter reduction and estimated FLOP reduction for each reported sparsity level under both SubFit and the baselines, and (3) either adjust the sparsity levels or provide side-by-side matched-ratio comparisons so that the perplexity-accuracy results can be interpreted under equivalent compression budgets. revision: yes
Circularity Check
No circularity; empirical evaluation is self-contained
full rationale
The paper proposes SubFit as a post-training compression technique at submodule granularity and reports direct empirical comparisons of perplexity and downstream accuracy against four replacement-based baselines across ten LLMs and five sparsity levels. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described method. The central claims rest on external benchmark results rather than any reduction to the paper's own inputs or prior author work.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameters of fitted residual bypasses
axioms (1)
- domain assumption Redundancy in LLMs is distributed non-contiguously and differently across submodule types
invented entities (1)
-
Submodule-level fitted residual bypass
no independent evidence
Reference graph
Works this paper leans on
-
[3]
Elia Cunegatti and Leonardo Lucio Custode and Giovanni Iacca , year = 2025, journal =
2025
-
[4]
Xiaodong Chen and Yuxuan Hu and Jing Zhang and Yanling Wang and Cuiping Li and Hong Chen , year = 2025, booktitle =
2025
-
[5]
Shopkhoev, Dmitriy and Ali, Ammar and Zhussip, Magauiya and Malykh, Valentin and Lefkimmiatis, Stamatios and Komodakis, Nikos and Zagoruyko, Sergey , year = 2025, booktitle =
2025
-
[8]
Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , year = 2022, booktitle =
2022
-
[10]
Frantar, Elias and Alistarh, Dan , year = 2023, booktitle =
2023
-
[11]
Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, Zico , year = 2024, booktitle =
2024
-
[12]
Ashkboos, Saleh and Croci, Maximilian and Gennari do Nascimento, Marcelo and Hoefler, Torsten and Hensman, James , year = 2024, booktitle =
2024
-
[13]
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , year = 2017, booktitle =
2017
-
[15]
Anderson, Theodore Wilbur , year = 1951, journal =
1951
-
[16]
doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =
Alan Julian Izenman , year = 1975, journal =. doi:https://doi.org/10.1016/0047-259X(75)90042-1 , issn =
-
[17]
Eckart, Carl and Young, Gale , year = 1936, journal =
1936
-
[18]
Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song , year = 2024, journal =
2024
-
[19]
Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song , year = 2023, booktitle =
2023
-
[20]
Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan , year = 2022, journal =
2022
-
[21]
Shao, Wenqi and Chen, Mengzhao and Zhang, Zhaoyang and Xu, Peng and Zhao, Lirui and Li, Zhiqian and Zhang, Kaipeng and Peng, Gao and Qiao, Yu and Luo, Ping , year = 2024, booktitle =
2024
-
[22]
Xinyin Ma and Gongfan Fang and Xinchao Wang , year = 2023, booktitle =
2023
-
[23]
Fabrizio Sandri and Elia Cunegatti and Giovanni Iacca , year = 2025, journal =
2025
-
[24]
Men, Xin and Xu, Mingyu and Zhang, Qingyu and Yuan, Qianhao and Wang, Bingning and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Chen, Weipeng , year = 2025, booktitle =
2025
-
[25]
Yang, Yifei and Cao, Zouying and Zhao, Hai , year = 2024, booktitle =
2024
-
[26]
Lui , year = 2026, booktitle =
Yikun Jiang and Huanyu Wang and Tianhong Ding and Wenhu Zhang and Yiming Wu and Hanbin Zhao and John C.S. Lui , year = 2026, booktitle =
2026
-
[27]
An, Yongqi and Zhao, Xu and Yu, Tao and Tang, Ming and Wang, Jinqiao , year = 2024, booktitle =
2024
-
[28]
Ash and Dipendra Misra , year = 2024, booktitle =
Pratyusha Sharma and Jordan T. Ash and Dipendra Misra , year = 2024, booktitle =
2024
-
[29]
Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi , year = 2024, booktitle =
2024
-
[30]
Liu, Yijiang and Yang, Huanrui and Chen, Youxin and Zhang, Rongyu and Wang, Miao and Du, Yuan and Du, Li , year = 2025, booktitle =
2025
-
[31]
Liu, Zechun and Oguz, Barlas and Zhao, Changsheng and Chang, Ernie and Stock, Pierre and Mehdad, Yashar and Shi, Yangyang and Krishnamoorthi, Raghuraman and Chandra, Vikas , year = 2024, booktitle =
2024
-
[32]
Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping , year = 2025, booktitle =
2025
-
[33]
Tang, Shengkun and Sieberling, Oliver and Kurtic, Eldar and Shen, Zhiqiang and Alistarh, Dan , year = 2025, journal =
2025
-
[34]
Liu, Kainan and Zhang, Yong and Cheng, Ning and Li, Zhitao and Wang, Shaojun and Xiao, Jing , year = 2025, booktitle =
2025
-
[35]
Zhiqiang Shen and Tianhua Tao and Liqun Ma and Willie Neiswanger and Zhengzhong Liu and Hongyi Wang and Bowen Tan and Joel Hestness and Natalia Vassilieva and Daria Soboleva and Eric Xing , year = 2023, journal =
2023
-
[36]
Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah , year = 2023, journal =
2023
-
[37]
Sclar, Melanie and Choi, Yejin and Tsvetkov, Yulia and Suhr, Alane , year = 2024, booktitle =
2024
-
[38]
Shrestha, Safal and Shrestha, Anubhav and Nepal, Aadim and Kim, Minwu and Ross, Keith , year = 2026, journal =
2026
-
[40]
, year = 2020, journal =
Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , year = 2020, journal =
2020
-
[41]
Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , year = 2016, journal =
2016
-
[42]
Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , year = 2019, booktitle =
2019
-
[43]
Bisk, Yonatan and Zellers, Rowan and Bras, Ronan Le and Gao, Jianfeng and Choi, Yejin , year = 2020, booktitle =
2020
-
[44]
Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , year = 2021, month = aug, journal =
2021
-
[45]
Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish , year = 2018, booktitle =
2018
-
[46]
Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin , year = 2019, booktitle =
2019
-
[47]
Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , year = 2019, booktitle =
2019
-
[48]
Lin, Stephanie and Hilton, Jacob and Evans, Owain , year = 2022, booktitle =
2022
-
[49]
Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , year = 2021, booktitle =
2021
-
[50]
Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , year = 2018, journal =
2018
-
[51]
Amini, Aida and Gabriel, Saadia and Lin, Shanchuan and Koncel-Kedziorski, Rik and Choi, Yejin and Hajishirzi, Hannaneh , year = 2019, booktitle =
2019
-
[52]
Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , year = 2021, journal =
2021
-
[53]
and Gardner, Matt , year = 2017, booktitle =
Welbl, Johannes and Liu, Nelson F. and Gardner, Matt , year = 2017, booktitle =
2017
-
[54]
Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...
2024
-
[55]
Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , year = 2019, booktitle =
2019
-
[56]
Xinrui Chen and Haoli Bai and Tao Yuan and Ruikang Liu and Kang Zhao and Xianzhi Yu and Lu Hou and Tian Guan and Yonghong He and Chun Yuan , year = 2026, booktitle =
2026
-
[57]
Dettmers, Tim and Svirschevski, Ruslan and Egiazarian, Vage and Kuznedelev, Denis and Frantar, Elias and Ashkboos, Saleh and Borzunov, Alexander and Hoefler, Torsten and Alistarh, Dan , year = 2023, journal =
2023
-
[58]
Mahoney and Kurt Keutzer , year = 2024, booktitle =
Sehoon Kim and Coleman Richard Charles Hooper and Amir Gholami and Zhen Dong and Xiuyu Li and Sheng Shen and Michael W. Mahoney and Kurt Keutzer , year = 2024, booktitle =
2024
-
[59]
Andrey Gromov and Kushal Tirumala and Hassan Shapourian and Paolo Glorioso and Dan Roberts , year = 2025, booktitle =
2025
-
[60]
Siddiqui, Shoaib Ahmed and Dong, Xin and Heinrich, Greg and Breuel, Thomas and Kautz, Jan and Krueger, David and Molchanov, Pavlo , year = 2024, booktitle =
2024
-
[62]
Shazeer, Noam , year = 2020, journal =
2020
-
[63]
Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , year = 2024, journal =
2024
-
[64]
Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , year = 2025, journal =
2025
-
[65]
Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and others , year = 2024, journal =
2024
-
[66]
Bi, Xiao and Chen, Deli and Chen, Guanting and Chen, Shanhuang and Dai, Damai and Deng, Chengqi and Ding, Honghui and Dong, Kai and Du, Qiushi and Fu, Zhe and others , year = 2024, journal =
2024
-
[67]
Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. https://doi.org/10.18653/v1/2021.acl-long.568 Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V...
-
[68]
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
2019
-
[69]
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-based adaptive structured pruning for large language models . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865--10873
2024
-
[70]
Saleh Ashkboos, Maximilian Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/316648eb8b4ffb6010f531b07848c300-Paper-Conference.pdf SliceGPT: Compress Large Language Models by Deleting Rows and Columns . In International Conference on Learning Representations, volume 2024...
2024
-
[71]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, and 1 others. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism . arXiv preprint arXiv:2401.02954
Pith/arXiv arXiv 2024
-
[72]
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language . In Proceedings of the AAAI Conference on Artificial Intelligence
2020
-
[73]
Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. 2025 a . EfficientQAT: Efficient Quantization-Aware Training for Large Language Models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081--10100
2025
-
[74]
Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. 2025 b . https://openreview.net/forum?id=IC5RJvRoMp Streamlining Redundant Layers to Compress Large Language Models . In International Conference on Learning Representations
2025
-
[75]
Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, and Chun Yuan. 2026 a . https://openreview.net/forum?id=AwsiYZ2ets A Simple Linear Patch Revives Layer-Pruned Large Language Models . In Advances in Neural Information Processing Systems
2026
-
[76]
Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, and Chun Yuan. 2026 b . https://doi.org/10.1609/aaai.v40i24.39120 Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation . Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):20316–20324
-
[77]
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
2019
-
[78]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge . arXiv preprint arXiv:1803.05457
Pith/arXiv arXiv 2018
-
[79]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems . arXiv preprint arXiv:2110.14168
Pith/arXiv arXiv 2021
-
[80]
Elia Cunegatti, Leonardo Lucio Custode, and Giovanni Iacca. 2025. https://openreview.net/forum?id=uPyNaNqFK2 Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training . Transactions on Machine Learning Research
2025
-
[81]
Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot . In International Conference on Machine Learning, pages 10323--10337. PMLR
2023
-
[82]
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers . arXiv preprint arXiv:2210.17323
Pith/arXiv arXiv 2022
-
[83]
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://zenodo.org/records/12608602 A framework for...
arXiv 2024
-
[84]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783
Pith/arXiv arXiv 2024
-
[85]
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. 2025. https://openreview.net/forum?id=ngmEcEer8a The Unreasonable Ineffectiveness of the Deeper Layers . In International Conference on Learning Representations
2025
-
[86]
Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, and Yu Takagi. 2025. https://arxiv.org/abs/2506.14681 Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
arXiv 2025
-
[87]
Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. 2024. https://arxiv.org/abs/2406.15786 What Matters in Transformers? Not All Attention is Needed . Preprint, arXiv:2406.15786
arXiv 2024
-
[88]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding . In International Conference on Learning Representations
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.