pith. sign in

arxiv: 2606.24747 · v1 · pith:SMQCWGIUnew · submitted 2026-06-23 · 💻 cs.AI · cs.CE

Scaling Laws for Task-Specific LLM Distillation

Pith reviewed 2026-06-25 23:27 UTC · model grok-4.3

classification 💻 cs.AI cs.CE
keywords scaling lawsLLM distillationmodel pruningchain-of-thoughtdomain-specific compressionknowledge retentionquantitative finance
0
0 comments X

The pith

Chain-of-thought supervision recovers general knowledge that pruning erases in task-specific LLM distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives empirical scaling laws for domain-specific LLM compression by quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as the application domain, it compares logit-based and LoRA-based distillation under iterative structural pruning and introduces a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse earlier, and the central result is that supervision format drives this tradeoff with chain-of-thought supervision actively recovering general knowledge that pruning erases.

Core claim

The paper claims that empirical scaling laws for domain-specific LLM compression reveal supervision format as the key driver of the tradeoff between in-domain task performance and retention of general knowledge, with chain-of-thought supervision actively recovering general knowledge that pruning erases while a blended chain-of-thought loss stabilizes the distillation process.

What carries the argument

The blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces during iterative structural pruning.

Load-bearing premise

The observed scaling behaviors and the stabilizing effect of the blended chain-of-thought loss on KL-divergence distillation are driven primarily by supervision format rather than unmeasured factors such as specific model architectures, dataset composition, or benchmark choices.

What would settle it

An experiment showing that the same scaling laws and recovery effect hold when supervision format is varied but model architecture, dataset composition, and benchmarks are changed would support the claim; finding that architecture or dataset shifts alter the tradeoff independently of supervision format would falsify it.

Figures

Figures reproduced from arXiv: 2606.24747 by Dhruv Desai, Ioana Boier, Lavinia Ghita.

Figure 1
Figure 1. Figure 1: Train set label distribution over the 35 event classes in the financial news dataset (training [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Spectrum of teacher supervision and thinking depth (four configurations used in this paper). [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Uniform iterative pruning and knowledge distillation: at each stage the model is pruned by [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: In-domain performance as a function of training dataset size. (a) Macro F1; (b) gold-label [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: General-knowledge benchmarks: MMLU and MMLU-Pro scores as a function of training [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: In-domain performance under uniform iterative pruning and distillation. (a) Macro F1; (b) [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: General-knowledge benchmarks for uniform iterative compression: MMLU and MMLU [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: In-domain performance under decayed and uniform iterative schedules. (a) Macro F1; (b) [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: General-knowledge benchmarks for decayed and uniform iterative schedules: MMLU and [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: UMAP visualization of the dataset embeddings colored by event category. The clustering [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: In-domain classification: weighted F1 and accuracy vs. dataset size for data scaling laws [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: In-domain classification: weighted F1 and accuracy vs. model size for uniform iterative [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: In-domain classification: weighted F1 and accuracy vs. model size for decayed iterative [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
read the original abstract

Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare logit-based and LoRA-based distillation under iterative structural pruning, introducing a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse well before the same point; supervision format is the key driver of this tradeoff, with chain-of-thought supervision actively recovering general knowledge that pruning erases. We release the headline dataset FinHeadlineMix, scaling law results, and practical recommendations to provide a reusable framework for domain-specific compression decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper derives empirical scaling laws for domain-specific LLM compression using quantitative finance as the application domain. It compares logit-based and LoRA-based distillation under iterative structural pruning, introduces a blended chain-of-thought supervision loss to stabilize KL-divergence distillation over reasoning traces, and claims that supervision format is the key driver of the in-domain vs. general-knowledge tradeoff, with CoT supervision recovering knowledge erased by pruning. The work releases the FinHeadlineMix dataset along with scaling-law results and practical recommendations.

Significance. If the scaling laws prove robust and the attribution to supervision format can be isolated from confounds, the work would supply a reusable empirical framework for compression decisions in specialized domains. The dataset release and emphasis on practical recommendations are strengths that support reproducibility and downstream use.

major comments (2)
  1. [Abstract] Abstract: the claim that 'supervision format is the key driver of this tradeoff' is load-bearing for the central contribution, yet the described experiments do not isolate it. No ablations are mentioned that hold architecture, pruning schedule, loss components, and dataset fixed while varying only supervision format (or that swap domain while holding format fixed). Without such controls, the observed in-domain stability and general-knowledge recovery could be driven by interactions with the quantitative-finance domain or FinHeadlineMix composition rather than format per se.
  2. [Abstract] Abstract (results paragraph): the stabilizing effect of the blended chain-of-thought loss on KL-divergence distillation is asserted without reference to concrete metrics (e.g., KL values, scaling exponents, or recovery deltas on general-knowledge benchmarks) that would allow verification that the effect is format-driven rather than an artifact of the chosen benchmarks or model sizes.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'iterative pruning schedule' is introduced without any indication of the schedule parameters or how they enter the scaling laws.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments identifying gaps in experimental controls and metric reporting. We agree these points strengthen the central claims and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'supervision format is the key driver of this tradeoff' is load-bearing for the central contribution, yet the described experiments do not isolate it. No ablations are mentioned that hold architecture, pruning schedule, loss components, and dataset fixed while varying only supervision format (or that swap domain while holding format fixed). Without such controls, the observed in-domain stability and general-knowledge recovery could be driven by interactions with the quantitative-finance domain or FinHeadlineMix composition rather than format per se.

    Authors: We agree that the current experiments do not fully isolate supervision format from domain-specific factors. The revised manuscript will add ablation studies holding architecture, pruning schedule, loss components, and dataset fixed while varying only supervision format. We will also discuss limitations of domain-specific results and why full domain-swapping experiments are outside the current scope. revision: yes

  2. Referee: [Abstract] Abstract (results paragraph): the stabilizing effect of the blended chain-of-thought loss on KL-divergence distillation is asserted without reference to concrete metrics (e.g., KL values, scaling exponents, or recovery deltas on general-knowledge benchmarks) that would allow verification that the effect is format-driven rather than an artifact of the chosen benchmarks or model sizes.

    Authors: We acknowledge that concrete metrics were not provided in the abstract or results summary. The revision will report explicit values including KL divergence before/after the blended loss, scaling exponents across formats, and recovery deltas on general-knowledge benchmarks, presented in tables and figures. revision: yes

Circularity Check

0 steps flagged

Empirical scaling laws from experiments exhibit no circularity

full rationale

The paper frames its contribution as deriving empirical scaling laws from controlled experiments on LLM distillation, pruning, and supervision formats in a quantitative-finance domain. Performance trends (in-domain vs. general-knowledge degradation, stabilization via blended CoT loss) are reported as observed outcomes across dataset sizes, compression ratios, and loss variants. No mathematical derivation chain, fitted-parameter-as-prediction step, self-definitional equations, or load-bearing self-citations appear in the provided abstract or described methodology. The central attribution to supervision format rests on comparative experimental results rather than any reduction to prior inputs by construction. This is a standard empirical study whose claims can be evaluated against the released dataset and benchmarks without internal circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Central claim rests on empirical observations from distillation experiments; free parameters are the fitted coefficients of the scaling laws relating performance to dataset size and compression ratio.

free parameters (1)
  • scaling law coefficients for performance versus compression ratio and dataset size
    Empirical scaling laws are derived by fitting functional forms to observed performance metrics.
axioms (1)
  • domain assumption In-domain task quality and general-knowledge benchmark scores are valid proxies for the underlying capabilities affected by compression and supervision.
    Used throughout the abstract to quantify degradation and recovery effects.

pith-pipeline@v0.9.1-grok · 5685 in / 1269 out tokens · 36815 ms · 2026-06-25T23:27:53.714231+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 2 canonical work pages

  1. [1]

    Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, and Tomasz Grzegorz...

  2. [2]

    Optimal brain damage

    Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touret- zky, editor,Advances in Neural Information Processing Systems, volume 2. Morgan- Kaufmann, 1989. URL https://proceedings.neurips.cc/paper_files/paper/1989/ file/6c9882bbac1c7093bd25041881277658-Paper.pdf

  3. [3]

    Compact language models via pruning and knowledge distillation

    Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. 07 2024. URL https: //arxiv.org/pdf/2407.14679.pdf

  4. [4]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 03 2015. URLhttps://arxiv.org/pdf/1503.02531.pdf

  5. [5]

    Llama-nemotron: Efficient reasoning models

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, and ...

  6. [6]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, and Chengda Lu. Deepseek-r1: Incentivizing reasoning capability in llms via...

  7. [7]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. 06 2021. URLhttps://arxiv.org/pdf/2106.09685.pdf

  8. [8]

    Lora without regret.Thinking Machines Lab: Connectionism, 2025

    John Schulman and Thinking Machines Lab. Lora without regret.Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

  9. [9]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. 01 2022. URLhttps://arxiv.org/pdf/2201.11903.pdf

  10. [10]

    Bucilua, R

    C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression.Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06: 535–541, 2006. 21

  11. [11]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. 05 2023. URL https://arxiv.org/pdf/2305.02301.pdf

  12. [12]

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. 06 2016. URL https://arxiv.org/pdf/1606.07947.pdf

  13. [13]

    On the efficacy of knowledge distillation.ICCV 2019, 10 2019

    Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation.ICCV 2019, 10 2019. URLhttps://arxiv.org/pdf/1910.01348.pdf

  14. [14]

    Knowledge distillation: Bad models can be good role models

    Gal Kaplun, Eran Malach, Preetum Nakkiran, and Shai Shalev-Shwartz. Knowledge distillation: Bad models can be good role models. 03 2022. URLhttps://arxiv.org/pdf/2203.14649. pdf

  15. [15]

    Distillation scaling laws

    Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. 02 2025. URLhttps://arxiv.org/pdf/2502.08606.pdf

  16. [16]

    Reddi, Seungyeon Kim, and Sanjiv Kumar

    Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, and Sanjiv Kumar. Why distillation helps: a statistical perspective. 05 2020. URL https://arxiv.org/ pdf/2005.10419.pdf

  17. [17]

    Dahl, and Geof- frey E

    Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geof- frey E. Hinton. Large scale distributed neural network training through online distillation. 04

  18. [18]

    URLhttps://arxiv.org/pdf/1804.03235.pdf

  19. [19]

    Knowledge distillation: A good teacher is patient and consistent

    Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. 06 2021. URL https://arxiv.org/pdf/2106.05237.pdf

  20. [20]

    On-policy distillation of language models: Learning from self- generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self- generated mistakes. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2306.13649

  21. [21]

    Hewett, Mojan Javaheripi, Piero Kauffmann, James R

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

  22. [22]

    Llama 3.2: Revolutionizing edge ai and vision with open, cus- tomizable models, September 2024

    MetaAI. Llama 3.2: Revolutionizing edge ai and vision with open, cus- tomizable models, September 2024. URL https://ai.meta.com/blog/ llama-3-2-connect-2024-vision-edge-mobile-devices/

  23. [23]

    Is deeper better only when shallow is good? InAdvances in Neural Information Processing Systems (NeurIPS), volume 32, 2019

    Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? InAdvances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 606555cf42a6719782a952aa33cfa2cb-Abstract.html

  24. [24]

    difficult

    Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, and Zhangyang Wang. Junk DNA hypothesis: Pruning small pre-trained weights irreversibly and monotonically impairs “difficult” downstream tasks in LLMs. 2023. URLhttps://arxiv.org/abs/2310.02277

  25. [25]

    Large language model adaptation for financial sentiment analysis

    Pau Rodriguez Inserte, Mariam Nakhlé, Raheel Qader, Gaetan Caillaut, and Jingshu Liu. Large language model adaptation for financial sentiment analysis. In Chung-Chi Chen, Hen- Hsen Huang, Hiroya Takamura, Hsin-Hsi Chen, Hiroki Sakaji, and Kiyoshi Izumi, editors, Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing, pag...

  26. [26]

    Small models struggle to learn from strong reasoners,

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ra- masubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners,

  27. [27]

    URLhttps://arxiv.org/abs/2502.12143

  28. [28]

    Bert vs gpt for financial engineering

    Edward Sharkey and Philip Treleaven. Bert vs gpt for financial engineering. 05 2024. URL https://arxiv.org/pdf/2405.12990.pdf

  29. [29]

    Nemotron-3-Nano-30B-A3B: Model card, 2025

    NVIDIA. Nemotron-3-Nano-30B-A3B: Model card, 2025. URL https://build.nvidia. com/nvidia/nemotron-3-nano-30b-a3b/modelcard. NVIDIA AI Foundation Model

  30. [30]

    NeMo data designer

    NVIDIA. NeMo data designer. https://nvidia-nemo.github.io/DataDesigner/ latest/, 2025. Accessed: 2025

  31. [31]

    NeMo-Curator: a toolkit for data curation

    Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Shrimai Prabhumoye, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ryan Wolf, Sarah Yurick, and Varun Singh. NeMo-Curator: a toolkit for data curation. URLhttps://github.com/NVIDIA/NeMo-Curator

  32. [32]

    Puzzle: Distillation-based nas for inference-optimized llms

    Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Roi Koren, Itay Levy, Pavlo Molchanov, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, and Ran Zilberstein. Puzzl...

  33. [33]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388. Model:https://huggingface.co/Qwen/Qwen3-32B

  34. [34]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023. URL https://arxiv.org/abs/2305. 13245

  35. [35]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/ blob/main/MODEL_CARD.md

  36. [36]

    Introducing GPT-oss

    OpenAI. Introducing GPT-oss. https://openai.com/index/introducing-gpt-oss/,

  37. [37]

    Defeating nondeterminism in LLM inference.Think- ing Machines Lab: Connectionism, 2025

    Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference.Think- ing Machines Lab: Connectionism, 2025. URL https://thinkingmachines.ai/blog/ defeating-nondeterminism-in-llm-inference/. Blog post

  38. [38]

    Shortgpt: Layers in large language models are more redundant than you expect

    Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. 03 2024. URLhttps://arxiv.org/pdf/2403.03853.pdf

  39. [39]

    The Annals of Mathematical Statistics , author =

    S. Kullback and R. A. Leibler. On information and sufficiency.The Annals of Mathematical Statistics, 22(1):79–86, 1951. doi: 10.1214/aoms/1177729694. URL http://dx.doi.org/ 10.1214/aoms/1177729694

  40. [40]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. 06 2023. URLhttps://arxiv.org/pdf/2306.08543.pdf

  41. [41]

    Patient knowledge distillation for bert model compression

    Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model compression. 08 2019. URLhttps://arxiv.org/pdf/1908.09355.pdf

  42. [42]

    NeMo: a toolkit for Conversational AI and Large Language Models

    Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, Evelina Bakh- turina, Vahid Noroozi, Sandeep Subramanian, Koluguri Nithin, Huang Jocelyn, Fei Jia, Jagadeesh Balam, Xuesong Yang, Micha Livne, Yi Dong, Sean Naren, and Boris Gins- burg. NeMo: a toolkit for Conversational AI and Large Language Models. URL https: //github.com/NVIDIA/NeMo

  43. [43]

    Tensorrt model optimizer (modelopt)

    NVIDIA. Tensorrt model optimizer (modelopt). https://github.com/NVIDIA/ TensorRT-Model-Optimizer, 2025. Version 0.33.0. 23

  44. [44]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/ 1711.05101

  45. [45]

    SGDR: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017. URL https://arxiv. org/abs/1608.03983

  46. [46]

    Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021. URL https://arxiv. org/abs/2009.03300

  47. [47]

    MMLU- Pro: A more robust and challenging multi-task language understanding benchmark, 2024

    Xiaotian Zhang, Yintao Du, Gen Luo, Yunshan Zhong, Shuo Zhang, and Rongrong Ji. MMLU- Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2402.13383

  48. [48]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, An- drei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13): 35...

  49. [49]

    Better fine-tuning by reducing representational collapse

    Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine-tuning by reducing representational collapse. InInternational Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/ 2008.03156

  50. [50]

    GLU variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

    Noam Shazeer. GLU variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. URLhttps://arxiv.org/abs/2002.05202

  51. [51]

    Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150, 2019

    Noam Shazeer. Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150, 2019. URLhttps://arxiv.org/abs/1911.02150

  52. [52]

    Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. InInternational Conference on Learning Representations (ICLR),

  53. [53]

    URLhttps://arxiv.org/abs/2310.03714

  54. [54]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), 2023. URLhttps://arxiv.org/abs/2309.06180. 24