Scaling Laws for Task-Specific LLM Distillation

Dhruv Desai; Ioana Boier; Lavinia Ghita

arxiv: 2606.24747 · v1 · pith:SMQCWGIUnew · submitted 2026-06-23 · 💻 cs.AI · cs.CE

Pith reviewed 2026-06-25 23:27 UTC · model grok-4.3

keywords: scaling laws · LLM distillation · model pruning · chain-of-thought · domain-specific compression · knowledge retention · quantitative finance

The pith

Chain-of-thought supervision recovers general knowledge that pruning erases in task-specific LLM distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives empirical scaling laws for domain-specific LLM compression by quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as the application domain, it compares logit-based and LoRA-based distillation under iterative structural pruning and introduces a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse earlier, and the central result is that supervision format drives this tradeoff with chain-of-thought supervision actively recovering general knowledge that pruning erases.

Core claim

The paper claims that empirical scaling laws for domain-specific LLM compression reveal supervision format as the key driver of the tradeoff between in-domain task performance and retention of general knowledge, with chain-of-thought supervision actively recovering general knowledge that pruning erases while a blended chain-of-thought loss stabilizes the distillation process.

What carries the argument

The blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces during iterative structural pruning.
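
This summary does not give the exact form of the blended loss. As a minimal sketch, assuming it is a convex combination of token-level forward KL divergence (teacher to student) and hard-label cross-entropy over the reasoning trace, it could look like the following. The function name `blended_cot_loss`, the mixing weight `alpha`, and the choice of forward KL are all illustrative assumptions, not the authors' definition.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def blended_cot_loss(student_logits, teacher_logits, target_ids, alpha=0.5):
    """Hypothetical blend: alpha * KL(teacher || student) + (1 - alpha) * CE.

    student_logits, teacher_logits: arrays of shape (tokens, vocab)
    target_ids: int array of shape (tokens,), the tokens of the reasoning trace
    alpha: assumed mixing weight in [0, 1]; not taken from the paper
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_teacher = np.exp(log_p_teacher)
    # Forward KL per token, averaged over the trace.
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1).mean()
    # Cross-entropy of the student against the hard tokens of the trace.
    ce = -log_p_student[np.arange(len(target_ids)), target_ids].mean()
    return alpha * kl + (1.0 - alpha) * ce
```

With `alpha=1.0` this reduces to pure KL distillation and with `alpha=0.0` to plain supervised fine-tuning on the trace. One plausible reading of "stabilizes" is that the cross-entropy term keeps the loss anchored to the actual trace tokens when the teacher and student distributions are far apart, but that interpretation is an assumption here.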

Load-bearing premise

The observed scaling behaviors and the stabilizing effect of the blended chain-of-thought loss on KL-divergence distillation are driven primarily by supervision format rather than unmeasured factors such as specific model architectures, dataset composition, or benchmark choices.

What would settle it

An experiment in which the same scaling laws and recovery effect reappear for each supervision format after model architecture, dataset composition, and benchmarks are changed would support the claim; finding that architecture or dataset shifts alter the tradeoff independently of supervision format would falsify it.

Figures

Figures reproduced from arXiv: 2606.24747 by Dhruv Desai, Ioana Boier, Lavinia Ghita.

**Figure 2.** Spectrum of teacher supervision and thinking depth (four configurations used in this paper).

**Figure 3.** Uniform iterative pruning and knowledge distillation: at each stage the model is pruned by ...

**Figure 4.** In-domain performance as a function of training dataset size. (a) Macro F1; (b) gold-label ...

**Figure 5.** General-knowledge benchmarks: MMLU and MMLU-Pro scores as a function of training ...

**Figure 6.** In-domain performance under uniform iterative pruning and distillation. (a) Macro F1; (b) ...

**Figure 7.** General-knowledge benchmarks for uniform iterative compression: MMLU and MMLU ...

**Figure 8.** In-domain performance under decayed and uniform iterative schedules. (a) Macro F1; (b) ...

**Figure 9.** General-knowledge benchmarks for decayed and uniform iterative schedules: MMLU and ...

**Figure 10.** UMAP visualization of the dataset embeddings colored by event category. The clustering ...

**Figure 11.** In-domain classification: weighted F1 and accuracy vs. dataset size for data scaling laws ...

**Figure 12.** In-domain classification: weighted F1 and accuracy vs. model size for uniform iterative ...

**Figure 13.** In-domain classification: weighted F1 and accuracy vs. model size for decayed iterative ...

Original abstract

Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare logit-based and LoRA-based distillation under iterative structural pruning, introducing a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse well before the same point; supervision format is the key driver of this tradeoff, with chain-of-thought supervision actively recovering general knowledge that pruning erases. We release the headline dataset FinHeadlineMix, scaling law results, and practical recommendations to provide a reusable framework for domain-specific compression decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The abstract claims scaling laws for finance-specific LLM distillation with a blended CoT loss recovering general knowledge, but provides no ablations or numbers to back the attribution to supervision format.

The key takeaway is that the authors derive empirical scaling laws for compressing LLMs in the quantitative finance domain and introduce a blended chain-of-thought supervision loss to mitigate loss of general knowledge during pruning. They compare logit-based and LoRA-based distillation under iterative structural pruning and release the FinHeadlineMix dataset along with their scaling law results. This is new in applying these ideas specifically to finance and in proposing the blended loss for stabilizing distillation over reasoning traces. The work does a decent job of framing the practical problem of latency and cost in domain-specific deployment and outlining how in-domain performance and general benchmarks behave differently under compression.

The main soft spot is around the central claim that supervision format is the key driver of the tradeoff. The abstract states that chain-of-thought supervision recovers general knowledge that pruning erases, but there is no indication of ablations that hold architecture, pruning schedule, and other factors fixed while varying only the supervision format. The stress-test concern is on point here; the effect could be tied to the finance dataset or benchmark choices rather than the format itself. Without those controls, the attribution isn't secured. Additionally, the abstract provides no quantitative results, error bars, or details on model sizes and data rules, so it's impossible to assess how solid the scaling laws actually are.

This paper would be of interest to researchers and engineers working on efficient LLM deployment in specialized domains like finance. It deserves a serious referee to check the full experiments and see if the ablations are there or if the claims can be strengthened.

Referee Report

2 major / 1 minor

Summary. The paper derives empirical scaling laws for domain-specific LLM compression using quantitative finance as the application domain. It compares logit-based and LoRA-based distillation under iterative structural pruning, introduces a blended chain-of-thought supervision loss to stabilize KL-divergence distillation over reasoning traces, and claims that supervision format is the key driver of the in-domain vs. general-knowledge tradeoff, with CoT supervision recovering knowledge erased by pruning. The work releases the FinHeadlineMix dataset along with scaling-law results and practical recommendations.

Significance. If the scaling laws prove robust and the attribution to supervision format can be isolated from confounds, the work would supply a reusable empirical framework for compression decisions in specialized domains. The dataset release and emphasis on practical recommendations are strengths that support reproducibility and downstream use.

major comments (2)

[Abstract] Abstract: the claim that 'supervision format is the key driver of this tradeoff' is load-bearing for the central contribution, yet the described experiments do not isolate it. No ablations are mentioned that hold architecture, pruning schedule, loss components, and dataset fixed while varying only supervision format (or that swap domain while holding format fixed). Without such controls, the observed in-domain stability and general-knowledge recovery could be driven by interactions with the quantitative-finance domain or FinHeadlineMix composition rather than format per se.
[Abstract] Abstract (results paragraph): the stabilizing effect of the blended chain-of-thought loss on KL-divergence distillation is asserted without reference to concrete metrics (e.g., KL values, scaling exponents, or recovery deltas on general-knowledge benchmarks) that would allow verification that the effect is format-driven rather than an artifact of the chosen benchmarks or model sizes.

minor comments (1)

[Abstract] Abstract: the phrase 'iterative pruning schedule' is introduced without any indication of the schedule parameters or how they enter the scaling laws.
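
To make the missing detail concrete, the sketch below shows one way a uniform and a decayed iterative schedule could be parameterized, expressed as the fraction of the original model size retained after each prune-and-distill stage. The function names, the `decay` factor, and the reading of "decayed" as larger cuts early and smaller cuts late are all assumptions for illustration; the abstract does not state the paper's actual schedule parameters.

```python
def uniform_schedule(start, target, stages):
    """Remove the same amount of the original size at every stage."""
    step = (start - target) / stages
    return [start - step * (i + 1) for i in range(stages)]

def decayed_schedule(start, target, stages, decay=0.5):
    """Remove geometrically shrinking amounts (assumed: big cuts first)."""
    weights = [decay ** i for i in range(stages)]
    total = sum(weights)
    sizes, current = [], start
    for w in weights:
        # Each stage removes its share of the total reduction.
        current -= (start - target) * w / total
        sizes.append(current)
    return sizes

# Hypothetical usage: compress to half size in three stages.
uniform = uniform_schedule(1.0, 0.5, 3)
decayed = decayed_schedule(1.0, 0.5, 3)
```

Both schedules end at the same target size, so any difference in downstream quality between them would come from how the reduction is distributed across stages rather than from the final compression ratio.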

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments identifying gaps in experimental controls and metric reporting. We agree these points strengthen the central claims and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: the claim that 'supervision format is the key driver of this tradeoff' is load-bearing for the central contribution, yet the described experiments do not isolate it. No ablations are mentioned that hold architecture, pruning schedule, loss components, and dataset fixed while varying only supervision format (or that swap domain while holding format fixed). Without such controls, the observed in-domain stability and general-knowledge recovery could be driven by interactions with the quantitative-finance domain or FinHeadlineMix composition rather than format per se.

Authors: We agree that the current experiments do not fully isolate supervision format from domain-specific factors. The revised manuscript will add ablation studies holding architecture, pruning schedule, loss components, and dataset fixed while varying only supervision format. We will also discuss limitations of domain-specific results and why full domain-swapping experiments are outside the current scope.

Revision: yes

Referee: [Abstract] Abstract (results paragraph): the stabilizing effect of the blended chain-of-thought loss on KL-divergence distillation is asserted without reference to concrete metrics (e.g., KL values, scaling exponents, or recovery deltas on general-knowledge benchmarks) that would allow verification that the effect is format-driven rather than an artifact of the chosen benchmarks or model sizes.

Authors: We acknowledge that concrete metrics were not provided in the abstract or results summary. The revision will report explicit values including KL divergence before/after the blended loss, scaling exponents across formats, and recovery deltas on general-knowledge benchmarks, presented in tables and figures.

Revision: yes

Circularity Check

0 steps flagged

Empirical scaling laws from experiments exhibit no circularity

The paper frames its contribution as deriving empirical scaling laws from controlled experiments on LLM distillation, pruning, and supervision formats in a quantitative-finance domain. Performance trends (in-domain vs. general-knowledge degradation, stabilization via blended CoT loss) are reported as observed outcomes across dataset sizes, compression ratios, and loss variants. No mathematical derivation chain, fitted-parameter-as-prediction step, self-definitional equations, or load-bearing self-citations appear in the provided abstract or described methodology. The central attribution to supervision format rests on comparative experimental results rather than any reduction to prior inputs by construction. This is a standard empirical study whose claims can be evaluated against the released dataset and benchmarks without internal circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on empirical observations from distillation experiments; free parameters are the fitted coefficients of the scaling laws relating performance to dataset size and compression ratio.

free parameters (1)

scaling law coefficients for performance versus compression ratio and dataset size
Empirical scaling laws are derived by fitting functional forms to observed performance metrics.
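
As an illustration of what fitting such coefficients involves, the sketch below fits a power law y ≈ a·x^b by ordinary least squares in log-log space. The power-law form is an assumption made here for illustration (this summary does not state which functional form the paper fits), and the data points are synthetic, not results from the paper.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ≈ a * x**b by least squares in log-log space; returns (a, b)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)

# Synthetic points generated from error = 2.0 * n**(-0.3) (illustrative only).
n = np.array([1e3, 1e4, 1e5, 1e6])
err = 2.0 * n ** -0.3
a, b = fit_power_law(n, err)
```

The fitted pair `(a, b)` plays the role of the ledger's free parameters: `b` is the scaling exponent and `a` the prefactor. A fit of this kind describes the observed range only, so extrapolating beyond the measured dataset sizes or compression ratios would need separate validation.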

axioms (1)

domain assumption: In-domain task quality and general-knowledge benchmark scores are valid proxies for the underlying capabilities affected by compression and supervision.
Used throughout the abstract to quantify degradation and recovery effects.

pith-pipeline@v0.9.1-grok · 5685 in / 1269 out tokens · 36815 ms · 2026-06-25T23:27:53.714231+00:00 · methodology


[1] [1]

Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, and Tomasz Grzegorz...

arXiv 2024

[2] [2]

Optimal brain damage

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touret- zky, editor,Advances in Neural Information Processing Systems, volume 2. Morgan- Kaufmann, 1989. URL https://proceedings.neurips.cc/paper_files/paper/1989/ file/6c9882bbac1c7093bd25041881277658-Paper.pdf

1989

[3] [3]

Compact language models via pruning and knowledge distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. 07 2024. URL https: //arxiv.org/pdf/2407.14679.pdf

arXiv 2024

[4] [4]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 03 2015. URLhttps://arxiv.org/pdf/1503.02531.pdf

Pith/arXiv arXiv 2015

[5] [5]

Llama-nemotron: Efficient reasoning models

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, and ...

arXiv 2025

[6] [6]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, and Chengda Lu. Deepseek-r1: Incentivizing reasoning capability in llms via...

Pith/arXiv arXiv 2025

[7] [7]

Lora: Low-rank adaptation of large language models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. 06 2021. URL https://arxiv.org/pdf/2106.09685.pdf

Pith/arXiv arXiv 2021

[8] [8]

Lora without regret

John Schulman and Thinking Machines Lab. Lora without regret. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

work page doi:10.64434/tml.20250929 2025

[9] [9]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. 01 2022. URL https://arxiv.org/pdf/2201.11903.pdf

Pith/arXiv arXiv 2022

[10] [10]

Model compression

C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06: 535–541, 2006.

2006

[11] [11]

Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. 05 2023. URL https://arxiv.org/pdf/2305.02301.pdf

Pith/arXiv arXiv 2023

[12] [12]

Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. 06 2016. URL https://arxiv.org/pdf/1606.07947.pdf

Pith/arXiv arXiv 2016

[13] [13]

On the efficacy of knowledge distillation

Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. ICCV 2019, 10 2019. URL https://arxiv.org/pdf/1910.01348.pdf

arXiv 2019

[14] [14]

Knowledge distillation: Bad models can be good role models

Gal Kaplun, Eran Malach, Preetum Nakkiran, and Shai Shalev-Shwartz. Knowledge distillation: Bad models can be good role models. 03 2022. URL https://arxiv.org/pdf/2203.14649.pdf

arXiv 2022

[15] [15]

Distillation scaling laws

Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russ Webb. Distillation scaling laws. 02 2025. URL https://arxiv.org/pdf/2502.08606.pdf

arXiv 2025

[16] [16]

Why distillation helps: a statistical perspective

Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, and Sanjiv Kumar. Why distillation helps: a statistical perspective. 05 2020. URL https://arxiv.org/pdf/2005.10419.pdf

arXiv 2020

[17] [17]

Large scale distributed neural network training through online distillation

Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. 04 2018. URL https://arxiv.org/pdf/1804.03235.pdf

arXiv 2018

[19] [19]

Knowledge distillation: A good teacher is patient and consistent

Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. 06 2021. URL https://arxiv.org/pdf/2106.05237.pdf

arXiv 2021

[20] [20]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2306.13649

arXiv 2024

[21] [21]

Hewett, Mojan Javaheripi, Piero Kauffmann, James R

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

2024

[22] [22]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

MetaAI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, September 2024. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

2024

[23] [23]

Is deeper better only when shallow is good?

Eran Malach and Shai Shalev-Shwartz. Is deeper better only when shallow is good? In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/606555cf42a6719782a952aa33cfa2cb-Abstract.html

2019

[24] [24]

Junk DNA hypothesis: Pruning small pre-trained weights irreversibly and monotonically impairs “difficult” downstream tasks in LLMs

Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, and Zhangyang Wang. Junk DNA hypothesis: Pruning small pre-trained weights irreversibly and monotonically impairs “difficult” downstream tasks in LLMs. 2023. URL https://arxiv.org/abs/2310.02277

Pith/arXiv arXiv 2023

[25] [25]

Large language model adaptation for financial sentiment analysis

Pau Rodriguez Inserte, Mariam Nakhlé, Raheel Qader, Gaetan Caillaut, and Jingshu Liu. Large language model adaptation for financial sentiment analysis. In Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, Hsin-Hsi Chen, Hiroki Sakaji, and Kiyoshi Izumi, editors, Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing, pag...

2023

[26] [26]

Small models struggle to learn from strong reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners, 2025. URL https://arxiv.org/abs/2502.12143

arXiv 2025

[28] [28]

Bert vs gpt for financial engineering

Edward Sharkey and Philip Treleaven. Bert vs gpt for financial engineering. 05 2024. URL https://arxiv.org/pdf/2405.12990.pdf

arXiv 2024

[29] [29]

Nemotron-3-Nano-30B-A3B: Model card

NVIDIA. Nemotron-3-Nano-30B-A3B: Model card, 2025. URL https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard. NVIDIA AI Foundation Model

2025

[30] [30]

NeMo data designer

NVIDIA. NeMo data designer. https://nvidia-nemo.github.io/DataDesigner/latest/, 2025. Accessed: 2025

2025

[31] [31]

NeMo-Curator: a toolkit for data curation

Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Shrimai Prabhumoye, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ryan Wolf, Sarah Yurick, and Varun Singh. NeMo-Curator: a toolkit for data curation. URL https://github.com/NVIDIA/NeMo-Curator

[32] [32]

Puzzle: Distillation-based nas for inference-optimized llms

Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Roi Koren, Itay Levy, Pavlo Molchanov, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, and Ran Zilberstein. Puzzl...

arXiv 2024

[33] [33]

Qwen3 technical report

Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388. Model: https://huggingface.co/Qwen/Qwen3-32B

Pith/arXiv arXiv 2025

[34] [34]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023. URL https://arxiv.org/abs/2305.13245

Pith/arXiv arXiv 2023

[35] [35]

Llama 3 model card

AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

2024

[36] [36]

Introducing GPT-oss

OpenAI. Introducing GPT-oss. https://openai.com/index/introducing-gpt-oss/, 2025.

[37] [37]

Defeating nondeterminism in LLM inference

Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. Thinking Machines Lab: Connectionism, 2025. URL https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/. Blog post

2025

[38] [38]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. 03 2024. URL https://arxiv.org/pdf/2403.03853.pdf

arXiv 2024

[39] [39]

On information and sufficiency

S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. doi: 10.1214/aoms/1177729694. URL http://dx.doi.org/10.1214/aoms/1177729694

work page doi:10.1214/aoms/1177729694 1951

[40] [40]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. 06 2023. URL https://arxiv.org/pdf/2306.08543.pdf

Pith/arXiv arXiv 2023

[41] [41]

Patient knowledge distillation for bert model compression

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model compression. 08 2019. URL https://arxiv.org/pdf/1908.09355.pdf

arXiv 2019

[42] [42]

NeMo: a toolkit for Conversational AI and Large Language Models

Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, Evelina Bakhturina, Vahid Noroozi, Sandeep Subramanian, Koluguri Nithin, Huang Jocelyn, Fei Jia, Jagadeesh Balam, Xuesong Yang, Micha Livne, Yi Dong, Sean Naren, and Boris Ginsburg. NeMo: a toolkit for Conversational AI and Large Language Models. URL https://github.com/NVIDIA/NeMo

[43] [43]

Tensorrt model optimizer (modelopt)

NVIDIA. Tensorrt model optimizer (modelopt). https://github.com/NVIDIA/TensorRT-Model-Optimizer, 2025. Version 0.33.0.

2025

[44] [44]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1711.05101

Pith/arXiv arXiv 2019

[45] [45]

SGDR: Stochastic gradient descent with warm restarts

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017. URL https://arxiv.org/abs/1608.03983

Pith/arXiv arXiv 2017

[46] [46]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2009.03300

Pith/arXiv arXiv 2021

[47] [47]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Xiaotian Zhang, Yintao Du, Gen Luo, Yunshan Zhong, Shuo Zhang, and Rongrong Ji. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2402.13383

arXiv 2024

[48] [48]

Overcoming catastrophic forgetting in neural networks

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 35...

arXiv 2017

[49] [49]

Better fine-tuning by reducing representational collapse

Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2008.03156

arXiv 2021

[50] [50]

GLU variants improve transformer

Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. URL https://arxiv.org/abs/2002.05202

Pith/arXiv arXiv 2020

[51] [51]

Fast transformer decoding: One write-head is all you need

Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019. URL https://arxiv.org/abs/1911.02150

Pith/arXiv arXiv 2019

[52] [52]

DSPy: Compiling declarative language model calls into self-improving pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2310.03714

Pith/arXiv arXiv 2024

[54] [54]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), 2023. URL https://arxiv.org/abs/2309.06180.

Pith/arXiv arXiv 2023