pith. machine review for the scientific record.

arxiv: 2604.05426 · v2 · submitted 2026-04-07 · 💻 cs.LG · cs.AI · cs.DC


ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

Fanjiang Ye, Jingwei Zuo, Kaijian Wang, Xinze Feng, Ye Cao, Yuke Wang, Zhuang Wang, Zien Liu

Pith reviewed 2026-05-10 20:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords LoRA tuning · hyperparameter optimization · early termination · GPU scheduling · parameter-efficient fine-tuning · multi-tenant systems · large language models · adaptive orchestration

The pith

ALTO accelerates concurrent LoRA hyperparameter tuning by up to 13.8× by terminating weak configurations early and orchestrating surviving jobs on shared backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALTO as a system that manages many simultaneous LoRA tuning jobs running on the same frozen model backbone in multi-tenant clusters. It claims that loss-trajectory monitoring can safely stop unpromising runs early, while fused grouped matrix operations and rank-local parallelism let surviving adapters share GPU resources tightly and reclaim idle capacity. A hybrid scheduler then places the remaining jobs across heterogeneous tasks by using their predictable runtimes. Readers would care because LoRA hyperparameter search is expensive and often leaves hardware underused when each job is handled alone. If the approach holds, it reduces the compute cost of finding good adapters without lowering their final quality.

Core claim

ALTO is a co-designed training system for LoRA hyperparameter tuning and orchestration. Its central insight is that concurrent jobs over a shared frozen backbone create opportunities single-job systems miss. The system monitors loss trajectories to end weak configurations early, applies fused grouped GEMM together with rank-local adapter parallelism to co-locate surviving adapters and free GPU capacity, and combines intra-task and inter-task scheduling that exploits predictable LoRA job durations. Evaluation reports up to a 13.8× speedup over prior systems while preserving adapter quality.
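
To see the execution idea concretely, below is a minimal PyTorch sketch of several rank-heterogeneous LoRA adapters sharing one frozen backbone layer. This is not ALTO's kernel: the paper fuses the per-adapter updates into a single Triton grouped GEMM, which the explicit Python loop here only emulates, and every shape and constant is illustrative.

```python
# Minimal sketch: co-locating several LoRA jobs over one frozen backbone
# layer. ALTO launches the per-adapter low-rank updates as one fused
# grouped GEMM; the loop below emulates that grouping in plain PyTorch.
import torch

torch.manual_seed(0)
d_model, batch_per_job = 64, 4

# One frozen backbone weight, shared by every tuning job.
W = torch.randn(d_model, d_model)

# Heterogeneous adapters: each job picks its own rank r and LoRA pair (A, B).
ranks = [4, 8, 16]
adapters = [(torch.randn(r, d_model) * 0.01, torch.zeros(d_model, r))
            for r in ranks]

# Each job contributes a micro-batch; the frozen backbone GEMM runs once
# over the concatenated batch, amortizing the shared weights.
xs = [torch.randn(batch_per_job, d_model) for _ in ranks]
x_all = torch.cat(xs, dim=0)
base_out = x_all @ W.T

# Per-adapter low-rank updates: (b, d) -> (b, r) -> (b, d). A fused grouped
# GEMM would run these rank-heterogeneous groups in a single kernel.
updates = [x @ A.T @ B.T for x, (A, B) in zip(xs, adapters)]
y = base_out + torch.cat(updates, dim=0)
print(y.shape)  # torch.Size([12, 64])
```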

What carries the argument

Loss-trajectory monitoring for early termination of unpromising LoRA configurations, fused grouped GEMM with rank-local adapter parallelism for co-locating survivors on shared backbones, and hybrid intra-task plus inter-task scheduling that uses predictable job durations.
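
Of these three legs, scheduling is the easiest to miniaturize. The sketch below is not ALTO's scheduler (the paper combines intra-task and inter-task placement, and its reference list includes the CP-SAT solver); it is a toy longest-processing-time heuristic that only shows why predictable job durations make packing heterogeneous jobs tractable. All durations are invented.

```python
# Toy duration-aware placement: give the next-longest predicted job to the
# currently least-loaded GPU. Illustrative only; not ALTO's algorithm.
import heapq

def place_jobs(predicted_secs, n_gpus):
    """Return per-GPU lists of job indices, greedily balancing load."""
    heap = [(0.0, g) for g in range(n_gpus)]   # (busy seconds, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_gpus)]
    for j in sorted(range(len(predicted_secs)),
                    key=lambda j: predicted_secs[j], reverse=True):
        busy, g = heapq.heappop(heap)
        assignment[g].append(j)
        heapq.heappush(heap, (busy + predicted_secs[j], g))
    return assignment

# Heterogeneous jobs: rank and batch-size choices make durations differ.
durations = [120.0, 95.0, 240.0, 60.0, 180.0, 30.0]
print(place_jobs(durations, n_gpus=2))
# -> [[2, 1, 5], [4, 0, 3]]: per-GPU loads of 365 s and 360 s
```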

If this is right

  • Early termination cuts wasted computation on low-performing LoRA candidates.
  • Reclaimed GPU capacity from terminated jobs increases the number of concurrent adapters that can run.
  • Hybrid scheduling improves overall cluster utilization when tasks have different resource needs.
  • The system maintains final adapter quality across heterogeneous multi-tenant workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The loss-monitoring approach for pruning tuning runs could extend to other hyperparameter searches in deep learning.
  • Cloud platforms could adopt similar orchestration to lower costs for users running batches of fine-tuning experiments.
  • Rank-local parallelism ideas might apply to other parameter-efficient methods that also attach small modules to a frozen backbone.

Load-bearing premise

Loss trajectories can reliably flag unpromising LoRA configurations for early termination without discarding high-quality ones, and LoRA job durations are predictable enough that the combined scheduling produces the claimed gains.
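
That premise is checkable in miniature: across configurations, correlate the loss ranking at a warmup point with the final ranking, as the paper's Figure 7 does. The sketch below runs this check on synthetic loss curves; the decay model, noise level, and 20% warmup fraction are invented for illustration, and SciPy's spearmanr supplies the statistic.

```python
# Synthetic check of the load-bearing premise: if early rankings of
# configurations track final rankings, pruning on trajectories is safe.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_configs, n_steps = 20, 100

# Simulated loss curves: each config decays toward its own floor.
floors = rng.uniform(0.5, 2.0, n_configs)
steps = np.arange(1, n_steps + 1)
losses = (floors[:, None]
          + 2.0 * np.exp(-steps / 30.0)[None, :]
          + rng.normal(0, 0.02, (n_configs, n_steps)))

warmup = int(0.2 * n_steps)                 # observe only 20% of training
rho, _ = spearmanr(losses[:, warmup - 1], losses[:, -1])
print(f"rank correlation after {warmup} steps: {rho:.3f}")
# A high rho supports pruning; loss curves that cross later would lower it.
```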

What would settle it

A controlled run in which a configuration terminated early by ALTO later reaches higher final performance than configurations allowed to finish, or a heterogeneous workload test that shows no measurable speedup despite the scheduling changes.
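
The first half of that test reduces to a regret computation: train every configuration to completion offline, then ask whether any configuration the system pruned would have beaten the best survivor. A minimal sketch, with invented scores and a hypothetical helper name:

```python
# Falsification check: positive regret means a pruned configuration would
# have beaten every survivor, contradicting the quality-preservation claim.
def pruning_regret(final_scores, pruned):
    """final_scores: config name -> final metric; pruned: set of names."""
    best_survivor = max(v for k, v in final_scores.items() if k not in pruned)
    best_pruned = max((v for k, v in final_scores.items() if k in pruned),
                      default=float("-inf"))
    return best_pruned - best_survivor

scores = {"cfg-a": 0.71, "cfg-b": 0.64, "cfg-c": 0.69}
print(pruning_regret(scores, pruned={"cfg-b", "cfg-c"}))  # -0.02: claim holds
```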

Figures

Figures reproduced from arXiv: 2604.05426 by Fanjiang Ye, Jingwei Zuo, Kaijian Wang, Xinze Feng, Ye Cao, Yuke Wang, Zhuang Wang, Zien Liu.

Figure 1: Different hyperparameters yield significantly vary…

Figure 2: System Overview of ALTO.

Figure 4: GPU memory and average SM utilization when…

Figure 6: The illustration of the loss curves of three typical…

Figure 7: Rank correlation between validation loss at the end…

Figure 8: Two executor modes in ALTO’s batched execution engine. (a) Multi-GPU executor with adapter parallelism: base model weights are sharded across ranks and synchronized via all-gather, while each rank trains a disjoint set of LoRA jobs locally via grouped GEMM, so no adapter gradients cross rank boundaries. (b) Single-GPU executor: the full base model and multiple LoRA jobs reside on a single…

Figure 9: End-to-end training speedup of ALTO across single-GPU and multi-GPU configurations. From left to right: Llama-3.1-8B and Qwen2.5-7B on a single H100 GPU, Qwen2.5-32B on 2× H100, and Llama-3.1-70B on 4× H100. Each configuration trains 60 (single-GPU) or 64 (multi-GPU) heterogeneous LoRA adapters with varied ranks, batch sizes, and learning rates across three datasets. ALTO achieves up to 9.5× speedup on sin…

Figure 10: Model quality of the best configuration found by…

Figure 12: Evaluation of ALTO components on 8-GPU training makespan. B = Batched LoRA, S = Scheduler, EE = Early Exit. The full system (B+S+EE) achieves a 5.2× reduction in makespan compared to batching alone (B), with early exit contributing the largest individual gain.

Figure 13: Adapter Parallelism (AP) microbenchmark on…

Figure 15: Training samples saved by each early-exit pat…

Figure 16: Sensitivity of early exit predictions to warmup percentage.
Original abstract

Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over state-of-the-art without sacrificing adapter quality.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces ALTO, a co-designed system for LoRA hyperparameter tuning and orchestration across heterogeneous tasks in multi-tenant GPU clusters. It monitors loss trajectories to early-terminate unpromising adapter configurations, employs fused grouped GEMM and rank-local adapter parallelism to co-locate surviving jobs and reclaim capacity, and combines intra- and inter-task scheduling that exploits predictable LoRA job durations. The central empirical claim is an up to 13.8× speedup over state-of-the-art baselines while preserving adapter quality.

Significance. If the performance and quality claims are substantiated, ALTO would represent a practical advance in efficient multi-tenant LoRA serving by reducing wasted computation on weak hyperparameter candidates and improving cluster utilization. The combination of trajectory-based pruning with fused execution and heterogeneous scheduling is a concrete systems contribution that could influence production fine-tuning pipelines. The work is grounded in empirical measurements rather than new theory, which is appropriate for the systems/ML intersection but places a premium on experimental rigor.

major comments (3)
  1. §3.2 (Loss Trajectory Monitoring): The early-termination policy is load-bearing for both the reported speedup and the 'no quality sacrifice' guarantee, yet the manuscript provides insufficient detail on the exact heuristic (threshold, slope test, or window size) and its behavior on non-monotonic or task-dependent loss curves. LoRA convergence is known to be heterogeneous; a configuration that appears inferior after 10–20% of steps can overtake others later. Without an ablation showing false-negative rates or a comparison of final adapter quality (e.g., downstream task metrics) between pruned and fully trained runs, the quality invariant cannot be verified. (A hypothetical instantiation of such a rule is sketched after this list.)
  2. §4.2 (Experimental Methodology): The abstract asserts a 13.8× speedup and preserved quality, but the evaluation section lacks explicit enumeration of baselines (which SOTA systems?), datasets, number of independent runs, statistical significance tests, or error bars. Without these, the magnitude of the speedup and the claim of no quality degradation cannot be assessed for robustness across heterogeneous task mixes.
  3. §4.3 (Quality Evaluation): Adapter quality is asserted to be preserved, but the metrics used (validation loss only, or downstream accuracy/F1 on held-out tasks?) are not clearly stated. If quality is measured solely on the same loss trajectories used for pruning, the evaluation is circular and does not demonstrate that pruned configurations would not have yielded superior final adapters on the target tasks.
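
To make the first major comment concrete, here is one hypothetical instantiation of a loss-trajectory early-exit rule of the kind the referee is asking the authors to specify. The warmup, window, and margin values are invented; the paper's actual test may differ in form, not just in constants.

```python
# One plausible early-exit rule (illustrative, not ALTO's documented one):
# after a warmup period, prune a job only if its window-smoothed loss trails
# the best peer's by a relative margin, guarding against curves that recover.
from collections import deque

class EarlyExitMonitor:
    def __init__(self, warmup_steps=200, window=50, margin=0.05):
        self.warmup_steps = warmup_steps
        self.window = deque(maxlen=window)
        self.margin = margin
        self.step = 0

    def observe(self, loss, best_peer_loss):
        """Return True if this configuration should be terminated.
        best_peer_loss is the smoothed loss of the current best peer job."""
        self.step += 1
        self.window.append(loss)
        if self.step < self.warmup_steps or len(self.window) < self.window.maxlen:
            return False                     # never prune before warmup
        smoothed = sum(self.window) / len(self.window)
        return smoothed > best_peer_loss * (1.0 + self.margin)

# Usage: monitor = EarlyExitMonitor(); stop = monitor.observe(loss, best)
```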
minor comments (3)
  1. Figure 3 (scheduling diagram): The caption and legend do not clearly distinguish intra-task vs. inter-task placement decisions; add explicit annotations or a small example trace.
  2. §3.1 (notation): The definition of 'rank-local parallelism' reuses the symbol R for both adapter rank and a runtime variable; introduce distinct symbols to avoid confusion.
  3. §2 (missing reference): The related-work discussion of prior LoRA tuning systems should cite the specific papers whose throughput numbers are used as baselines in Table 2.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the early-termination policy, experimental methodology, and quality evaluation.

Point-by-point responses
  1. Referee: §3.2 (Loss Trajectory Monitoring): The early-termination policy is load-bearing for both the reported speedup and the 'no quality sacrifice' guarantee, yet the manuscript provides insufficient detail on the exact heuristic (threshold, slope test, or window size) and its behavior on non-monotonic or task-dependent loss curves. LoRA convergence is known to be heterogeneous; a configuration that appears inferior after 10–20% of steps can overtake others later. Without an ablation showing false-negative rates or a comparison of final adapter quality (e.g., downstream task metrics) between pruned and fully trained runs, the quality invariant cannot be verified.

    Authors: We agree that the description of the loss trajectory monitoring heuristic requires greater precision. In the revised manuscript we will expand §3.2 to specify the exact threshold, slope test, window size, and the mechanism used to accommodate non-monotonic or task-dependent curves. We will also add an ablation study (in the main text or appendix) that reports false-negative rates across the evaluated tasks and directly compares downstream task metrics (accuracy/F1) of pruned versus fully trained adapters, thereby substantiating the quality-preservation claim. revision: yes

  2. Referee: §4.2 (Experimental Methodology): The abstract asserts a 13.8× speedup and preserved quality, but the evaluation section lacks explicit enumeration of baselines (which SOTA systems?), datasets, number of independent runs, statistical significance tests, or error bars. Without these, the magnitude of the speedup and the claim of no quality degradation cannot be assessed for robustness across heterogeneous task mixes.

    Authors: We concur that §4.2 should be more explicit. The revised version will enumerate the precise state-of-the-art baselines, list all datasets and heterogeneous task mixes, report the number of independent runs, and include statistical significance tests together with error bars so that the speedup magnitude and quality claims can be evaluated for robustness. revision: yes

  3. Referee: §4.3 (Quality Evaluation): Adapter quality is asserted to be preserved, but the metrics used (validation loss only, or downstream accuracy/F1 on held-out tasks?) are not clearly stated. If quality is measured solely on the same loss trajectories used for pruning, the evaluation is circular and does not demonstrate that pruned configurations would not have yielded superior final adapters on the target tasks.

    Authors: We will revise §4.3 to state explicitly that adapter quality is measured by downstream task metrics (accuracy and F1 on held-out test sets) that are independent of the validation loss used for pruning decisions. We will further add side-by-side comparisons of these downstream metrics for early-terminated versus fully trained adapters, directly addressing the concern that the evaluation might be circular. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems evaluation with no self-referential derivations

full rationale

The paper presents ALTO as an engineering system for concurrent LoRA tuning, relying on loss-trajectory monitoring for early termination, fused GEMM kernels, rank-local parallelism, and combined intra/inter-task scheduling. All performance claims (e.g., 13.8× speedup) are supported by direct cluster measurements on heterogeneous workloads rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central result to its own inputs. No equations, ansatzes, or uniqueness theorems appear; the evaluation is externally falsifiable via reproduction on the same hardware and tasks. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper; the central claims rest on engineering design choices and measured speedups rather than mathematical axioms or new physical entities.

pith-pipeline@v0.9.0 · 5545 in / 1146 out tokens · 26140 ms · 2026-05-10T20:16:40.659620+00:00 · methodology



Reference graph

Works this paper leans on

68 extracted references · 20 canonical work pages · 12 internal anchors

  1. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
  2. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
  3. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021.
  4. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974, 2024.
  5. Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language Models Are Multilingual Chain-of-Thought Reasoners. arXiv preprint arXiv:2210.03057, 2022.
  6. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
  7. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv preprint arXiv:2403.04132, 2024.
  8. Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 36:44123–44279, 2023.
  9. Eason Chen, Jia-En Lee, Jionghao Lin, and Kenneth Koedinger. GPTutor: Great Personalized Tutor with Large Language Models for Personalized Learning Content Generation. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, pages 539–541, 2024.
  10. Jochen Wulf and Juerg Meierhofer. Exploring the Potential of Large Language Models for Automation in Technical Customer Service. arXiv preprint arXiv:2405.09161, 2024.
  11. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.
  12. Fireworks AI. Fireworks AI Fine-Tuning, 2024.
  13. Together AI. https://www.together.ai/, 2024.
  14. Thinking Machines Lab. Tinker: A Flexible API for Fine-Tuning Open Source Models with LoRA, 2025.
  15. Daniel Han, Michael Han, and the Unsloth team. Unsloth, 2023.
  16. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), ACL 2024, Bangkok, Thailand, pages 400–410. Association for Computational Linguistics, 2024.
  17. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  18. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A Flexible and Efficient RLHF Framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
  19. Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, Benjamin Bossan, and Marian Tietz. PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods. https://github.com/huggingface/peft, 2022.
  20. Sheng Lin, Fangcheng Fu, Haoyang Li, Hao Ge, Xuanyu Wang, Jiawen Niu, Yaofeng Tu, and Bin Cui. Lobra: Multi-Tenant Fine-Tuning over Heterogeneous Data. Proc. VLDB Endow., 18(8):2616–2625, April 2025.
  21. Zhengmao Ye, Dengchun Li, Zetao Hu, Tingfeng Lan, Jian Sha, Shicong Zhang, Lei Duan, Jie Zuo, Hui Lu, Yuanchun Zhou, and Mingjie Tang. mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs. Proc. VLDB Endow., 18(6):1948–1961, February 2025.
  22. Google Cloud. Tune Gemini Models by Using Supervised Fine-Tuning, 2024. Accessed: 2026-03-28.
  23. Amazon Web Services. Fine-Tune Foundation Models with Amazon SageMaker JumpStart, 2024. Accessed: 2026-03-28.
  24. Microsoft. Customize a Model with Fine-Tuning in Azure OpenAI, 2024. Accessed: 2026-03-28.
  25. Minghao Yan, Zhuang Wang, Zhen Jia, Shivaram Venkataraman, and Yida Wang. Plora: Efficient LoRA Hyperparameter Tuning for Large Models. arXiv preprint arXiv:2508.02932, 2025.
  26. Jingkai Guo, Asmer Ali, Li Yang, and Deliang Fan. LoRAFusion: A Crossbar-Aware Multi-Task Adaption Framework via Efficient Fusion of Pretrained LoRA Modules. In Proceedings of the Great Lakes Symposium on VLSI 2025, GLSVLSI '25, pages 777–783, New York, NY, USA, 2025. Association for Computing Machinery.
  27. Kevin Li, Dibyadeep Saha, Avni Kanodia, and Fan Lai. tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models. arXiv preprint arXiv:2602.07263, 2026.
  28. Lutz Prechelt. Early Stopping - But When? In Neural Networks: Tricks of the Trade, pages 55–69. Springer, 2002.
  29. Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277, 2023.
  30. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2020), page 20. IEEE/ACM, 2020.
  31. Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
  32. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. Advances in Neural Information Processing Systems, 32, 2019.
  33. Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
  34. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.
  35. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
  36. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  37. Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  38. Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021.
  39. Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-Tuning: Prompt Tuning Can Be Comparable to Fine-Tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, 2022.
  40. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems, 36:10088–10115, 2023.
  41. Anyscale. https://www.anyscale.com/, 2024.
  42. Lambda. https://lambdalabs.com/, 2024.
  43. Andreessen Horowitz. 16 Changes to the Way Enterprises Are Building and Buying Generative AI, 2024. Accessed: 2025.
  44. Predibase. LoRAX: Multi-LoRA Inference Server That Scales to 1000s of Fine-Tuned LLMs. GitHub, 2024.
  45. John Schulman and Thinking Machines Lab. LoRA Without Regret. Thinking Machines Lab: Connectionism.
  46. https://thinkingmachines.ai/blog/lora/
  47. Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, and Willie Neiswanger. Tina: Tiny Reasoning Models via LoRA. arXiv preprint arXiv:2504.15777, 2025.
  48. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  49. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
  50. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.
  51. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  52. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv preprint arXiv:2006.15704, 2020.
  53. Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv preprint arXiv:2311.03285, 2023.
  54. Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-Tenant LoRA Serving. Proceedings of Machine Learning and Systems, 6:1–13, 2024.
  55. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.
  56. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024.
  57. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021.
  58. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv preprint arXiv:2411.15124, 2024.
  59. Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data Recipes for Reasoning Models. arXiv preprint arXiv:2506.04178, 2025.
  60. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  61. Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting Language Models with High-Quality Feedback. 2023.
  62. Sangyoon Lee and Jaeho Lee. Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA. arXiv preprint arXiv:2602.09492, 2026.
  63. Philippe Tillet, H. T. Kung, and David Cox. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019), pages 10–19, New York, NY, USA, 2019. Association for Computing Machinery.
  64. Laurent Perron and Frédéric Didier. CP-SAT.
  65. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.
  66. Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-Tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A System for Massively Parallel Hyperparameter Tuning. Proceedings of Machine Learning and Systems, 2:230–246, 2020.
  67. Tobias Domhan, Jost Tobias Springenberg, Frank Hutter, et al. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI, volume 15, pages 3460–3468, 2015.
  68. Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 911–927, 2024.