pith. machine review for the scientific record.

arxiv: 2604.05426 · v2 · submitted 2026-04-07 · 💻 cs.LG · cs.AI · cs.DC


ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

Fanjiang Ye, Jingwei Zuo, Kaijian Wang, Xinze Feng, Ye Cao, Yuke Wang, Zhuang Wang, Zien Liu

Pith reviewed 2026-05-10 20:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords LoRA tuning · hyperparameter optimization · early termination · GPU scheduling · parameter-efficient fine-tuning · multi-tenant systems · large language models · adaptive orchestration

The pith

ALTO accelerates concurrent LoRA hyperparameter tuning by up to 13.8× by terminating weak configurations early and orchestrating surviving jobs on shared backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALTO as a system that manages many simultaneous LoRA tuning jobs running on the same frozen model backbone in multi-tenant clusters. It claims that loss-trajectory monitoring can safely stop unpromising runs early, while fused grouped matrix operations and rank-local parallelism let surviving adapters share GPU resources tightly and reclaim idle capacity. A hybrid scheduler then places the remaining jobs across heterogeneous tasks by using their predictable runtimes. Readers would care because LoRA hyperparameter search is expensive and often leaves hardware underused when each job is handled alone. If the approach holds, it reduces the compute cost of finding good adapters without lowering their final quality.

Core claim

ALTO is a co-designed training system for LoRA hyperparameter tuning and orchestration. Its central insight is that concurrent jobs over a shared frozen backbone create opportunities single-job systems miss. The system monitors loss trajectories to end weak configurations early, applies fused grouped GEMM together with rank-local adapter parallelism to co-locate surviving adapters and free GPU capacity, and combines intra-task and inter-task scheduling that exploits predictable LoRA job durations. Evaluation reports up to a 13.8× speedup over prior systems while preserving adapter quality.
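
To see the execution idea concretely, below is a minimal PyTorch sketch of several rank-heterogeneous LoRA adapters sharing one frozen backbone layer. This is not ALTO's kernel: the paper fuses the per-adapter updates into a single Triton grouped GEMM, which the explicit Python loop here only emulates, and every shape and constant is illustrative.

```python
# Minimal sketch: co-locating several LoRA jobs over one frozen backbone
# layer. ALTO launches the per-adapter low-rank updates as one fused
# grouped GEMM; the loop below emulates that grouping in plain PyTorch.
import torch

torch.manual_seed(0)
d_model, batch_per_job = 64, 4

# One frozen backbone weight, shared by every tuning job.
W = torch.randn(d_model, d_model)

# Heterogeneous adapters: each job picks its own rank r and LoRA pair (A, B).
ranks = [4, 8, 16]
adapters = [(torch.randn(r, d_model) * 0.01, torch.zeros(d_model, r))
            for r in ranks]

# Each job contributes a micro-batch; the frozen backbone GEMM runs once
# over the concatenated batch, amortizing the shared weights.
xs = [torch.randn(batch_per_job, d_model) for _ in ranks]
x_all = torch.cat(xs, dim=0)
base_out = x_all @ W.T

# Per-adapter low-rank updates: (b, d) -> (b, r) -> (b, d). A fused grouped
# GEMM would run these rank-heterogeneous groups in a single kernel.
updates = [x @ A.T @ B.T for x, (A, B) in zip(xs, adapters)]
y = base_out + torch.cat(updates, dim=0)
print(y.shape)  # torch.Size([12, 64])
```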

What carries the argument

Loss-trajectory monitoring for early termination of unpromising LoRA configurations, fused grouped GEMM with rank-local adapter parallelism for co-locating survivors on shared backbones, and hybrid intra-task plus inter-task scheduling that uses predictable job durations.
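
Of these three legs, scheduling is the easiest to miniaturize. The sketch below is not ALTO's scheduler (the paper combines intra-task and inter-task placement, and its reference list includes the CP-SAT solver); it is a toy longest-processing-time heuristic that only shows why predictable job durations make packing heterogeneous jobs tractable. All durations are invented.

```python
# Toy duration-aware placement: give the next-longest predicted job to the
# currently least-loaded GPU. Illustrative only; not ALTO's algorithm.
import heapq

def place_jobs(predicted_secs, n_gpus):
    """Return per-GPU lists of job indices, greedily balancing load."""
    heap = [(0.0, g) for g in range(n_gpus)]   # (busy seconds, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_gpus)]
    for j in sorted(range(len(predicted_secs)),
                    key=lambda j: predicted_secs[j], reverse=True):
        busy, g = heapq.heappop(heap)
        assignment[g].append(j)
        heapq.heappush(heap, (busy + predicted_secs[j], g))
    return assignment

# Heterogeneous jobs: rank and batch-size choices make durations differ.
durations = [120.0, 95.0, 240.0, 60.0, 180.0, 30.0]
print(place_jobs(durations, n_gpus=2))
# -> [[2, 1, 5], [4, 0, 3]]: per-GPU loads of 365 s and 360 s
```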

If this is right

  • Early termination cuts wasted computation on low-performing LoRA candidates.
  • Reclaimed GPU capacity from terminated jobs increases the number of concurrent adapters that can run.
  • Hybrid scheduling improves overall cluster utilization when tasks have different resource needs.
  • The system maintains final adapter quality across heterogeneous multi-tenant workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The loss-monitoring approach for pruning tuning runs could extend to other hyperparameter searches in deep learning.
  • Cloud platforms could adopt similar orchestration to lower costs for users running batches of fine-tuning experiments.
  • Rank-local parallelism ideas might apply to other parameter-efficient methods that also attach small modules to a frozen backbone.

Load-bearing premise

Loss trajectories can reliably flag unpromising LoRA configurations for early termination without discarding high-quality ones, and LoRA job durations are predictable enough that the combined scheduling produces the claimed gains.
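
That premise is checkable in miniature: across configurations, correlate the loss ranking at a warmup point with the final ranking, as the paper's Figure 7 does. The sketch below runs this check on synthetic loss curves; the decay model, noise level, and 20% warmup fraction are invented for illustration, and SciPy's spearmanr supplies the statistic.

```python
# Synthetic check of the load-bearing premise: if early rankings of
# configurations track final rankings, pruning on trajectories is safe.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_configs, n_steps = 20, 100

# Simulated loss curves: each config decays toward its own floor.
floors = rng.uniform(0.5, 2.0, n_configs)
steps = np.arange(1, n_steps + 1)
losses = (floors[:, None]
          + 2.0 * np.exp(-steps / 30.0)[None, :]
          + rng.normal(0, 0.02, (n_configs, n_steps)))

warmup = int(0.2 * n_steps)                 # observe only 20% of training
rho, _ = spearmanr(losses[:, warmup - 1], losses[:, -1])
print(f"rank correlation after {warmup} steps: {rho:.3f}")
# A high rho supports pruning; loss curves that cross later would lower it.
```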

What would settle it

A controlled run in which a configuration terminated early by ALTO later reaches higher final performance than configurations allowed to finish, or a heterogeneous workload test that shows no measurable speedup despite the scheduling changes.
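
The first half of that test reduces to a regret computation: train every configuration to completion offline, then ask whether any configuration the system pruned would have beaten the best survivor. A minimal sketch, with invented scores and a hypothetical helper name:

```python
# Falsification check: positive regret means a pruned configuration would
# have beaten every survivor, contradicting the quality-preservation claim.
def pruning_regret(final_scores, pruned):
    """final_scores: config name -> final metric; pruned: set of names."""
    best_survivor = max(v for k, v in final_scores.items() if k not in pruned)
    best_pruned = max((v for k, v in final_scores.items() if k in pruned),
                      default=float("-inf"))
    return best_pruned - best_survivor

scores = {"cfg-a": 0.71, "cfg-b": 0.64, "cfg-c": 0.69}
print(pruning_regret(scores, pruned={"cfg-b", "cfg-c"}))  # -0.02: claim holds
```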

Figures

Figures reproduced from arXiv: 2604.05426 by Fanjiang Ye, Jingwei Zuo, Kaijian Wang, Xinze Feng, Ye Cao, Yuke Wang, Zhuang Wang, Zien Liu.

Figure 1: Different hyperparameters yield significantly vary…

Figure 2: System Overview of ALTO.

Figure 4: GPU memory and average SM utilization when…

Figure 6: The illustration of the loss curves of three typical…

Figure 7: Rank correlation between validation loss at the end…

Figure 8: Two executor modes in ALTO’s batched execution engine. (a) Multi-GPU executor with adapter parallelism: base model weights are sharded across ranks and synchronized via all-gather, while each rank trains a disjoint set of LoRA jobs locally via grouped GEMM, so no adapter gradients cross rank boundaries. (b) Single-GPU executor: the full base model and multiple LoRA jobs reside on a single…

Figure 9: End-to-end training speedup of ALTO across single-GPU and multi-GPU configurations. From left to right: Llama-3.1-8B and Qwen2.5-7B on a single H100 GPU, Qwen2.5-32B on 2× H100, and Llama-3.1-70B on 4× H100. Each configuration trains 60 (single-GPU) or 64 (multi-GPU) heterogeneous LoRA adapters with varied ranks, batch sizes, and learning rates across three datasets. ALTO achieves up to 9.5× speedup on sin…

Figure 10: Model quality of the best configuration found by…

Figure 12: Evaluation of ALTO components on 8-GPU training makespan. B = Batched LoRA, S = Scheduler, EE = Early Exit. The full system (B+S+EE) achieves a 5.2× reduction in makespan compared to batching alone (B), with early exit contributing the largest individual gain.

Figure 13: Adapter Parallelism (AP) microbenchmark on…

Figure 15: Training samples saved by each early-exit pat…

Figure 16: Sensitivity of early exit predictions to warmup percentage.
Original abstract

Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over state-of-the-art without sacrificing adapter quality.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces ALTO, a co-designed system for LoRA hyperparameter tuning and orchestration across heterogeneous tasks in multi-tenant GPU clusters. It monitors loss trajectories to early-terminate unpromising adapter configurations, employs fused grouped GEMM and rank-local adapter parallelism to co-locate surviving jobs and reclaim capacity, and combines intra- and inter-task scheduling that exploits predictable LoRA job durations. The central empirical claim is an up to 13.8× speedup over state-of-the-art baselines while preserving adapter quality.

Significance. If the performance and quality claims are substantiated, ALTO would represent a practical advance in efficient multi-tenant LoRA serving by reducing wasted computation on weak hyperparameter candidates and improving cluster utilization. The combination of trajectory-based pruning with fused execution and heterogeneous scheduling is a concrete systems contribution that could influence production fine-tuning pipelines. The work is grounded in empirical measurements rather than new theory, which is appropriate for the systems/ML intersection but places a premium on experimental rigor.

major comments (3)
  1. §3.2 (Loss Trajectory Monitoring): The early-termination policy is load-bearing for both the reported speedup and the 'no quality sacrifice' guarantee, yet the manuscript provides insufficient detail on the exact heuristic (threshold, slope test, or window size) and its behavior on non-monotonic or task-dependent loss curves. LoRA convergence is known to be heterogeneous; a configuration that appears inferior after 10–20% of steps can overtake others later. Without an ablation showing false-negative rates or a comparison of final adapter quality (e.g., downstream task metrics) between pruned and fully trained runs, the quality invariant cannot be verified. (A hypothetical instantiation of such a rule is sketched after this list.)
  2. §4.2 (Experimental Methodology): The abstract asserts a 13.8× speedup and preserved quality, but the evaluation section lacks explicit enumeration of baselines (which SOTA systems?), datasets, number of independent runs, statistical significance tests, or error bars. Without these, the magnitude of the speedup and the claim of no quality degradation cannot be assessed for robustness across heterogeneous task mixes.
  3. §4.3 (Quality Evaluation): Adapter quality is asserted to be preserved, but the metrics used (validation loss only, or downstream accuracy/F1 on held-out tasks?) are not clearly stated. If quality is measured solely on the same loss trajectories used for pruning, the evaluation is circular and does not demonstrate that pruned configurations would not have yielded superior final adapters on the target tasks.
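
To make the first major comment concrete, here is one hypothetical instantiation of a loss-trajectory early-exit rule of the kind the referee is asking the authors to specify. The warmup, window, and margin values are invented; the paper's actual test may differ in form, not just in constants.

```python
# One plausible early-exit rule (illustrative, not ALTO's documented one):
# after a warmup period, prune a job only if its window-smoothed loss trails
# the best peer's by a relative margin, guarding against curves that recover.
from collections import deque

class EarlyExitMonitor:
    def __init__(self, warmup_steps=200, window=50, margin=0.05):
        self.warmup_steps = warmup_steps
        self.window = deque(maxlen=window)
        self.margin = margin
        self.step = 0

    def observe(self, loss, best_peer_loss):
        """Return True if this configuration should be terminated.
        best_peer_loss is the smoothed loss of the current best peer job."""
        self.step += 1
        self.window.append(loss)
        if self.step < self.warmup_steps or len(self.window) < self.window.maxlen:
            return False                     # never prune before warmup
        smoothed = sum(self.window) / len(self.window)
        return smoothed > best_peer_loss * (1.0 + self.margin)

# Usage: monitor = EarlyExitMonitor(); stop = monitor.observe(loss, best)
```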
minor comments (3)
  1. Figure 3 (scheduling diagram): The caption and legend do not clearly distinguish intra-task vs. inter-task placement decisions; add explicit annotations or a small example trace.
  2. §3.1 (notation): The definition of 'rank-local parallelism' reuses the symbol R for both adapter rank and a runtime variable; introduce distinct symbols to avoid confusion.
  3. §2 (missing reference): The related-work discussion of prior LoRA tuning systems should cite the specific papers whose throughput numbers are used as baselines in Table 2.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the early-termination policy, experimental methodology, and quality evaluation.

Point-by-point responses
  1. Referee: §3.2 (Loss Trajectory Monitoring): The early-termination policy is load-bearing for both the reported speedup and the 'no quality sacrifice' guarantee, yet the manuscript provides insufficient detail on the exact heuristic (threshold, slope test, or window size) and its behavior on non-monotonic or task-dependent loss curves. LoRA convergence is known to be heterogeneous; a configuration that appears inferior after 10–20% of steps can overtake others later. Without an ablation showing false-negative rates or a comparison of final adapter quality (e.g., downstream task metrics) between pruned and fully trained runs, the quality invariant cannot be verified.

    Authors: We agree that the description of the loss trajectory monitoring heuristic requires greater precision. In the revised manuscript we will expand §3.2 to specify the exact threshold, slope test, window size, and the mechanism used to accommodate non-monotonic or task-dependent curves. We will also add an ablation study (in the main text or appendix) that reports false-negative rates across the evaluated tasks and directly compares downstream task metrics (accuracy/F1) of pruned versus fully trained adapters, thereby substantiating the quality-preservation claim. revision: yes

  2. Referee: §4.2 (Experimental Methodology): The abstract asserts a 13.8× speedup and preserved quality, but the evaluation section lacks explicit enumeration of baselines (which SOTA systems?), datasets, number of independent runs, statistical significance tests, or error bars. Without these, the magnitude of the speedup and the claim of no quality degradation cannot be assessed for robustness across heterogeneous task mixes.

    Authors: We concur that §4.2 should be more explicit. The revised version will enumerate the precise state-of-the-art baselines, list all datasets and heterogeneous task mixes, report the number of independent runs, and include statistical significance tests together with error bars so that the speedup magnitude and quality claims can be evaluated for robustness. revision: yes

  3. Referee: §4.3 (Quality Evaluation): Adapter quality is asserted to be preserved, but the metrics used (validation loss only, or downstream accuracy/F1 on held-out tasks?) are not clearly stated. If quality is measured solely on the same loss trajectories used for pruning, the evaluation is circular and does not demonstrate that pruned configurations would not have yielded superior final adapters on the target tasks.

    Authors: We will revise §4.3 to state explicitly that adapter quality is measured by downstream task metrics (accuracy and F1 on held-out test sets) that are independent of the validation loss used for pruning decisions. We will further add side-by-side comparisons of these downstream metrics for early-terminated versus fully trained adapters, directly addressing the concern that the evaluation might be circular. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems evaluation with no self-referential derivations

full rationale

The paper presents ALTO as an engineering system for concurrent LoRA tuning, relying on loss-trajectory monitoring for early termination, fused GEMM kernels, rank-local parallelism, and combined intra/inter-task scheduling. All performance claims (e.g., 13.8× speedup) are supported by direct cluster measurements on heterogeneous workloads rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain that reduces the central result to its own inputs. No equations, ansatzes, or uniqueness theorems appear; the evaluation is externally falsifiable via reproduction on the same hardware and tasks. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper; the central claims rest on engineering design choices and measured speedups rather than mathematical axioms or new physical entities.

pith-pipeline@v0.9.0 · 5545 in / 1146 out tokens · 26140 ms · 2026-05-10T20:16:40.659620+00:00 · methodology



Reference graph

Works this paper leans on

68 extracted references · 20 canonical work pages · 12 internal anchors

  1. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
  2. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
  3. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021.
  4. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974, 2024.
  5. Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language Models Are Multilingual Chain-of-Thought Reasoners. arXiv preprint arXiv:2210.03057, 2022.
  6. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
  7. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv preprint arXiv:2403.04132, 2024.
  8. Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, Diego Zambrano, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 36:44123–44279, 2023.
  9. Eason Chen, Jia-En Lee, Jionghao Lin, and Kenneth Koedinger. GPTutor: Great Personalized Tutor with Large Language Models for Personalized Learning Content Generation. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, pages 539–541, 2024.
  10. Jochen Wulf and Juerg Meierhofer. Exploring the Potential of Large Language Models for Automation in Technical Customer Service. arXiv preprint arXiv:2405.09161, 2024.
  11. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.
  12. Fireworks AI. Fireworks AI Fine-Tuning, 2024.
  13. Together AI. https://www.together.ai/, 2024.
  14. Thinking Machines Lab. Tinker: A Flexible API for Fine-Tuning Open Source Models with LoRA, 2025.
  15. Daniel Han, Michael Han, and the Unsloth team. Unsloth, 2023.
  16. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), ACL 2024, Bangkok, Thailand, pages 400–410. Association for Computational Linguistics, 2024.
  17. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  18. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A Flexible and Efficient RLHF Framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
  19. Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, Benjamin Bossan, and Marian Tietz. PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods. https://github.com/huggingface/peft, 2022.
  20. Sheng Lin, Fangcheng Fu, Haoyang Li, Hao Ge, Xuanyu Wang, Jiawen Niu, Yaofeng Tu, and Bin Cui. Lobra: Multi-Tenant Fine-Tuning over Heterogeneous Data. Proc. VLDB Endow., 18(8):2616–2625, April 2025.
  21. Zhengmao Ye, Dengchun Li, Zetao Hu, Tingfeng Lan, Jian Sha, Shicong Zhang, Lei Duan, Jie Zuo, Hui Lu, Yuanchun Zhou, and Mingjie Tang. mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs. Proc. VLDB Endow., 18(6):1948–1961, February 2025.
  22. Google Cloud. Tune Gemini Models by Using Supervised Fine-Tuning, 2024. Accessed: 2026-03-28.
  23. Amazon Web Services. Fine-Tune Foundation Models with Amazon SageMaker JumpStart, 2024. Accessed: 2026-03-28.
  24. Microsoft. Customize a Model with Fine-Tuning in Azure OpenAI, 2024. Accessed: 2026-03-28.
  25. Minghao Yan, Zhuang Wang, Zhen Jia, Shivaram Venkataraman, and Yida Wang. Plora: Efficient LoRA Hyperparameter Tuning for Large Models. arXiv preprint arXiv:2508.02932, 2025.
  26. Jingkai Guo, Asmer Ali, Li Yang, and Deliang Fan. LoRAFusion: A Crossbar-Aware Multi-Task Adaption Framework via Efficient Fusion of Pretrained LoRA Modules. In Proceedings of the Great Lakes Symposium on VLSI 2025, GLSVLSI '25, pages 777–783, New York, NY, USA, 2025. Association for Computing Machinery.
  27. Kevin Li, Dibyadeep Saha, Avni Kanodia, and Fan Lai. tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models. arXiv preprint arXiv:2602.07263, 2026.
  28. Lutz Prechelt. Early Stopping - But When? In Neural Networks: Tricks of the Trade, pages 55–69. Springer, 2002.
  29. Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277, 2023.
  30. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2020), page 20. IEEE/ACM, 2020.
  31. Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
  32. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. Advances in Neural Information Processing Systems, 32, 2019.
  33. Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
  34. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.
  35. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
  36. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  37. Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  38. Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021.
  39. Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-Tuning: Prompt Tuning Can Be Comparable to Fine-Tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, 2022.
  40. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems, 36:10088–10115, 2023.
  41. Anyscale. https://www.anyscale.com/, 2024.
  42. Lambda. https://lambdalabs.com/, 2024.
  43. Andreessen Horowitz. 16 Changes to the Way Enterprises Are Building and Buying Generative AI, 2024. Accessed: 2025.
  44. Predibase. LoRAX: Multi-LoRA Inference Server That Scales to 1000s of Fine-Tuned LLMs. GitHub, 2024.
  45. John Schulman and Thinking Machines Lab. LoRA Without Regret. Thinking Machines Lab: Connectionism.
  46. https://thinkingmachines.ai/blog/lora/
  47. Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, and Willie Neiswanger. Tina: Tiny Reasoning Models via LoRA. arXiv preprint arXiv:2504.15777, 2025.
  48. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  49. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
  50. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.
  51. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  52. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv preprint arXiv:2006.15704, 2020.
  53. Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv preprint arXiv:2311.03285, 2023.
  54. Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-Tenant LoRA Serving. Proceedings of Machine Learning and Systems, 6:1–13, 2024.
  55. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.
  56. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024.
  57. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021.
  58. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv preprint arXiv:2411.15124, 2024.
  59. Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data Recipes for Reasoning Models. arXiv preprint arXiv:2506.04178, 2025.
  60. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  61. Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting Language Models with High-Quality Feedback. 2023.
  62. Sangyoon Lee and Jaeho Lee. Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA. arXiv preprint arXiv:2602.09492, 2026.
  63. Philippe Tillet, H. T. Kung, and David Cox. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019), pages 10–19, New York, NY, USA, 2019. Association for Computing Machinery.
  64. Laurent Perron and Frédéric Didier. CP-SAT.
  65. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.
  66. Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-Tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A System for Massively Parallel Hyperparameter Tuning. Proceedings of Machine Learning and Systems, 2:230–246, 2020.
  67. Tobias Domhan, Jost Tobias Springenberg, Frank Hutter, et al. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI, volume 15, pages 3460–3468, 2015.
  68. Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 911–927, 2024.