pith. machine review for the scientific record. sign in

arxiv: 2604.07173 · v1 · submitted 2026-04-08 · 💻 cs.DC

Recognition: 2 theorem links

· Lean Theorem

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.DC
keywords LoRAdisaggregated servingLLM inferencemulti-tenant systemsmixture of expertslatency SLOGPU optimization
0
0 comments X

The pith

Disaggregating LoRA execution from base model inference enables higher request throughput under strict latency constraints in multi-LoRA LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LoRA adapters allow efficient customization of large language models but create memory challenges when used with mixture-of-experts architectures in multi-tenant settings. The paper proposes to disaggregate the execution of these adapters from the main inference process by introducing a dedicated shared server for LoRA operations. This server incorporates parallelism across adapters, provisions based on latency targets, and uses direct hardware communication to minimize delays. If the design works as described, serving systems can process significantly more requests while keeping response times within required bounds and support more customized models at once. The approach focuses on practical optimizations to make the separation efficient rather than adding overhead.

Core claim

By decoupling LoRA adapter execution from base-model inference, InfiniLoRA uses a shared LoRA Server equipped with parallelism-aware execution, SLO-driven provisioning, and optimizations including GPU-initiated communication and specialized kernels to achieve an average 3.05 times increase in serviceable request rate under strict latency SLOs and raise the share of adapters meeting the SLO by 54 percent.

What carries the argument

A shared LoRA Server that performs adapter computations independently from the base model, managed with awareness of parallel execution and hardware-specific optimizations to reduce data transfer costs.

If this is right

  • The system can handle a higher volume of requests while respecting latency service level objectives.
  • A greater percentage of LoRA adapters become usable without breaching performance guarantees.
  • Memory usage becomes more efficient for models with high adapter overhead such as those based on mixture of experts.
  • Critical path optimizations keep added latency from separation low enough to yield net improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation technique could extend to other fine-tuning methods beyond LoRA in inference serving.
  • Pooling adapter resources in this way might lower the overall hardware requirements for providers offering many customized models.
  • Testing the system with non-MoE models would clarify how much the gains depend on the high memory cost of expert-based architectures.
  • Further work could explore dynamic migration of adapters between servers to balance load in real time.

Load-bearing premise

The communication and coordination costs introduced by running LoRA on a separate server stay small compared to the benefits from reduced memory contention and better parallelism.

What would settle it

Measuring request throughput on a cluster with slower interconnects between the base model servers and the LoRA server and finding that the gains fall below the reported levels or become negative.

Figures

Figures reproduced from arXiv: 2604.07173 by Bingsheng He, Hongyu Chen, Jingwen Leng, Letian Ruan, Minyi Guo, Shixuan Sun, Xinyu Chen, Yuchen Li, Zilin Xu.

Figure 1
Figure 1. Figure 1: (Top) LoRA cache capacity across model architec￾tures. (Bottom) Scale-out vs. scale-up performance. A natural response is to increase LoRA cache capacity by either scaling out, i.e., deploying more LLM instances, or scaling up, i.e., allocating more GPUs to a single instance. However, both approaches have fundamental limitations. Scaling out increases total cache capacity across instances but requires dupl… view at source ↗
Figure 3
Figure 3. Figure 3: LoRA computation on Dense and MoE models. Scheduler Req Resp LoRA Table ID Count 1 12 … … 512 0 Req Admission ① Batch ② LoRA Waiting Queue LLM Instance Host Memory Loading Adapters Req KV GPU 1 Model Adapter1 Adapter2 Running Queue Resp Adapter1 Adapter2 ... Adapter511 Adapter512 ... KV GPU 2 Model Adapter1 Adapter2 N N Y Y [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Coupled-design multi-LoRA serving architecture. R ℎ×𝑟 and 𝐵 ∈ R 𝑟×𝑑 are trainable matrices, called adapters. Given an input 𝑥, the output becomes 𝑦 ′ = 𝑥𝑊 ′ = 𝑥𝑊 +𝑥𝐴𝐵. The rank 𝑟 is typically small (e.g., 32–128), which signifi￾cantly reduces both training cost and inference overhead compared to fully fine-tuning [6, 13, 28]. In multi-task and multi-tenant serving environments, many LoRA adapters will be s… view at source ↗
Figure 2
Figure 2. Figure 2: Prefill–decode disaggregated architecture. LLM instances are deployed with 2 GPUs using expert parallelism. Pretrained Weights W ∈ R!×# A ∈ R!×$ B ∈ R$×# Input Output (a) Dense model. Input Output … Dispatch Expert 1 Pretrained Weights W ∈ R!×# A ∈ R!×$ B ∈ R$×# Combine Expert N Pretrained Weights W ∈ R!×# A ∈ R!×$ B ∈ R$×# (b) MoE model [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Impact of LoRA cache ratio on TTFT performance and SLO attainment. (Left) P95 TTFT under varying cache ratios, with SLO of 0.25 seconds. (Right) Percentage of LoRA adapters for which the fraction of requests meeting the TTFT SLO exceeds specific thresholds (50%, 80%, and 90%). requests must wait until in-flight executions complete and GPU memory becomes available to load its adapter. This introduces additi… view at source ↗
Figure 6
Figure 6. Figure 6: Impact of LoRA cache ratio on batch size. Measure￾ments are collected during the steady-state interval (30–270s) of the 300s experiment. adapters to the base model execution. This tight coupling forces dynamic LoRA adapters to scale in lockstep with the heavyweight base model, limiting scalability and efficiency. 3 An Overview of InfiniLoRA To overcome the fundamental limitations of the coupled ar￾chitectu… view at source ↗
Figure 7
Figure 7. Figure 7: illustrates the architecture and execution work￾flow of InfiniLoRA. Unlike coupled designs, InfiniLoRA man￾ages LoRA adapters in a dedicated LoRA Server, which may span multiple nodes, while LLM instances remain LoRA-free and execute the base model using their existing optimiza￾tion strategies. During request processing, an LLM instance performs base-model computation and forwards the corre￾sponding activa… view at source ↗
Figure 8
Figure 8. Figure 8: LoRA adapter placement strategies across server GPUs. The three-dimensional block represents the adapter space, with axes corresponding to LoRA adapters, layers, and experts. Each color indicates the server GPU (GPU 1–4) to which a partition of adapters is assigned. Arrows depict the activation data flow between client GPUs and server GPUs. using four metrics summarized in [PITH_FULL_IMAGE:figures/full_fi… view at source ↗
Figure 10
Figure 10. Figure 10: Layer-wise LoRA loading. Shaded blue blocks represent LoRA execution from any other LLM instance. pull-based communication incurs 2.63× higher latency than push-based. A similar asymmetry applies to the server-to-client path: pull-based designs require clients to repeatedly poll remote completion states, incurring additional round-trip latency. Consequently, InfiniLoRA adopts push-based communica￾tion in … view at source ↗
Figure 11
Figure 11. Figure 11: P95 TTFT, SLO attainment rate, throughput and average TPOT from top to bottom under varying loads. We compare InfiniLoRA against three baselines. The two values listed under each model name correspond to the LoRA cache capacity provided by S-LoRA w/ Less LoRA, S-LoRA (including w/ SJF) and InfiniLoRA, respectively. 1 2 3 4 5 6 LLM Instance Count 10 1 10 0 P95 TTFT (s) InfiniLoRA SLO (0.25s) 1 2 3 4 5 6 LL… view at source ↗
Figure 12
Figure 12. Figure 12: Performance of scaling #LLM instances, config￾ured with a request rate of 12 req/s per instance (keeping LoRA Server unchanged and using Mixtral-8x7B model). larger effective batch size enabled by increased LoRA cache capacity improves GPU utilization. At high request rates on Mixtral, InfiniLoRA exhibits higher TPOT than S-LoRA. This occurs because both systems cache a large number of adapters, but under… view at source ↗
Figure 14
Figure 14. Figure 14: Ablation study for quantifying the effectiveness of individual optimization techniques. +kernel represents the fully optimized system with all techniques enabled. layer-wise adapter loading (+loading). The full InfiniLoRA system further incorporates hardware-specialized kernels (+kernel). As shown in the results, despite having a slightly larger cache capacity (104 over 100), +disagg alone increases tail … view at source ↗
Figure 13
Figure 13. Figure 13: Performance of scaling LoRA Server resources (we keep resources for LLM instances unchanged, using Qwen3-30B-A3B model and request rate=35 req/s). are also consistent with the probabilistic model developed in Section 4.2: for LoRA cache capacities of 128, 192, and 256 in this setting, the model predicts immediate admission probabilities of 83.0%, 92.2%, and 100.0%, respectively, which closely matches the … view at source ↗
Figure 16
Figure 16. Figure 16: Our experimental setup consists of a 4-GPU LoRA Server serving two types of LLM instances: a Mixtral 8x7B model (2 GPUs) or a Scaled MoE model (4 GPUs). We observe that communication latency scales linearly with batch size, as it is strictly bound by the send/receive bandwidth of the LLM instance’s NICs. In contrast, LoRA computation time increases sub-linearly with batch size. This behavior stems from th… view at source ↗
Figure 15
Figure 15. Figure 15: Scalability under varying LoRA popularity distri￾butions and adapter counts. A.1.2 Scale with batch size. We evaluate LoRA Server’s processing latency across varying task sizes as shown in [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗
Figure 17
Figure 17. Figure 17: Impact of interconnect bandwidth on In￾finiLoRA’s serving performance: NVLink vs. InfiniBand. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: End-to-end performance comparison of two LoRA data layouts (𝐸𝑃2-𝑃𝑃4 and 𝐸𝑃4-𝑃𝑃2) under the same LoRA cache capacity. bandwidth utilization. Specifically, InfiniLoRA-BGMV excels during the shrink phase but exhibits performance degrada￾tion in the expand phase, primarily due to the larger volume of data written back in the latter. In contrast, InfiniLoRA￾SGMV maintains consistent performance across both pha… view at source ↗
Figure 19
Figure 19. Figure 19: Characterization of latency and bandwidth for distinct LoRA kernels across shrink/expand phases. Dashed lines represent bandwidth and solid lines indicate latency. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_19.png] view at source ↗
read the original abstract

LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average $3.05\times$ increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents InfiniLoRA, a disaggregated multi-LoRA serving system for LLMs that decouples LoRA adapter execution from base-model inference to address high memory costs in MoE architectures. It introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, GPU-initiated communication, and specialized kernels. The central claim, supported by experiments, is an average 3.05× increase in serviceable request rate under strict latency SLOs together with a 54% improvement in the fraction of adapters meeting those SLOs.

Significance. If the measured gains prove robust, the disaggregation approach could meaningfully advance scalable multi-tenant LLM serving by improving memory efficiency and latency compliance for memory-intensive MoE models without proportional hardware increases. The work supplies a concrete systems design and empirical quantification that could inform future inference-stack research.

major comments (2)
  1. [§4] §4 (Evaluation): The headline 3.05× request-rate and +54% SLO-compliance claims are presented without a latency-component breakdown or explicit measurement showing that GPU-initiated communication and cross-server data movement remain negligible relative to the memory savings; this is load-bearing because the architecture's premise is that disaggregation overheads do not inflate tail latencies under the strict SLOs.
  2. [§4.2, §5] §4.2 and §5: No ablation or sensitivity results are reported for different interconnect bandwidths, MoE expert counts, or request-trace characteristics, leaving open whether the reported gains generalize beyond the specific high-bandwidth hardware and workloads tested.
minor comments (2)
  1. [Abstract, §4] The abstract and §4 omit the precise numerical SLO thresholds, model sizes, and baseline system versions used, which would aid immediate interpretation of the quantitative results.
  2. [§3.3] Notation for the SLO-driven provisioning algorithm in §3.3 could be clarified with a small pseudocode listing or explicit variable definitions to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback highlights important aspects of our evaluation that we will strengthen in the revision. We address each major comment below and commit to incorporating the suggested improvements.

read point-by-point responses
  1. Referee: [§4] §4 (Evaluation): The headline 3.05× request-rate and +54% SLO-compliance claims are presented without a latency-component breakdown or explicit measurement showing that GPU-initiated communication and cross-server data movement remain negligible relative to the memory savings; this is load-bearing because the architecture's premise is that disaggregation overheads do not inflate tail latencies under the strict SLOs.

    Authors: We agree that an explicit breakdown of latency components would strengthen the central claims. While the manuscript describes the GPU-initiated communication and specialized kernels as critical-path optimizations intended to keep disaggregation overheads low, we did not provide a quantitative decomposition isolating communication versus computation times under load. In the revised version we will add a new subsection (or expanded figure) in §4 that reports per-component latency measurements (base-model inference, LoRA execution, GPU-initiated transfers, and cross-server movement) across the evaluated request rates. These measurements will explicitly show that the added communication remains a small fraction of the SLO budget and does not drive the observed tail-latency improvements. revision: yes

  2. Referee: [§4.2, §5] §4.2 and §5: No ablation or sensitivity results are reported for different interconnect bandwidths, MoE expert counts, or request-trace characteristics, leaving open whether the reported gains generalize beyond the specific high-bandwidth hardware and workloads tested.

    Authors: We acknowledge that the current evaluation is performed on a single high-bandwidth cluster and a limited set of MoE configurations and traces. To address generalizability, the revised manuscript will include additional sensitivity experiments. We will report results for (1) reduced interconnect bandwidth (e.g., PCIe-only versus NVLink), (2) varying numbers of MoE experts, and (3) alternative request-trace distributions. These results will be placed in an expanded §4.2 with corresponding discussion in §5, allowing readers to assess how the gains scale with hardware and workload parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation with no derivation chain

full rationale

The paper presents a disaggregated LoRA serving architecture for MoE models, supported by experimental measurements of request-rate gains and SLO compliance. No mathematical derivations, fitted parameters, or equations are described that could reduce to self-definition or fitted inputs. Claims rest on external benchmarks (measured throughput and latency under specific workloads) rather than internal construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the abstract or context to justify core results. This is a standard empirical systems paper whose validity is testable via replication on the reported hardware and traces.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented physical entities are stated in the abstract; the work rests on standard assumptions of distributed systems engineering and the existence of the described hardware kernels.

pith-pipeline@v0.9.0 · 5466 in / 1101 out tokens · 49136 ms · 2026-05-10T17:19:21.433509+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 35 canonical work pages · 5 internal anchors

  1. [1]

    Anonymous. 2026. Understanding LoRA As Knowledge Memory: An Empirical Analysis.https://openreview.net/forum?id=i1Mi2R1TsU

  2. [2]

    Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, and Humphrey Shi. 2025. MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models. arXiv:2511.20629 [cs.CV]https://arxiv.org/abs/2511.20629

  3. [3]

    Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC]https://arxiv.org/abs/2310.18547

  4. [4]

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024. LongLoRA: Efficient Fine-tuning of Long- Context Large Language Models. arXiv:2309.12307 [cs.CL]https: //arxiv.org/abs/2309.12307

  5. [5]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  6. [6]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

  7. [7]

    InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23)

    QLORA: efficient finetuning of quantized LLMs. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 441, 28 pages

  8. [8]

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, and Junchen Jiang. 2025. PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles(Lotte Hotel World, S...

  9. [9]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity. arXiv:2101.03961 [cs.LG]https://arxiv.org/abs/2101. 03961

  10. [10]

    Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. 2025. How to Train Long-Context Language Models (Effectively). InACL

  11. [11]

    Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2024. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. arXiv:2406.17565 [cs.DC] https://arxiv.org/abs/2406.17565

  12. [12]

    Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Chenxi Wang, Jiang Xu, Shuang Chen, Hao Feng, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2025. ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads.ACM Trans. Archit. Code Optim.22, 2, Article 77 (July 2025), 24 pages. doi:10.1145/3732941

  13. [13]

    Chenghao Hu, Yufei Kang, and Baochun Li. 2025. Communication- Efficient MoE Fine-Tuning with Locality-Aware Expert Placement. In2025 IEEE 45th International Conference on Distributed Computing Systems (ICDCS). 166–176. doi:10.1109/ICDCS63083.2025.00025

  14. [14]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low- Rank Adaptation of Large Language Models. InInternational Confer- ence on Learning Representations.https://openreview.net/forum?id= nZeVKeeFYf9

  15. [15]

    Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, and Josep Torrellas. 2025. Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environ- ments. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Machinery, New York, NY, USA, 217–231...

  16. [16]

    Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, and Josep Torrellas. 2025. Serving Het- erogeneous LoRA Adapters in Distributed LLM Inference Systems. arXiv:2511.22880 [cs.DC]https://arxiv.org/abs/2511.22880

  17. [17]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Men- sch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

  18. [18]

    Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, and Yunxin Liu

  19. [19]

    Lora-switch: Boosting the efficiency of dy- namic llm adapters via system-algorithm co-design.arXiv preprint arXiv:2405.17741, 2024

    LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design. arXiv:2405.17741 [cs.AI]https: //arxiv.org/abs/2405.17741

  20. [20]

    Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, and Xiangnan He. 2024. Customizing language models with instance-wise LoRA for sequential recommendation. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 3593...

  21. [21]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica

  22. [22]

    InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

    Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. 14 InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

  23. [23]

    Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2025. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue. arXiv:2406.05925 [cs.CL]https://arxiv.org/ abs/2406.05925

  24. [24]

    Suyi Li, Yifan Qiao, Jiacheng Ma, Shan Yu, Haoran Ma, Ziming Liu, Hang Ren, Wenguang Chen, Yongwei Wu, Weimin Zheng, and Kang Chen. 2025. Toppings: Modular and Extensible Serverless Function Delivery at High Speed. In2025 USENIX Annual Technical Conference (USENIX ATC 25). USENIX Association, Seattle, WA.https://www. usenix.org/conference/atc25/presentati...

  25. [25]

    Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guo- hong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guan- jing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, and Yunxin Liu. 2024. Personal LLM Agents: Insights and Survey about the...

  26. [26]

    2022.Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

    Pak Markthub, Jim Dinan, Sreeram Potluri, and Seth How- ell. 2022.Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. NVIDIA.https://developer.nvidia.com/blog/improving-network- performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem- and-gpudirect-async/

  27. [27]

    OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL]https://arxiv.org/abs/2508.10925

  28. [28]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2025. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (Buenos Aires, Argentina)(ISCA ’24). IEEE Press, 118–132. doi:10. 1109/ISCA59077.2024.00019

  29. [29]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Moon- cake: Trading More Storage for Less Computation — A KVCache- centric Architecture for Serving LLM Chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 155–170.https://w...

  30. [30]

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of- Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596 [cs.LG]https://arxiv.org/abs/2201.05596

  31. [31]

    John Schulman and Thinking Machines Lab. 2025. LoRA Without Regret.Thinking Machines Lab: Connectionism(2025). doi:10.64434/ tml.20250929https://thinkingmachines.ai/blog/lora/

  32. [32]

    Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vin- cent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, and Denny Zhou. 2023. Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models. a...

  33. [33]

    2025.EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

    Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, and Ang Li. 2025.EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices. Association for Computing Machinery, New York, NY, USA, 138–153.https://doi.org/10.1145/ 3711875.3729141

  34. [34]

    Gonzalez, and Ion Stoica

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters.arXiv preprint arXiv:2311.03285(2023)

  35. [35]

    Ge Shi, Hanieh Sadri, Qian Wang, Yu Zhang, Ying Xiong, Yong Zhang, and Zhenan Fan. 2025. ExpertWeave: Efficiently Serving Expert- Specialized Fine-Tuned Adapters at Scale. arXiv:2508.17624 [cs.DC] https://arxiv.org/abs/2508.17624

  36. [36]

    Smith, Luke Zettlemoyer, Wen tau Yih, and Mike Lewis

    Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen tau Yih, and Mike Lewis. 2024. In-Context Pretraining: Language Modeling Beyond Document Boundaries. InThe Twelfth International Confer- ence on Learning Representations.https://openreview.net/forum?id= LXVswInHOo

  37. [37]

    Xiao Shi, Jiangsu Du, Zhiguang Chen, and Yutong Lu. 2025. AuLoRA: Fine-Grained Loading and Computation Orchestration for Efficient LoRA LLM Serving. In2025 IEEE 43rd International Conference on Computer Design (ICCD). 277–284. doi:10.1109/ICCD65941.2025.00046

  38. [38]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  39. [39]

    The Mosaic Research Team. 2024. Introducing DBRX: A New State- of-the-Art Open LLM.https://www.databricks.com/blog/introducing- dbrx-new-state-art-open-llm. Accessed: 2026-01-21

  40. [40]

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). ACM, Tor...

  41. [41]

    Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He

  42. [42]

    arXiv preprint arXiv:2502.00592 , year=

    M+: Extending MemoryLLM with Scalable Long-Term Memory. arXiv:2502.00592 [cs.CL]https://arxiv.org/abs/2502.00592

  43. [43]

    Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. 2025. LoRA-Pro: Are Low-Rank Adapters Properly Optimized?. InThe Thir- teenth International Conference on Learning Representations (ICLR)

  44. [44]

    Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. In18th USENIX Symposium on Oper- ating Systems Design and Implementation (OSDI 24). USENIX Associa- tion, Santa Clara, CA, 911–927.https://www.usenix.org/conference/ osdi24/presentation/wu-bingyang

  45. [45]

    Shuaipeng Wu, Yanying Lin, Shijie Peng, Wenyan Chen, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, and Kejiang Ye. 2025. Rock: Serving Multimodal Models in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters. In2025 IEEE International Conference on Cluster Computing (CLUSTER). 1–13. doi:10.1109/CLUSTER59342.2025.11186463

  46. [46]

    Lingnan Xia and Hua Ma. 2024. Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU.IEEE Access12 (2024), 160441–160449. doi:10.1109/ACCESS. 2024.3483250

  47. [47]

    Yifei Xia, Fangcheng Fu, Wentao Zhang, Jiawei Jiang, and Bin Cui

  48. [48]

    InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada) (NIPS ’24)

    Efficient multi-task LLM quantization and serving for multiple LoRA adapters. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 2034, 29 pages

  49. [49]

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv:2410.10819 [cs.CL]https://arxiv.org/abs/2410.10819

  50. [50]

    Yahe Yang, Chunliang Tao, and Xiaojing Fan. 2025. LoRA-LiteE: A Computationally Efficient Framework for Chatbot Preference-Tuning. arXiv:2411.09947 [cs.CL]https://arxiv.org/abs/2411.09947

  51. [52]

    Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. 2025. Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management. arXiv:2505.03756 [cs.AR]https://arxiv.org/abs/2505.03756

  52. [53]

    Tianyu Zhang, Peng Zhang, Yusong Gao, and Yun Zhang. 2025. To- gether with SGLang: Best Practices for Serving DeepSeek-R1 on H20- 96G.https://lmsys.org/blog/2025-09-26-sglang-ant-group/. LMSYS Org Blog

  53. [54]

    You Zhang, Jin Wang, Liang-Chih Yu, Dan Xu, and Xuejie Zhang

  54. [55]

    Proceedings of the AAAI Conference on Artificial Intelligence38, 17 (Mar

    Personalized LoRA for Human-Centered Text Understanding. Proceedings of the AAAI Conference on Artificial Intelligence38, 17 (Mar. 2024), 19588–19596. doi:10.1609/aaai.v38i17.29931

  55. [56]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serv- ing. InProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation(Santa Clara, CA, USA)(OSDI’24). USENIX Association, USA, Art...

  56. [57]

    Changhai Zhou, Yuhua Zhou, Shiyang Zhang, Yibin Wang, and Zekai Liu. 2025. Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving.Proceedings of the AAAI Conference on Artificial Intelligence39, 21 (Apr. 2025), 22910–22918. doi:10.1609/aaai.v39i21. 34453

  57. [58]

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Dis- aggregated Expert Parallelism...

  58. [59]

    Ruidong Zhu, Ziyue Jiang, Zhi Zhang, Xin Liu, Xuanzhe Liu, and Xin Jin. 2025. Cannikin: No Lagger of SLO in Concurrent Multiple LoRA LLM Serving.IEEE Transactions on Parallel and Distributed Systems 36, 9 (2025), 1972–1984. doi:10.1109/TPDS.2025.3590014 16 InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models A Appendix A.1 Additional Sca...