arxiv: 2604.07173 · v1 · submitted 2026-04-08 · 💻 cs.DC

Recognition: 2 theorem links

· Lean Theorem

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

Hongyu Chen , Letian Ruan , Zilin Xu , Yuchen Li , Xinyu Chen , Jingwen Leng , Bingsheng He , Minyi Guo

show 1 more author

Shixuan Sun

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.DC

keywords LoRAdisaggregated servingLLM inferencemulti-tenant systemsmixture of expertslatency SLOGPU optimization

0 comments

The pith

Disaggregating LoRA execution from base model inference enables higher request throughput under strict latency constraints in multi-LoRA LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LoRA adapters allow efficient customization of large language models but create memory challenges when used with mixture-of-experts architectures in multi-tenant settings. The paper proposes to disaggregate the execution of these adapters from the main inference process by introducing a dedicated shared server for LoRA operations. This server incorporates parallelism across adapters, provisions based on latency targets, and uses direct hardware communication to minimize delays. If the design works as described, serving systems can process significantly more requests while keeping response times within required bounds and support more customized models at once. The approach focuses on practical optimizations to make the separation efficient rather than adding overhead.

Core claim

By decoupling LoRA adapter execution from base-model inference, InfiniLoRA uses a shared LoRA Server equipped with parallelism-aware execution, SLO-driven provisioning, and optimizations including GPU-initiated communication and specialized kernels to achieve an average 3.05 times increase in serviceable request rate under strict latency SLOs and raise the share of adapters meeting the SLO by 54 percent.

What carries the argument

A shared LoRA Server that performs adapter computations independently from the base model, managed with awareness of parallel execution and hardware-specific optimizations to reduce data transfer costs.

If this is right

The system can handle a higher volume of requests while respecting latency service level objectives.
A greater percentage of LoRA adapters become usable without breaching performance guarantees.
Memory usage becomes more efficient for models with high adapter overhead such as those based on mixture of experts.
Critical path optimizations keep added latency from separation low enough to yield net improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation technique could extend to other fine-tuning methods beyond LoRA in inference serving.
Pooling adapter resources in this way might lower the overall hardware requirements for providers offering many customized models.
Testing the system with non-MoE models would clarify how much the gains depend on the high memory cost of expert-based architectures.
Further work could explore dynamic migration of adapters between servers to balance load in real time.

Load-bearing premise

The communication and coordination costs introduced by running LoRA on a separate server stay small compared to the benefits from reduced memory contention and better parallelism.

What would settle it

Measuring request throughput on a cluster with slower interconnects between the base model servers and the LoRA server and finding that the gains fall below the reported levels or become negative.

Figures

Figures reproduced from arXiv: 2604.07173 by Bingsheng He, Hongyu Chen, Jingwen Leng, Letian Ruan, Minyi Guo, Shixuan Sun, Xinyu Chen, Yuchen Li, Zilin Xu.

**Figure 1.** Figure 1: (Top) LoRA cache capacity across model architectures. (Bottom) Scale-out vs. scale-up performance. A natural response is to increase LoRA cache capacity by either scaling out, i.e., deploying more LLM instances, or scaling up, i.e., allocating more GPUs to a single instance. However, both approaches have fundamental limitations. Scaling out increases total cache capacity across instances but requires dupl… view at source ↗

**Figure 3.** Figure 3: LoRA computation on Dense and MoE models. Scheduler Req Resp LoRA Table ID Count 1 12 … … 512 0 Req Admission ① Batch ② LoRA Waiting Queue LLM Instance Host Memory Loading Adapters Req KV GPU 1 Model Adapter1 Adapter2 Running Queue Resp Adapter1 Adapter2 ... Adapter511 Adapter512 ... KV GPU 2 Model Adapter1 Adapter2 N N Y Y [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Coupled-design multi-LoRA serving architecture. R ℎ×𝑟 and 𝐵 ∈ R 𝑟×𝑑 are trainable matrices, called adapters. Given an input 𝑥, the output becomes 𝑦 ′ = 𝑥𝑊 ′ = 𝑥𝑊 +𝑥𝐴𝐵. The rank 𝑟 is typically small (e.g., 32–128), which significantly reduces both training cost and inference overhead compared to fully fine-tuning [6, 13, 28]. In multi-task and multi-tenant serving environments, many LoRA adapters will be s… view at source ↗

**Figure 2.** Figure 2: Prefill–decode disaggregated architecture. LLM instances are deployed with 2 GPUs using expert parallelism. Pretrained Weights W ∈ R!×# A ∈ R!×$ B ∈ R$×# Input Output (a) Dense model. Input Output … Dispatch Expert 1 Pretrained Weights W ∈ R!×# A ∈ R!×$ B ∈ R$×# Combine Expert N Pretrained Weights W ∈ R!×# A ∈ R!×$ B ∈ R$×# (b) MoE model [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 5.** Figure 5: Impact of LoRA cache ratio on TTFT performance and SLO attainment. (Left) P95 TTFT under varying cache ratios, with SLO of 0.25 seconds. (Right) Percentage of LoRA adapters for which the fraction of requests meeting the TTFT SLO exceeds specific thresholds (50%, 80%, and 90%). requests must wait until in-flight executions complete and GPU memory becomes available to load its adapter. This introduces additi… view at source ↗

**Figure 6.** Figure 6: Impact of LoRA cache ratio on batch size. Measurements are collected during the steady-state interval (30–270s) of the 300s experiment. adapters to the base model execution. This tight coupling forces dynamic LoRA adapters to scale in lockstep with the heavyweight base model, limiting scalability and efficiency. 3 An Overview of InfiniLoRA To overcome the fundamental limitations of the coupled architectu… view at source ↗

**Figure 7.** Figure 7: illustrates the architecture and execution workflow of InfiniLoRA. Unlike coupled designs, InfiniLoRA manages LoRA adapters in a dedicated LoRA Server, which may span multiple nodes, while LLM instances remain LoRA-free and execute the base model using their existing optimization strategies. During request processing, an LLM instance performs base-model computation and forwards the corresponding activa… view at source ↗

**Figure 8.** Figure 8: LoRA adapter placement strategies across server GPUs. The three-dimensional block represents the adapter space, with axes corresponding to LoRA adapters, layers, and experts. Each color indicates the server GPU (GPU 1–4) to which a partition of adapters is assigned. Arrows depict the activation data flow between client GPUs and server GPUs. using four metrics summarized in [PITH_FULL_IMAGE:figures/full_fi… view at source ↗

**Figure 10.** Figure 10: Layer-wise LoRA loading. Shaded blue blocks represent LoRA execution from any other LLM instance. pull-based communication incurs 2.63× higher latency than push-based. A similar asymmetry applies to the server-to-client path: pull-based designs require clients to repeatedly poll remote completion states, incurring additional round-trip latency. Consequently, InfiniLoRA adopts push-based communication in … view at source ↗

**Figure 11.** Figure 11: P95 TTFT, SLO attainment rate, throughput and average TPOT from top to bottom under varying loads. We compare InfiniLoRA against three baselines. The two values listed under each model name correspond to the LoRA cache capacity provided by S-LoRA w/ Less LoRA, S-LoRA (including w/ SJF) and InfiniLoRA, respectively. 1 2 3 4 5 6 LLM Instance Count 10 1 10 0 P95 TTFT (s) InfiniLoRA SLO (0.25s) 1 2 3 4 5 6 LL… view at source ↗

**Figure 12.** Figure 12: Performance of scaling #LLM instances, configured with a request rate of 12 req/s per instance (keeping LoRA Server unchanged and using Mixtral-8x7B model). larger effective batch size enabled by increased LoRA cache capacity improves GPU utilization. At high request rates on Mixtral, InfiniLoRA exhibits higher TPOT than S-LoRA. This occurs because both systems cache a large number of adapters, but under… view at source ↗

**Figure 14.** Figure 14: Ablation study for quantifying the effectiveness of individual optimization techniques. +kernel represents the fully optimized system with all techniques enabled. layer-wise adapter loading (+loading). The full InfiniLoRA system further incorporates hardware-specialized kernels (+kernel). As shown in the results, despite having a slightly larger cache capacity (104 over 100), +disagg alone increases tail … view at source ↗

**Figure 13.** Figure 13: Performance of scaling LoRA Server resources (we keep resources for LLM instances unchanged, using Qwen3-30B-A3B model and request rate=35 req/s). are also consistent with the probabilistic model developed in Section 4.2: for LoRA cache capacities of 128, 192, and 256 in this setting, the model predicts immediate admission probabilities of 83.0%, 92.2%, and 100.0%, respectively, which closely matches the … view at source ↗

**Figure 16.** Figure 16: Our experimental setup consists of a 4-GPU LoRA Server serving two types of LLM instances: a Mixtral 8x7B model (2 GPUs) or a Scaled MoE model (4 GPUs). We observe that communication latency scales linearly with batch size, as it is strictly bound by the send/receive bandwidth of the LLM instance’s NICs. In contrast, LoRA computation time increases sub-linearly with batch size. This behavior stems from th… view at source ↗

**Figure 15.** Figure 15: Scalability under varying LoRA popularity distributions and adapter counts. A.1.2 Scale with batch size. We evaluate LoRA Server’s processing latency across varying task sizes as shown in [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗

**Figure 17.** Figure 17: Impact of interconnect bandwidth on InfiniLoRA’s serving performance: NVLink vs. InfiniBand. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_17.png] view at source ↗

**Figure 18.** Figure 18: End-to-end performance comparison of two LoRA data layouts (𝐸𝑃2-𝑃𝑃4 and 𝐸𝑃4-𝑃𝑃2) under the same LoRA cache capacity. bandwidth utilization. Specifically, InfiniLoRA-BGMV excels during the shrink phase but exhibits performance degradation in the expand phase, primarily due to the larger volume of data written back in the latter. In contrast, InfiniLoRASGMV maintains consistent performance across both pha… view at source ↗

**Figure 19.** Figure 19: Characterization of latency and bandwidth for distinct LoRA kernels across shrink/expand phases. Dashed lines represent bandwidth and solid lines indicate latency. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_19.png] view at source ↗

read the original abstract

LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as MoE significantly increase LoRA memory cost, making existing coupled LoRA serving designs poorly scalable and prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA can achieve an average $3.05\times$ increase in serviceable request rate under strict latency SLOs, and improve the percentage of LoRA adapters satisfying the SLO requirement by 54.0\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

InfiniLoRA decouples LoRA execution into a shared server with GPU-initiated comms and SLO provisioning, delivering claimed 3x throughput gains for MoE multi-tenant serving, though the overheads and generalization need tighter validation.

read the letter

The central point is that this paper takes the memory pressure from running many LoRA adapters on MoE models and solves it by fully disaggregating the adapter work onto a separate shared server instead of keeping everything coupled on the same GPUs. That shift, plus their parallelism-aware kernels, GPU-initiated communication, and SLO-driven provisioning, is the main new piece. Earlier LoRA serving papers stayed coupled, so this architecture change directly targets the scaling wall that appears once expert counts and adapter numbers rise together. The reported 3.05× request-rate lift and 54% better SLO compliance under strict latency targets show the design can move the needle on utilization when the memory savings outweigh the added data movement. The optimizations look practical and focused on the critical path rather than generic. The soft spots sit in the experimental grounding. The headline numbers rest on the assumption that cross-server communication and scheduling costs stay small, yet the abstract gives little on the exact interconnect, request traces, MoE scales tested, or the precise baselines used for comparison. If those gains were measured only on high-bandwidth links and narrow workloads, they may shrink elsewhere. The paper would benefit from clearer breakdowns of tail-latency contributions from the disaggregation layer itself. This work is aimed at systems people building production LLM inference stacks, especially anyone already wrestling with multi-tenant customization at scale. A reader who needs concrete ideas for memory-efficient serving will get usable design points even if the evaluation is not yet exhaustive. It deserves peer review because the core architecture is coherent and the problem is real; referees can push on the measurement details without the idea itself being shaky.

Referee Report

2 major / 2 minor

Summary. The paper presents InfiniLoRA, a disaggregated multi-LoRA serving system for LLMs that decouples LoRA adapter execution from base-model inference to address high memory costs in MoE architectures. It introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, GPU-initiated communication, and specialized kernels. The central claim, supported by experiments, is an average 3.05× increase in serviceable request rate under strict latency SLOs together with a 54% improvement in the fraction of adapters meeting those SLOs.

Significance. If the measured gains prove robust, the disaggregation approach could meaningfully advance scalable multi-tenant LLM serving by improving memory efficiency and latency compliance for memory-intensive MoE models without proportional hardware increases. The work supplies a concrete systems design and empirical quantification that could inform future inference-stack research.

major comments (2)

[§4] §4 (Evaluation): The headline 3.05× request-rate and +54% SLO-compliance claims are presented without a latency-component breakdown or explicit measurement showing that GPU-initiated communication and cross-server data movement remain negligible relative to the memory savings; this is load-bearing because the architecture's premise is that disaggregation overheads do not inflate tail latencies under the strict SLOs.
[§4.2, §5] §4.2 and §5: No ablation or sensitivity results are reported for different interconnect bandwidths, MoE expert counts, or request-trace characteristics, leaving open whether the reported gains generalize beyond the specific high-bandwidth hardware and workloads tested.

minor comments (2)

[Abstract, §4] The abstract and §4 omit the precise numerical SLO thresholds, model sizes, and baseline system versions used, which would aid immediate interpretation of the quantitative results.
[§3.3] Notation for the SLO-driven provisioning algorithm in §3.3 could be clarified with a small pseudocode listing or explicit variable definitions to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback highlights important aspects of our evaluation that we will strengthen in the revision. We address each major comment below and commit to incorporating the suggested improvements.

read point-by-point responses

Referee: [§4] §4 (Evaluation): The headline 3.05× request-rate and +54% SLO-compliance claims are presented without a latency-component breakdown or explicit measurement showing that GPU-initiated communication and cross-server data movement remain negligible relative to the memory savings; this is load-bearing because the architecture's premise is that disaggregation overheads do not inflate tail latencies under the strict SLOs.

Authors: We agree that an explicit breakdown of latency components would strengthen the central claims. While the manuscript describes the GPU-initiated communication and specialized kernels as critical-path optimizations intended to keep disaggregation overheads low, we did not provide a quantitative decomposition isolating communication versus computation times under load. In the revised version we will add a new subsection (or expanded figure) in §4 that reports per-component latency measurements (base-model inference, LoRA execution, GPU-initiated transfers, and cross-server movement) across the evaluated request rates. These measurements will explicitly show that the added communication remains a small fraction of the SLO budget and does not drive the observed tail-latency improvements. revision: yes
Referee: [§4.2, §5] §4.2 and §5: No ablation or sensitivity results are reported for different interconnect bandwidths, MoE expert counts, or request-trace characteristics, leaving open whether the reported gains generalize beyond the specific high-bandwidth hardware and workloads tested.

Authors: We acknowledge that the current evaluation is performed on a single high-bandwidth cluster and a limited set of MoE configurations and traces. To address generalizability, the revised manuscript will include additional sensitivity experiments. We will report results for (1) reduced interconnect bandwidth (e.g., PCIe-only versus NVLink), (2) varying numbers of MoE experts, and (3) alternative request-trace distributions. These results will be placed in an expanded §4.2 with corresponding discussion in §5, allowing readers to assess how the gains scale with hardware and workload parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation with no derivation chain

full rationale

The paper presents a disaggregated LoRA serving architecture for MoE models, supported by experimental measurements of request-rate gains and SLO compliance. No mathematical derivations, fitted parameters, or equations are described that could reduce to self-definition or fitted inputs. Claims rest on external benchmarks (measured throughput and latency under specific workloads) rather than internal construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the abstract or context to justify core results. This is a standard empirical systems paper whose validity is testable via replication on the reported hardware and traces.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented physical entities are stated in the abstract; the work rests on standard assumptions of distributed systems engineering and the existence of the described hardware kernels.

pith-pipeline@v0.9.0 · 5466 in / 1101 out tokens · 49136 ms · 2026-05-10T17:19:21.433509+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate provisioning based on ... Immediate Admissibility Rate (IAR) ... Poissonized model ... dynamic programming

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 35 canonical work pages · 5 internal anchors

[1]

Anonymous. 2026. Understanding LoRA As Knowledge Memory: An Empirical Analysis.https://openreview.net/forum?id=i1Mi2R1TsU

2026
[2]

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, and Humphrey Shi. 2025. MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models. arXiv:2511.20629 [cs.CV]https://arxiv.org/abs/2511.20629

work page arXiv 2025
[3]

Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC]https://arxiv.org/abs/2310.18547

work page arXiv 2023
[4]

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024. LongLoRA: Efficient Fine-tuning of Long- Context Large Language Models. arXiv:2309.12307 [cs.CL]https: //arxiv.org/abs/2309.12307

work page arXiv 2024
[5]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer
[7]

InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23)

QLORA: efficient finetuning of quantized LLMs. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 441, 28 pages
[8]

Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, and Junchen Jiang. 2025. PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles(Lotte Hotel World, S...

work page doi:10.1145/3731569.3764834 2025
[9]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity. arXiv:2101.03961 [cs.LG]https://arxiv.org/abs/2101. 03961

work page internal anchor Pith review arXiv 2022
[10]

Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. 2025. How to Train Long-Context Language Models (Effectively). InACL

2025
[11]

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2024. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. arXiv:2406.17565 [cs.DC] https://arxiv.org/abs/2406.17565

work page arXiv 2024
[12]

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Chenxi Wang, Jiang Xu, Shuang Chen, Hao Feng, Sa Wang, Yungang Bao, Ninghui Sun, and Yizhou Shan. 2025. ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads.ACM Trans. Archit. Code Optim.22, 2, Article 77 (July 2025), 24 pages. doi:10.1145/3732941

work page doi:10.1145/3732941 2025
[13]

Chenghao Hu, Yufei Kang, and Baochun Li. 2025. Communication- Efficient MoE Fine-Tuning with Locality-Aware Expert Placement. In2025 IEEE 45th International Conference on Distributed Computing Systems (ICDCS). 166–176. doi:10.1109/ICDCS63083.2025.00025

work page doi:10.1109/icdcs63083.2025.00025 2025
[14]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low- Rank Adaptation of Large Language Models. InInternational Confer- ence on Learning Representations.https://openreview.net/forum?id= nZeVKeeFYf9

2022
[15]

Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, and Josep Torrellas. 2025. Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environ- ments. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Machinery, New York, NY, USA, 217–231...

work page doi:10.1145/3725843.3756083 2025
[16]

Shashwat Jaiswal, Shrikara Arun, Anjaly Parayil, Ankur Mallick, Spyros Mastorakis, Alind Khare, Chloi Alverti, Renee St Amant, Chetan Bansal, Victor Rühle, and Josep Torrellas. 2025. Serving Het- erogeneous LoRA Adapters in Distributed LLM Inference Systems. arXiv:2511.22880 [cs.DC]https://arxiv.org/abs/2511.22880

work page arXiv 2025
[17]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Men- sch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, and Yunxin Liu
[19]

Lora-switch: Boosting the efficiency of dy- namic llm adapters via system-algorithm co-design.arXiv preprint arXiv:2405.17741, 2024

LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design. arXiv:2405.17741 [cs.AI]https: //arxiv.org/abs/2405.17741

work page arXiv
[20]

Xiaoyu Kong, Jiancan Wu, An Zhang, Leheng Sheng, Hui Lin, Xiang Wang, and Xiangnan He. 2024. Customizing language models with instance-wise LoRA for sequential recommendation. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 3593...

2024
[21]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica
[22]

InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. 14 InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
[23]

Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2025. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue. arXiv:2406.05925 [cs.CL]https://arxiv.org/ abs/2406.05925

work page arXiv 2025
[24]

Suyi Li, Yifan Qiao, Jiacheng Ma, Shan Yu, Haoran Ma, Ziming Liu, Hang Ren, Wenguang Chen, Yongwei Wu, Weimin Zheng, and Kang Chen. 2025. Toppings: Modular and Extensible Serverless Function Delivery at High Speed. In2025 USENIX Annual Technical Conference (USENIX ATC 25). USENIX Association, Seattle, WA.https://www. usenix.org/conference/atc25/presentati...

2025
[25]

Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guo- hong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guan- jing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, and Yunxin Liu. 2024. Personal LLM Agents: Insights and Survey about the...

work page arXiv 2024
[26]

2022.Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

Pak Markthub, Jim Dinan, Sreeram Potluri, and Seth How- ell. 2022.Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. NVIDIA.https://developer.nvidia.com/blog/improving-network- performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem- and-gpudirect-async/

2022
[27]

OpenAI. 2025. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL]https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2025. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (Buenos Aires, Argentina)(ISCA ’24). IEEE Press, 118–132. doi:10. 1109/ISCA59077.2024.00019

work page arXiv 2025
[29]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Moon- cake: Trading More Storage for Less Computation — A KVCache- centric Architecture for Serving LLM Chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 155–170.https://w...

2025
[30]

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of- Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596 [cs.LG]https://arxiv.org/abs/2201.05596

work page arXiv 2022
[31]

John Schulman and Thinking Machines Lab. 2025. LoRA Without Regret.Thinking Machines Lab: Connectionism(2025). doi:10.64434/ tml.20250929https://thinkingmachines.ai/blog/lora/

2025
[32]

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vin- cent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, and Denny Zhou. 2023. Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models. a...

work page arXiv 2023
[33]

2025.EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, and Ang Li. 2025.EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices. Association for Computing Machinery, New York, NY, USA, 138–153.https://doi.org/10.1145/ 3711875.3729141

work page arXiv 2025
[34]

Gonzalez, and Ion Stoica

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters.arXiv preprint arXiv:2311.03285(2023)

work page arXiv 2023
[35]

Ge Shi, Hanieh Sadri, Qian Wang, Yu Zhang, Ying Xiong, Yong Zhang, and Zhenan Fan. 2025. ExpertWeave: Efficiently Serving Expert- Specialized Fine-Tuned Adapters at Scale. arXiv:2508.17624 [cs.DC] https://arxiv.org/abs/2508.17624

work page arXiv 2025
[36]

Smith, Luke Zettlemoyer, Wen tau Yih, and Mike Lewis

Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen tau Yih, and Mike Lewis. 2024. In-Context Pretraining: Language Modeling Beyond Document Boundaries. InThe Twelfth International Confer- ence on Learning Representations.https://openreview.net/forum?id= LXVswInHOo

2024
[37]

Xiao Shi, Jiangsu Du, Zhiguang Chen, and Yutong Lu. 2025. AuLoRA: Fine-Grained Loading and Computation Orchestration for Efficient LoRA LLM Serving. In2025 IEEE 43rd International Conference on Computer Design (ICCD). 277–284. doi:10.1109/ICCD65941.2025.00046

work page doi:10.1109/iccd65941.2025.00046 2025
[38]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

The Mosaic Research Team. 2024. Introducing DBRX: A New State- of-the-Art Open LLM.https://www.databricks.com/blog/introducing- dbrx-new-state-art-open-llm. Accessed: 2026-01-21

2024
[40]

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2025. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). ACM, Tor...

work page doi:10.1145/3711896.3737413 2025
[41]

Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He
[42]

arXiv preprint arXiv:2502.00592 , year=

M+: Extending MemoryLLM with Scalable Long-Term Memory. arXiv:2502.00592 [cs.CL]https://arxiv.org/abs/2502.00592

work page arXiv
[43]

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. 2025. LoRA-Pro: Are Low-Rank Adapters Properly Optimized?. InThe Thir- teenth International Conference on Learning Representations (ICLR)

2025
[44]

Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. In18th USENIX Symposium on Oper- ating Systems Design and Implementation (OSDI 24). USENIX Associa- tion, Santa Clara, CA, 911–927.https://www.usenix.org/conference/ osdi24/presentation/wu-bingyang

2024
[45]

Shuaipeng Wu, Yanying Lin, Shijie Peng, Wenyan Chen, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, and Kejiang Ye. 2025. Rock: Serving Multimodal Models in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters. In2025 IEEE International Conference on Cluster Computing (CLUSTER). 1–13. doi:10.1109/CLUSTER59342.2025.11186463

work page doi:10.1109/cluster59342.2025.11186463 2025
[46]

Lingnan Xia and Hua Ma. 2024. Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU.IEEE Access12 (2024), 160441–160449. doi:10.1109/ACCESS. 2024.3483250

work page doi:10.1109/access 2024
[47]

Yifei Xia, Fangcheng Fu, Wentao Zhang, Jiawei Jiang, and Bin Cui
[48]

InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada) (NIPS ’24)

Efficient multi-task LLM quantization and serving for multiple LoRA adapters. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 2034, 29 pages

2034
[49]

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv:2410.10819 [cs.CL]https://arxiv.org/abs/2410.10819

work page arXiv 2024
[50]

Yahe Yang, Chunliang Tao, and Xiaojing Fan. 2025. LoRA-LiteE: A Computationally Efficient Framework for Chatbot Preference-Tuning. arXiv:2411.09947 [cs.CL]https://arxiv.org/abs/2411.09947

work page arXiv 2025
[52]

Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo. 2025. Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management. arXiv:2505.03756 [cs.AR]https://arxiv.org/abs/2505.03756

work page arXiv 2025
[53]

Tianyu Zhang, Peng Zhang, Yusong Gao, and Yun Zhang. 2025. To- gether with SGLang: Best Practices for Serving DeepSeek-R1 on H20- 96G.https://lmsys.org/blog/2025-09-26-sglang-ant-group/. LMSYS Org Blog

2025
[54]

You Zhang, Jin Wang, Liang-Chih Yu, Dan Xu, and Xuejie Zhang
[55]

Proceedings of the AAAI Conference on Artificial Intelligence38, 17 (Mar

Personalized LoRA for Human-Centered Text Understanding. Proceedings of the AAAI Conference on Artificial Intelligence38, 17 (Mar. 2024), 19588–19596. doi:10.1609/aaai.v38i17.29931

work page doi:10.1609/aaai.v38i17.29931 2024
[56]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serv- ing. InProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation(Santa Clara, CA, USA)(OSDI’24). USENIX Association, USA, Art...

2024
[57]

Changhai Zhou, Yuhua Zhou, Shiyang Zhang, Yibin Wang, and Zekai Liu. 2025. Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving.Proceedings of the AAAI Conference on Artificial Intelligence39, 21 (Apr. 2025), 22910–22918. doi:10.1609/aaai.v39i21. 34453

work page doi:10.1609/aaai.v39i21 2025
[58]

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Dis- aggregated Expert Parallelism...

work page doi:10.1145/3718958.3750506 2025
[59]

Ruidong Zhu, Ziyue Jiang, Zhi Zhang, Xin Liu, Xuanzhe Liu, and Xin Jin. 2025. Cannikin: No Lagger of SLO in Concurrent Multiple LoRA LLM Serving.IEEE Transactions on Parallel and Distributed Systems 36, 9 (2025), 1972–1984. doi:10.1109/TPDS.2025.3590014 16 InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models A Appendix A.1 Additional Sca...

work page doi:10.1109/tpds.2025.3590014 2025