Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Pith reviewed 2026-05-08 16:41 UTC · model grok-4.3
The pith
Coral reduces the cost of serving multiple LLMs on mixed cloud GPUs by jointly optimizing allocation and per-replica strategies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Coral jointly optimizes resource allocation to models and the serving strategy of each replica across all models at once. A lossless two-stage decomposition preserves the optimality of this joint problem while reducing solve time from hours to tens of seconds, enabling adaptation to changing throughput demand and resource availability. The resulting system lowers serving cost by up to 2.79 times over the strongest baseline and raises goodput by up to 2.39 times when resources are scarce.
What carries the argument
The lossless two-stage decomposition that splits the joint optimization of resource allocation and serving strategies while keeping the original optimal solution intact and allowing fast re-solving.
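To make the structure of that joint problem concrete, here is a minimal sketch in Python. Every name and number in it is invented for illustration (model names, GPU types, throughputs, prices, capacities); the exhaustive `joint_optimum` stands in for the full joint formulation, and `two_stage` shows only the shape of a decomposition, not the paper's lossless construction, so the naive stage-1 heuristic below is not guaranteed to match the joint optimum.

```python
from itertools import product

# --- Toy instance: all values are illustrative, not from the paper ---
MODELS = ["chat", "code"]
GPU_TYPES = ["A10G", "A100"]
STRATEGIES = ["tp1", "tp2"]                    # per-replica serving strategies
GPUS_PER = {"tp1": 1, "tp2": 2}                # GPUs one replica occupies
GPU_COST = {"A10G": 1.0, "A100": 3.0}          # $/GPU-hour
CAPACITY = {"A10G": 6, "A100": 4}              # GPUs available per type
DEMAND = {"chat": 12.0, "code": 10.0}          # required throughput (req/s)
TPUT = {                                       # profiled per-replica throughput
    ("chat", "A10G", "tp1"): 4.0,  ("chat", "A10G", "tp2"): 7.0,
    ("chat", "A100", "tp1"): 10.0, ("chat", "A100", "tp2"): 18.0,
    ("code", "A10G", "tp1"): 3.0,  ("code", "A10G", "tp2"): 5.5,
    ("code", "A100", "tp1"): 8.0,  ("code", "A100", "tp2"): 14.0,
}

def plan_cost(plan):
    """plan maps (model, gpu_type) -> (strategy, replica_count); returns
    $/hour if the plan meets demand within capacity, else +inf."""
    used = {g: 0 for g in GPU_TYPES}
    served = {m: 0.0 for m in MODELS}
    cost = 0.0
    for (m, g), (s, n) in plan.items():
        used[g] += n * GPUS_PER[s]
        served[m] += n * TPUT[(m, g, s)]
        cost += n * GPUS_PER[s] * GPU_COST[g]
    ok = all(used[g] <= CAPACITY[g] for g in GPU_TYPES) and \
         all(served[m] >= DEMAND[m] for m in MODELS)
    return cost if ok else float("inf")

def joint_optimum(max_replicas=4):
    """One search over strategies AND replica counts together."""
    keys = [(m, g) for m in MODELS for g in GPU_TYPES]
    choices = [(s, n) for s in STRATEGIES for n in range(max_replicas + 1)]
    best_cost, best_plan = float("inf"), None
    for combo in product(choices, repeat=len(keys)):
        plan = dict(zip(keys, combo))
        c = plan_cost(plan)
        if c < best_cost:
            best_cost, best_plan = c, plan
    return best_cost, best_plan

def two_stage(max_replicas=4):
    """Stage 1 fixes one strategy per (model, gpu_type) by throughput per
    dollar; stage 2 searches replica counts only. A naive split for
    illustration; the paper's decomposition is constructed to preserve the
    joint optimum, which this heuristic only approximates."""
    stage1 = {(m, g): max(STRATEGIES,
                          key=lambda s: TPUT[(m, g, s)] / (GPUS_PER[s] * GPU_COST[g]))
              for m in MODELS for g in GPU_TYPES}
    keys = list(stage1)
    best_cost, best_plan = float("inf"), None
    for counts in product(range(max_replicas + 1), repeat=len(keys)):
        plan = {k: (stage1[k], n) for k, n in zip(keys, counts)}
        c = plan_cost(plan)
        if c < best_cost:
            best_cost, best_plan = c, plan
    return best_cost, best_plan

print("joint optimum     :", joint_optimum()[0], "$/hour")
print("two-stage (naive) :", two_stage()[0], "$/hour")
```

The point of the sketch is the combinatorics: the joint search couples every model's strategy choice to every other model's allocation through shared capacity, which is why the full problem is slow and why a decomposition that provably loses nothing is the load-bearing contribution.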
If this is right
- Serving costs drop when allocation decisions and per-replica strategies are chosen together instead of separately for each model.
- Goodput increases under tight resources because hardware is matched more precisely to the needs of all models simultaneously.
- The two-stage method lets the system react to demand changes in seconds while still reaching the same quality of solution as the full problem.
- These improvements appear consistently across evaluations with six models and twenty GPU configurations.
Where Pith is reading between the lines
- The decomposition approach could speed up other joint optimization problems that arise in distributed systems where decisions interact across components.
- Lower costs on varied hardware may let smaller operators run multiple LLMs without depending only on the most expensive GPUs.
- The emphasis on continuous tracking suggests that production systems would benefit from better telemetry on both request patterns and hardware state.
Load-bearing premise
The two-stage decomposition always produces the same optimal allocation and strategies as the full joint problem, and real-time tracking of demand and GPU availability is accurate enough to support the adaptive choices.
What would settle it
A live deployment on a heterogeneous GPU cluster that logs actual serving cost and goodput under rapidly varying request rates from multiple models, then checks whether the measured savings match the claimed factors over non-adaptive baselines.
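Such a check is mostly bookkeeping over the deployment logs. Below is a minimal sketch with an invented log schema, invented GPU prices, and made-up numbers: it computes goodput (SLO-attaining requests per second) and dollar cost for a measurement window, then the ratios one would compare against the claimed factors.

```python
from dataclasses import dataclass

@dataclass
class Request:
    model: str
    latency_s: float      # measured end-to-end latency
    slo_s: float          # latency SLO attached to this request

def goodput(requests, window_s):
    """Throughput counting only requests that met their SLO (req/s)."""
    return sum(r.latency_s <= r.slo_s for r in requests) / window_s

def dollar_cost(gpu_hours, price_per_hour):
    """Total cost of the GPU-hours consumed during the window."""
    return sum(price_per_hour[g] * h for g, h in gpu_hours.items())

# Made-up numbers; the check is whether these measured ratios approach the
# reported 2.79x (cost) and 2.39x (goodput) factors over non-adaptive baselines.
price = {"A10G": 1.0, "A100": 3.0}
coral_cost = dollar_cost({"A10G": 40, "A100": 4}, price)    # 52 $ this window
baseline_cost = dollar_cost({"A100": 30}, price)            # 90 $ this window
print("cost reduction factor:", baseline_cost / coral_cost)

coral_logs = [Request("chat", 0.8, 1.0), Request("chat", 1.4, 1.0),
              Request("code", 2.0, 2.5)]
print("goodput over a 10 s window:", goodput(coral_logs, 10.0), "req/s")
```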
read the original abstract
The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79× over the best baseline, and delivers up to 2.39× higher goodput under scarce resource availability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Coral, a system for cost-efficient concurrent serving of multiple LLMs on heterogeneous cloud GPUs. It jointly optimizes resource allocation across models and the serving strategy (e.g., batching, parallelism) for each model replica. To handle dynamic demands, it proposes a lossless two-stage decomposition of the joint optimization problem that reduces solve time from hours to tens of seconds while preserving optimality. Evaluation across 6 models and 20 GPU configurations reports up to 2.79× lower serving cost versus the best baseline and up to 2.39× higher goodput under scarce resources.
Significance. If the lossless decomposition and empirical gains hold under rigorous validation, the work would advance practical multi-LLM serving by better exploiting mid-tier and older GPUs that offer strong price/performance. It targets a timely problem in cloud systems where LLM workloads are fragmented and hardware heterogeneity is increasing.
major comments (2)
- [Two-stage decomposition and solver (abstract and §4–5)] The central claim that the two-stage decomposition is lossless and recovers the exact joint optimum (resource allocation + per-replica serving strategy) is load-bearing for attributing the reported 2.79× cost and 2.39× goodput gains to joint optimization rather than heuristic effects. No direct optimality verification—such as matching objective values or solutions against the joint formulation on small instances—is described, leaving open the possibility that throughput-demand tracking or GPU-type separation introduces suboptimality.
- [Evaluation (§6)] The evaluation claims results across 6 models and 20 GPU configurations, yet supplies no details on baseline implementations (e.g., how vLLM or other systems were configured for heterogeneity), workload traces, statistical methods (repetitions, confidence intervals), or how the solver was validated against ground-truth optima. This weakens the data-to-claim link for the headline performance numbers.
minor comments (1)
- [Notation and problem formulation] Notation for variables such as throughput demand, goodput, and GPU-type constraints should be defined consistently in a table or early section to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the lossless decomposition and evaluation details. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Two-stage decomposition and solver (abstract and §4–5)] The central claim that the two-stage decomposition is lossless and recovers the exact joint optimum (resource allocation + per-replica serving strategy) is load-bearing for attributing the reported 2.79× cost and 2.39× goodput gains to joint optimization rather than heuristic effects. No direct optimality verification—such as matching objective values or solutions against the joint formulation on small instances—is described, leaving open the possibility that throughput-demand tracking or GPU-type separation introduces suboptimality.
Authors: We appreciate this observation. The two-stage decomposition is proven lossless via a formal argument in §4 that shows the decomposed subproblems recover the exact optimum of the original joint formulation (no suboptimality is introduced by the separation of resource allocation from per-replica strategy selection). To provide the requested empirical verification, we will add an appendix containing small-instance experiments: we solve the full joint ILP on reduced problem sizes (e.g., 2–3 models and 4–6 GPUs) using a commercial solver and compare both objective values and recovered solutions against the two-stage method, confirming exact matches within numerical tolerance. This will directly support attribution of the reported gains to joint optimization. revision: yes
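For readers who want the shape of that small-instance check, here is a minimal sketch of the joint problem as an ILP on a toy instance, using the open-source PuLP/CBC toolchain as a stand-in for the commercial solver the authors mention. All model names, throughputs, prices, and capacities are invented, and the two-stage solver whose objective and solution would be compared against this reference is not shown.

```python
from pulp import (LpProblem, LpMinimize, LpVariable, lpSum, LpStatus,
                  PULP_CBC_CMD, value)

MODELS = ["m0", "m1"]                              # 2 models (reduced size)
GPUS = {"g_small": (1.0, 3), "g_big": (3.0, 2)}    # type -> ($/GPU-hr, count)
STRATS = {"tp1": 1, "tp2": 2}                      # strategy -> GPUs per replica
TPUT = {                                           # profiled per-replica throughput
    ("m0", "g_small", "tp1"): 3.0, ("m0", "g_small", "tp2"): 5.5,
    ("m0", "g_big",   "tp1"): 8.0, ("m0", "g_big",   "tp2"): 14.0,
    ("m1", "g_small", "tp1"): 2.0, ("m1", "g_small", "tp2"): 3.6,
    ("m1", "g_big",   "tp1"): 6.0, ("m1", "g_big",   "tp2"): 11.0,
}
DEMAND = {"m0": 10.0, "m1": 6.0}

prob = LpProblem("joint_allocation_and_strategy", LpMinimize)
# x[m, g, s] = number of replicas of model m on GPU type g using strategy s.
x = {(m, g, s): LpVariable(f"x_{m}_{g}_{s}", lowBound=0, cat="Integer")
     for m in MODELS for g in GPUS for s in STRATS}
# Objective: total $/hour of GPUs occupied by all replicas.
prob += lpSum(x[m, g, s] * STRATS[s] * GPUS[g][0]
              for m in MODELS for g in GPUS for s in STRATS)
# Each model's aggregate throughput must cover its demand.
for m in MODELS:
    prob += lpSum(x[m, g, s] * TPUT[(m, g, s)]
                  for g in GPUS for s in STRATS) >= DEMAND[m]
# Replicas cannot use more GPUs of a type than are available.
for g in GPUS:
    prob += lpSum(x[m, g, s] * STRATS[s]
                  for m in MODELS for s in STRATS) <= GPUS[g][1]

prob.solve(PULP_CBC_CMD(msg=0))
print(LpStatus[prob.status], "joint objective =", value(prob.objective))
for k, v in x.items():                 # recovered allocation and strategies
    if v.value() and v.value() > 0:
        print(k, int(v.value()))
# The losslessness check: the two-stage solver's objective and recovered plan
# should match this reference within numerical tolerance.
```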
-
Referee: [Evaluation (§6)] The evaluation claims results across 6 models and 20 GPU configurations, yet supplies no details on baseline implementations (e.g., how vLLM or other systems were configured for heterogeneity), workload traces, statistical methods (repetitions, confidence intervals), or how the solver was validated against ground-truth optima. This weakens the data-to-claim link for the headline performance numbers.
Authors: We agree that additional implementation and methodological details will improve reproducibility and strengthen the claims. Section 6 already enumerates the six models and twenty GPU configurations, but we will expand it to include: explicit baseline configurations (e.g., how vLLM and other systems were adapted to heterogeneous GPUs via manual resource partitioning), workload trace descriptions (synthetic Poisson arrivals plus production-derived traces), statistical procedures (five independent runs per configuration with 95% confidence intervals), and the small-instance solver validation results referenced in the first comment. These clarifications will be added in the revised manuscript. revision: yes
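The workload and statistics pieces of that plan are straightforward to make concrete. The sketch below, with invented rates and a toy metric, shows synthetic Poisson arrivals and a five-run 95% confidence interval of the kind the authors commit to; it is not the paper's trace pipeline.

```python
import math
import random
import statistics

def poisson_arrivals(rate_per_s, duration_s, rng):
    """Synthetic Poisson arrival times via exponential inter-arrival gaps."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t > duration_s:
            return times
        times.append(t)

def mean_ci95_5runs(samples):
    """Mean and 95% confidence half-width for 5 repetitions, using the
    Student t critical value for 4 degrees of freedom (2.776)."""
    m = statistics.mean(samples)
    half = 2.776 * statistics.stdev(samples) / math.sqrt(len(samples))
    return m, half

rng = random.Random(0)
# Five independent runs; the "metric" here is just the observed request rate.
runs = [len(poisson_arrivals(rate_per_s=20.0, duration_s=60.0, rng=rng)) / 60.0
        for _ in range(5)]
mean, half = mean_ci95_5runs(runs)
print(f"throughput = {mean:.2f} ± {half:.2f} req/s (95% CI over 5 runs)")
```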
Circularity Check
No circularity: decomposition and evaluation are independent of final metrics
full rationale
The paper formulates a joint optimization problem over resource allocation and per-replica serving strategies, then introduces a two-stage decomposition asserted to preserve optimality. Performance numbers (2.79× cost, 2.39× goodput) are obtained from direct empirical runs on 6 models and 20 GPU configurations. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citation chain is invoked to justify uniqueness or the decomposition, and the lossless property is presented as a technical claim rather than a definitional tautology. The reasoning is therefore checked against external measurements rather than closing on itself.