Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Chunming Hu; Jinghao Wang; Renyu Yang; Tianyu Wo; Xiaoyang Sun; Xiao Zhou; Xu Wang; Yihui Zhang; Yilong Li

arxiv: 2606.12950 · v1 · pith:QEI4DBQPnew · submitted 2026-06-11 · 💻 cs.DC

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang , Xiao Zhou , Xiaoyang Sun , Yihui Zhang , Yilong Li , Tianyu Wo , Xu Wang , Chunming Hu

show 1 more author

Renyu Yang

This is my paper

Pith reviewed 2026-06-27 06:07 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM multi-agent systemsworkload-aware schedulingcross-cluster schedulingKV cache reservationSLO attainmenthierarchical schedulingelastic memory provisioningmulti-model co-location

0 comments

The pith

Maestro schedules LLM multi-agent workflows using per-stage predictions of output length and memory to reduce reserved high-bandwidth memory and improve deadline attainment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Maestro, a scheduling system for LLM-based multi-agent systems that must handle iterative, input-dependent pipelines under limited GPUs. It predicts the output length and memory needs of each workflow stage and uses these predictions to guide decisions at node, cluster, and global levels. At nodes it co-locates models with dynamic caching and elastic memory; at clusters it routes to avoid delays; globally it prioritizes to prevent blocking. Experiments on prototypes and simulations demonstrate 67.2% less KV-reservation HBM and 23.6 percentage points better SLO attainment than earliest-deadline-first scheduling.

Core claim

Maestro explicitly leverages agent semantics by predicting output length and memory usage of each stage to drive a hierarchical scheduler. This enables dynamic multi-model co-location via hierarchical weight caching and elastic memory provisioning at the node level, latency-aware routing at the cluster level to avoid cold-start delays and overloads, and workflow-aware prioritization at the global level to minimize head-of-line blocking. The result is substantially lower memory reservations and higher service level objective attainment in high-contention scenarios.

What carries the argument

The hierarchical scheduler driven by predictions of per-stage output length and memory usage, with node-level co-location, cluster-level routing, and global prioritization.

If this is right

Dynamic multi-model co-location via weight caching and elastic provisioning reduces memory fragmentation and over-provisioning.
Latency-aware routing across clusters prevents cold-start delays and memory overloads.
Workflow-aware prioritization at the global level minimizes head-of-line blocking for interactive tasks.
Overall resource efficiency improves under strict GPU budgets for multi-agent workloads.
These techniques allow better handling of non-deterministic decode costs and heavy-tailed requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Accurate stage predictions could be extended to other iterative AI pipelines beyond multi-agent systems.
Cloud operators might adopt similar hierarchical approaches when deploying agentic applications at scale.
Further gains may come from integrating better prediction models or feedback loops from actual execution.
The approach highlights the value of incorporating workflow semantics into scheduling decisions for variable-cost tasks.

Load-bearing premise

The predictions of output length and memory usage of each stage are accurate enough to drive effective hierarchical scheduling decisions without introducing new bottlenecks or misallocations.

What would settle it

Running the system with deliberately inaccurate predictions of stage lengths and memory needs and observing whether the reported memory savings and SLO improvements disappear.

Figures

Figures reproduced from arXiv: 2606.12950 by Chunming Hu, Jinghao Wang, Renyu Yang, Tianyu Wo, Xiaoyang Sun, Xiao Zhou, Xu Wang, Yihui Zhang, Yilong Li.

**Figure 1.** Figure 1: Output-token length distributions under non-CoT and CoT settings, tool-call and non-tool-call. 0 1000 2000 3000 4000 Output Token Length 0 50 100 150 TTLT (s) 400 600 800 1000 1200 Input Token Length 0.02 0.04 0.06 TTFT (s) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 3.** Figure 3: Model-invocation long tail (left) from Chatbot Arena traces [ [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 5.** Figure 5: The system architecture and workflow of Maestro. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: Agent-aware output-length prediction architecture in Maestro. [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

**Figure 7.** Figure 7: Overall scheduling results across arrival rates and batch ratios: SLO [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: GPU utilization and memory usage under multi-model colocation. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Prediction overhead (P50/P95) for methods with BERT encoders. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Model activation latency (0.6B–14B) on a single A100 40GB GPU. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

read the original abstract

Large Language Model based Multi-Agent Systems (LLM-MAS) have emerged as a powerful paradigm for tackling complex tasks by breaking them into collaborative workflows of specialized LLM-powered agents. However, deploying such multi-agent workloads at scale poses significant system challenges. Each user query spawns an iterative pipeline of LLM calls, greatly amplifying resource consumption compared to single-turn queries. In resource-constrained cloud settings, these workflows face non-deterministic and input-dependent costs at decode stage, heavy-tailed multi-model requirements with memory fragmentation and over-provisioning, and cross-cluster scheduling trade-offs. We present Maestro, a workload-aware scheduling system designed for LLM-MAS serving under strict GPU budgets. Maestro explicitly leverages agent semantics and roles: it predicts the output length and memory usage of each stage and uses this prediction to drive a hierarchical scheduler. At the node level, Maestro enables dynamic multi-model co-location via hierarchical weight caching and elastic memory provisioning. At the cluster level, it performs latency-aware routing to avoid cold-start delays and memory overloads. At the global level, it enforces workflow-aware prioritization to minimize head-of-line blocking for interactive tasks. Across prototype experiments and trace-driven simulations, Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Maestro adds a prediction-driven hierarchical scheduler for LLM multi-agent workloads that targets memory fragmentation and SLOs, but the gains rest on unvalidated per-stage length and memory forecasts.

read the letter

Maestro builds a scheduler that predicts output length and memory per agent stage, then uses those forecasts for node-level co-location with hierarchical weight caching and elastic provisioning, cluster-level latency-aware routing, and global workflow prioritization. The abstract reports 67% lower KV-reservation HBM and 23.6-point better SLO attainment versus EDF under high contention.

The work is new in its explicit focus on multi-agent LLM workflows rather than single-model serving. It combines agent semantics with three-level decisions that directly address heavy-tailed memory needs and cross-cluster trade-offs, which existing EDF or simple caching approaches do not handle together.

The central assumption is that the length and memory predictors are accurate enough to avoid misallocations. The abstract itself flags non-deterministic decode costs and input-dependent behavior, so any material prediction error would increase fragmentation or cold starts and erase the reported gains. No predictor details, error rates, or ablation on prediction quality appear in the text. The experiments are described only at high level as prototypes plus trace-driven simulations, with no dataset sizes, variance numbers, or validation of the predictors themselves.

If the full paper supplies reproducible predictor accuracy data and controlled experiments that isolate the contribution of each level, the claims become testable. Otherwise the headline numbers remain provisional.

This is for systems people who run multi-agent LLM services under tight GPU limits. It is worth a serious referee to verify the experimental controls and prediction robustness, even if revisions are needed.

Referee Report

2 major / 0 minor

Summary. The paper presents Maestro, a workload-aware cross-cluster scheduling system for LLM-based multi-agent systems (LLM-MAS). It predicts per-stage output length and memory usage to drive a three-level hierarchical scheduler: node-level dynamic multi-model co-location via hierarchical weight caching and elastic provisioning; cluster-level latency-aware routing to avoid cold starts and overloads; and global-level workflow-aware prioritization to reduce head-of-line blocking. The system targets non-deterministic decode costs, heavy-tailed multi-model memory fragmentation, and cross-cluster trade-offs under strict GPU budgets. Prototype experiments and trace-driven simulations are reported to yield a 67.2% reduction in KV-reservation HBM and a 23.6 percentage point improvement in high-contention SLO attainment versus EDF.

Significance. If the prediction accuracy and experimental results hold under rigorous validation, the work would be significant for efficient serving of iterative multi-agent LLM workflows in memory-constrained environments. The explicit use of agent semantics and roles to inform scheduling decisions offers a targeted approach to the fragmentation and over-provisioning problems that standard schedulers like EDF do not address. Reproducible code or machine-checked elements are not mentioned.

major comments (2)

[Abstract] Abstract: The headline claims (67.2% KV-reservation HBM reduction and 23.6pp SLO gain over EDF) rest entirely on the accuracy of the per-stage output-length and memory-usage predictors that feed the hierarchical scheduler. No experimental details, error bars, dataset descriptions, predictor validation, or sensitivity analysis appear in the abstract, making it impossible to determine whether the data support the central claims.
[Abstract] The load-bearing assumption that predictions of output length and memory usage are sufficiently accurate to enable effective node-level co-location, cluster-level routing, and global prioritization without introducing fragmentation or cold-start delays is not accompanied by any reported error metrics or ablation on prediction quality. Given the abstract's own emphasis on non-deterministic decode-stage costs and heavy-tailed requirements, this gap directly affects both reported metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract regarding predictor accuracy. We agree that the headline results depend on this component and will revise the abstract to include key validation details drawn from the body of the paper.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claims (67.2% KV-reservation HBM reduction and 23.6pp SLO gain over EDF) rest entirely on the accuracy of the per-stage output-length and memory-usage predictors that feed the hierarchical scheduler. No experimental details, error bars, dataset descriptions, predictor validation, or sensitivity analysis appear in the abstract, making it impossible to determine whether the data support the central claims.

Authors: We accept this observation. The full manuscript reports predictor validation (including MAPE for output-length prediction, memory-usage error distributions, dataset descriptions from production traces, and sensitivity ablations) in Sections 4.2 and 5.1. The abstract will be revised to concisely reference these results (e.g., “with 12.4% MAPE on output length”) so readers can assess support for the reported gains. revision: yes
Referee: [Abstract] The load-bearing assumption that predictions of output length and memory usage are sufficiently accurate to enable effective node-level co-location, cluster-level routing, and global prioritization without introducing fragmentation or cold-start delays is not accompanied by any reported error metrics or ablation on prediction quality. Given the abstract's own emphasis on non-deterministic decode-stage costs and heavy-tailed requirements, this gap directly affects both reported metrics.

Authors: We agree the abstract should address this directly. The paper contains error metrics and ablations showing that prediction errors do not materially increase fragmentation or cold-start rates under the evaluated workloads. We will add a short clause to the abstract summarizing these findings (e.g., “prediction errors remain below thresholds that would degrade co-location or routing benefits”) while retaining the length constraint. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents Maestro as an empirical scheduling system whose headline gains (67.2% KV-reservation HBM reduction and 23.6 pp SLO improvement) are obtained from prototype experiments and trace-driven simulations rather than from any closed mathematical derivation. The abstract describes the use of per-stage output-length and memory predictions to drive hierarchical decisions, but supplies no equations, fitted parameters, self-citations, or uniqueness theorems that would reduce those measured outcomes to the inputs by construction. Because the central claims rest on external, falsifiable benchmarks instead of self-referential definitions or load-bearing self-citations, the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5792 in / 1069 out tokens · 19046 ms · 2026-06-27T06:07:37.960814+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 13 canonical work pages · 9 internal anchors

[1]

Enabling scalable and adaptive ma- chine learning training via serverless computing on public cloud

Ahsan Ali et al. “Enabling scalable and adaptive ma- chine learning training via serverless computing on public cloud”.Performance Evaluation 2025

2025
[2]

Alibaba Cloud documentation

Alibaba Cloud.Internet Access Performance. Alibaba Cloud documentation
[3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai et al.Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

LongBench v2: Towards Deeper Un- derstanding and Reasoning on Realistic Long-context Multitasks

Yushi Bai et al. “LongBench v2: Towards Deeper Un- derstanding and Reasoning on Realistic Long-context Multitasks”.ACL 2025

2025
[5]

AgentVerse: Facilitating Multi- Agent Collaboration and Exploring Emergent Behav- iors

Weize Chen et al. “AgentVerse: Facilitating Multi- Agent Collaboration and Exploring Emergent Behav- iors”.ICLR 2024

2024
[6]

Enabling efficient batch serving for LMaaS via generation length prediction

Ke Cheng et al. “Enabling efficient batch serving for LMaaS via generation length prediction”.ICWS 2024

2024
[7]

Mina: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learn- ing Service

Shichen Dong et al. “Mina: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learn- ing Service”.IEEE INFOCOM 2025

2025
[8]

Improving factuality and reasoning in language models through multiagent debate

Yilun Du et al. “Improving factuality and reasoning in language models through multiagent debate”.ICML 2024

2024
[9]

MuxServe: flexible spatial- temporal multiplexing for multiple LLM serving

Jiangfei Duan et al. “MuxServe: flexible spatial- temporal multiplexing for multiple LLM serving”. ICML 2024. 10

2024
[10]

Serving DNNs like Clock- work: Performance Predictability from the Bottom Up

Arpan Gujarati et al. “Serving DNNs like Clock- work: Performance Predictability from the Bottom Up”. USENIX OSDI 2020

2020
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al.DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning. arXiv:2501.12948. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Large language model based multi- agents: a survey of progress and challenges

Taicheng Guo et al. “Large language model based multi- agents: a survey of progress and challenges”.IJCAI 2024

2024
[13]

Measuring Mathematical Prob- lem Solving With the MATH Dataset

Dan Hendrycks et al. “Measuring Mathematical Prob- lem Solving With the MATH Dataset”.NeurIPS 2021

2021
[14]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong et al. “MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework”.ICLR 2024

2024
[15]

S3: Increasing GPU Utilization during Generative Inference for Higher Throughput

Yunho Jin et al. “S3: Increasing GPU Utilization during Generative Inference for Higher Throughput”.NeurIPS 2023

2023
[16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon et al. “Efficient memory management for large language model serving with pagedattention”. SOSP 2023

2023
[17]

Parrot: Efficient serving of LLM- based applications with semantic variable

Chaofan Lin et al. “Parrot: Efficient serving of LLM- based applications with semantic variable”.OSDI 2024

2024
[18]

Efficient Serving of LLM Applications with Probabilistic Demand Modeling

Yifei Liu et al.Efficient Serving of LLM Applications with Probabilistic Demand Modeling. arXiv:2506.14851. 2025

work page internal anchor Pith review arXiv 2025
[19]

Kale: Elastic GPU Scheduling for Online DL Model Training

Ziyang Liu et al. “Kale: Elastic GPU Scheduling for Online DL Model Training”.ACM SoCC 2024

2024
[20]

Castor: Optimizing Deep Learning Job Scheduling in Multi-Tenant GPU Clusters via In- telligent Colocation

Yizhou Luo et al. “Castor: Optimizing Deep Learning Job Scheduling in Multi-Tenant GPU Clusters via In- telligent Colocation”
[21]

Prediction-assisted online distributed deep learning workload scheduling in GPU clusters

Ziyue Luo et al. “Prediction-assisted online distributed deep learning workload scheduling in GPU clusters”. IEEE INFOCOM 2025

2025
[22]

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Ziming Mao et al. “SkyServe: Serving AI Models across Regions and Clouds with Spot Instances”.EuroSys 2025

2025
[23]

O-RAN Intelligence Orches- tration Framework for Quality-Driven xApp Deploy- ment and Sharing

Federico Mungari et al. “O-RAN Intelligence Orches- tration Framework for Quality-Driven xApp Deploy- ment and Sharing”.IEEE Transactions on Mobile Com- puting 2025

2025
[24]

arXiv:2512.21487

Xinglin Pan et al.Efficient MoE Inference with Fine- Grained Scheduling of Disaggregated Expert Paral- lelism. arXiv:2512.21487. 2025

work page arXiv 2025
[25]

Queue management for slo-oriented large language model serving

Archit Patke et al. “Queue management for slo-oriented large language model serving”.ACM SoCC 2024

2024
[26]

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Haoran Qiu et al. “Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction”. ASPLOS Cloud Intelligence/AIOps Workshop 2024

2024
[27]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text”.EMNLP 2016

2016
[28]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein et al.GPQA: A Graduate-Level Google- Proof Q&A Benchmark. arXiv:2311.12022. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

FlexGen: High-Throughput Genera- tive Inference of Large Language Models with a Single GPU

Ying Sheng et al. “FlexGen: High-Throughput Genera- tive Inference of Large Language Models with a Single GPU”.ICML 2023

2023
[30]

Kaggle dataset

Pavan Subhash.IBM HR Analytics Employee Attrition & Performance. Kaggle dataset. 2017

2017
[31]

arXiv:2504.03648

The AIBrix Team et al.AIBrix: Towards Scalable, Cost- Effective Large Language Model Inference Infrastruc- ture. arXiv:2504.03648. 2025

work page arXiv 2025
[32]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al.LLaMA: Open and Efficient Foun- dation Language Models. arXiv:2302.13971. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

SELA: Smart Edge LLM Agent to Optimize Response Trade-offs of AI Assistants

Shreshth Tuli et al. “SELA: Smart Edge LLM Agent to Optimize Response Trade-offs of AI Assistants”.PACM IMWUT 2025

2025
[34]

Mixture-of-Agents Enhances Large Language Model Capabilities

Junlin Wang et al.Mixture-of-agents enhances large language model capabilities. arXiv:2406.04692. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Lan- guage Models

Lei Wang et al. “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Lan- guage Models”.ACL 2023

2023
[36]

KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale

Zeying Wang et al. “KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale”.IEEE/ACM ASE 2025

2025
[37]

Fast Distributed Inference Serving for Large Language Models

Bingyang Wu et al.Fast Distributed Inference Serving for Large Language Models. arXiv:2305.05920. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu et al.AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Qwen3 Technical Report

An Yang et al.Qwen3 Technical Report. arXiv:2505.09388. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

KAIR: A Statistical and Causal Ap- proach to Pinpointing Stragglers in Distributed Model Training

Yitang Yang et al. “KAIR: A Statistical and Causal Ap- proach to Pinpointing Stragglers in Distributed Model Training”.IEEE/ACM ASE 2025

2025
[41]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao et al. “ReAct: Synergizing reasoning and acting in language models”.ICLR 2023

2023
[42]

Orca: A Distributed Serving System for Transformer-Based Generative Models

Gyeong-In Yu et al. “Orca: A Distributed Serving System for Transformer-Based Generative Models”. USENIX OSDI 2022

2022
[43]

Torpor: GPU-enabled serverless computing for low-latency, resource-efficient infer- ence

Minchen Yu et al. “Torpor: GPU-enabled serverless computing for low-latency, resource-efficient infer- ence”.USENIX ATC 2025

2025
[44]

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Shan Yu et al. “Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving”. 2026

2026
[45]

arXiv:2501.08197

Yijiong Yu et al.OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training. arXiv:2501.08197. 2025

work page arXiv 2025
[46]

Cauchy: A Cost-Efficient LLM Serving System through Adaptive Heterogeneous De- ployment

Yihui Zhang et al. “Cauchy: A Cost-Efficient LLM Serving System through Adaptive Heterogeneous De- ployment”.ACM SoCC 2025

2025
[47]

Cuckoo: Deadline-Aware Job Packing on Heterogeneous GPUs for DL Model Train- ing

Yuzhang Zhang et al. “Cuckoo: Deadline-Aware Job Packing on Heterogeneous GPUs for DL Model Train- ing”.ACM SoCC 2025

2025
[48]

arXiv:2512.17574

Lingxiao Zhao et al.Enabling Disaggregated Multi- Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing. arXiv:2512.17574. 2025

work page arXiv 2025
[49]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”.NeurIPS 2023

2023
[50]

SGLang: efficient execution of structured language model programs

Lianmin Zheng et al. “SGLang: efficient execution of structured language model programs”.NeurIPS 2024

2024
[51]

Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud

Qiannan Zhou et al. “Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud”.IEEE INFOCOM 2025. 11

2025

[1] [1]

Enabling scalable and adaptive ma- chine learning training via serverless computing on public cloud

Ahsan Ali et al. “Enabling scalable and adaptive ma- chine learning training via serverless computing on public cloud”.Performance Evaluation 2025

2025

[2] [2]

Alibaba Cloud documentation

Alibaba Cloud.Internet Access Performance. Alibaba Cloud documentation

[3] [3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai et al.Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

LongBench v2: Towards Deeper Un- derstanding and Reasoning on Realistic Long-context Multitasks

Yushi Bai et al. “LongBench v2: Towards Deeper Un- derstanding and Reasoning on Realistic Long-context Multitasks”.ACL 2025

2025

[5] [5]

AgentVerse: Facilitating Multi- Agent Collaboration and Exploring Emergent Behav- iors

Weize Chen et al. “AgentVerse: Facilitating Multi- Agent Collaboration and Exploring Emergent Behav- iors”.ICLR 2024

2024

[6] [6]

Enabling efficient batch serving for LMaaS via generation length prediction

Ke Cheng et al. “Enabling efficient batch serving for LMaaS via generation length prediction”.ICWS 2024

2024

[7] [7]

Mina: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learn- ing Service

Shichen Dong et al. “Mina: Fine-Grained In-network Aggregation Resource Scheduling for Machine Learn- ing Service”.IEEE INFOCOM 2025

2025

[8] [8]

Improving factuality and reasoning in language models through multiagent debate

Yilun Du et al. “Improving factuality and reasoning in language models through multiagent debate”.ICML 2024

2024

[9] [9]

MuxServe: flexible spatial- temporal multiplexing for multiple LLM serving

Jiangfei Duan et al. “MuxServe: flexible spatial- temporal multiplexing for multiple LLM serving”. ICML 2024. 10

2024

[10] [10]

Serving DNNs like Clock- work: Performance Predictability from the Bottom Up

Arpan Gujarati et al. “Serving DNNs like Clock- work: Performance Predictability from the Bottom Up”. USENIX OSDI 2020

2020

[11] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al.DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning. arXiv:2501.12948. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Large language model based multi- agents: a survey of progress and challenges

Taicheng Guo et al. “Large language model based multi- agents: a survey of progress and challenges”.IJCAI 2024

2024

[13] [13]

Measuring Mathematical Prob- lem Solving With the MATH Dataset

Dan Hendrycks et al. “Measuring Mathematical Prob- lem Solving With the MATH Dataset”.NeurIPS 2021

2021

[14] [14]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong et al. “MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework”.ICLR 2024

2024

[15] [15]

S3: Increasing GPU Utilization during Generative Inference for Higher Throughput

Yunho Jin et al. “S3: Increasing GPU Utilization during Generative Inference for Higher Throughput”.NeurIPS 2023

2023

[16] [16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon et al. “Efficient memory management for large language model serving with pagedattention”. SOSP 2023

2023

[17] [17]

Parrot: Efficient serving of LLM- based applications with semantic variable

Chaofan Lin et al. “Parrot: Efficient serving of LLM- based applications with semantic variable”.OSDI 2024

2024

[18] [18]

Efficient Serving of LLM Applications with Probabilistic Demand Modeling

Yifei Liu et al.Efficient Serving of LLM Applications with Probabilistic Demand Modeling. arXiv:2506.14851. 2025

work page internal anchor Pith review arXiv 2025

[19] [19]

Kale: Elastic GPU Scheduling for Online DL Model Training

Ziyang Liu et al. “Kale: Elastic GPU Scheduling for Online DL Model Training”.ACM SoCC 2024

2024

[20] [20]

Castor: Optimizing Deep Learning Job Scheduling in Multi-Tenant GPU Clusters via In- telligent Colocation

Yizhou Luo et al. “Castor: Optimizing Deep Learning Job Scheduling in Multi-Tenant GPU Clusters via In- telligent Colocation”

[21] [21]

Prediction-assisted online distributed deep learning workload scheduling in GPU clusters

Ziyue Luo et al. “Prediction-assisted online distributed deep learning workload scheduling in GPU clusters”. IEEE INFOCOM 2025

2025

[22] [22]

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Ziming Mao et al. “SkyServe: Serving AI Models across Regions and Clouds with Spot Instances”.EuroSys 2025

2025

[23] [23]

O-RAN Intelligence Orches- tration Framework for Quality-Driven xApp Deploy- ment and Sharing

Federico Mungari et al. “O-RAN Intelligence Orches- tration Framework for Quality-Driven xApp Deploy- ment and Sharing”.IEEE Transactions on Mobile Com- puting 2025

2025

[24] [24]

arXiv:2512.21487

Xinglin Pan et al.Efficient MoE Inference with Fine- Grained Scheduling of Disaggregated Expert Paral- lelism. arXiv:2512.21487. 2025

work page arXiv 2025

[25] [25]

Queue management for slo-oriented large language model serving

Archit Patke et al. “Queue management for slo-oriented large language model serving”.ACM SoCC 2024

2024

[26] [26]

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

Haoran Qiu et al. “Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction”. ASPLOS Cloud Intelligence/AIOps Workshop 2024

2024

[27] [27]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text”.EMNLP 2016

2016

[28] [28]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein et al.GPQA: A Graduate-Level Google- Proof Q&A Benchmark. arXiv:2311.12022. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

FlexGen: High-Throughput Genera- tive Inference of Large Language Models with a Single GPU

Ying Sheng et al. “FlexGen: High-Throughput Genera- tive Inference of Large Language Models with a Single GPU”.ICML 2023

2023

[30] [30]

Kaggle dataset

Pavan Subhash.IBM HR Analytics Employee Attrition & Performance. Kaggle dataset. 2017

2017

[31] [31]

arXiv:2504.03648

The AIBrix Team et al.AIBrix: Towards Scalable, Cost- Effective Large Language Model Inference Infrastruc- ture. arXiv:2504.03648. 2025

work page arXiv 2025

[32] [32]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al.LLaMA: Open and Efficient Foun- dation Language Models. arXiv:2302.13971. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

SELA: Smart Edge LLM Agent to Optimize Response Trade-offs of AI Assistants

Shreshth Tuli et al. “SELA: Smart Edge LLM Agent to Optimize Response Trade-offs of AI Assistants”.PACM IMWUT 2025

2025

[34] [34]

Mixture-of-Agents Enhances Large Language Model Capabilities

Junlin Wang et al.Mixture-of-agents enhances large language model capabilities. arXiv:2406.04692. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Lan- guage Models

Lei Wang et al. “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Lan- guage Models”.ACL 2023

2023

[36] [36]

KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale

Zeying Wang et al. “KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale”.IEEE/ACM ASE 2025

2025

[37] [37]

Fast Distributed Inference Serving for Large Language Models

Bingyang Wu et al.Fast Distributed Inference Serving for Large Language Models. arXiv:2305.05920. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu et al.AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Qwen3 Technical Report

An Yang et al.Qwen3 Technical Report. arXiv:2505.09388. 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

KAIR: A Statistical and Causal Ap- proach to Pinpointing Stragglers in Distributed Model Training

Yitang Yang et al. “KAIR: A Statistical and Causal Ap- proach to Pinpointing Stragglers in Distributed Model Training”.IEEE/ACM ASE 2025

2025

[41] [41]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao et al. “ReAct: Synergizing reasoning and acting in language models”.ICLR 2023

2023

[42] [42]

Orca: A Distributed Serving System for Transformer-Based Generative Models

Gyeong-In Yu et al. “Orca: A Distributed Serving System for Transformer-Based Generative Models”. USENIX OSDI 2022

2022

[43] [43]

Torpor: GPU-enabled serverless computing for low-latency, resource-efficient infer- ence

Minchen Yu et al. “Torpor: GPU-enabled serverless computing for low-latency, resource-efficient infer- ence”.USENIX ATC 2025

2025

[44] [44]

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Shan Yu et al. “Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving”. 2026

2026

[45] [45]

arXiv:2501.08197

Yijiong Yu et al.OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training. arXiv:2501.08197. 2025

work page arXiv 2025

[46] [46]

Cauchy: A Cost-Efficient LLM Serving System through Adaptive Heterogeneous De- ployment

Yihui Zhang et al. “Cauchy: A Cost-Efficient LLM Serving System through Adaptive Heterogeneous De- ployment”.ACM SoCC 2025

2025

[47] [47]

Cuckoo: Deadline-Aware Job Packing on Heterogeneous GPUs for DL Model Train- ing

Yuzhang Zhang et al. “Cuckoo: Deadline-Aware Job Packing on Heterogeneous GPUs for DL Model Train- ing”.ACM SoCC 2025

2025

[48] [48]

arXiv:2512.17574

Lingxiao Zhao et al.Enabling Disaggregated Multi- Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing. arXiv:2512.17574. 2025

work page arXiv 2025

[49] [49]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”.NeurIPS 2023

2023

[50] [50]

SGLang: efficient execution of structured language model programs

Lianmin Zheng et al. “SGLang: efficient execution of structured language model programs”.NeurIPS 2024

2024

[51] [51]

Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud

Qiannan Zhou et al. “Espresso: Cost-Efficient Large Model Training by Exploiting GPU Heterogeneity in the Cloud”.IEEE INFOCOM 2025. 11

2025