arxiv: 2605.00528 · v2 · pith:J2QAIDUVnew · submitted 2026-05-01 · cs.DC · cs.AI · cs.LG · cs.OS

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Pith reviewed 2026-07-01 07:56 UTC · model grok-4.3

classification cs.DC · cs.AI · cs.LG · cs.OS
keywords AI agent scheduling · workflow-aware inference · KV cache reuse · GPU cluster scheduling · compound AI serving · task completion time · session affinity

The pith

Scheduling entire AI agent workflows as atomic units on a GPU cluster reduces task completion time by 1.64x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current GPU schedulers handle each LLM call in an agent task independently and therefore discard large amounts of intermediate KV cache state between steps. The paper reports that this mismatch inflates end-to-end latency by 3-8x for workflows that contain tens to hundreds of chained calls. SAGA instead treats the full agent program as the schedulable unit and uses three mechanisms to recover the lost state and balance load. On a 64-GPU cluster running coding-agent (SWE-bench) and browser-agent (WebArena) benchmarks, the approach delivers a 1.64x reduction in task completion time, 1.22x higher GPU memory utilization, and 99.2% SLO attainment.

Core claim

Treating the entire agent workflow, rather than individual inference requests, as the first-class schedulable unit allows a distributed scheduler to predict and preserve KV cache reuse across tool-call boundaries. Three mechanisms realize this abstraction: Agent Execution Graphs that model workflow structure; session-affinity batching with work stealing for co-location and load balance; and Agent Fair Share, a task-completion-time fairness metric with bounded-deviation guarantees. On SWE-bench and WebArena workloads, the resulting system improves geometric-mean task completion time by 1.64x over vLLM with prefix caching, raises GPU memory utilization by 1.22x, and reaches 99.2 percent SLO attainment.
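As a rough illustration of the second mechanism, the sketch below pins each session to a "home" worker so its cached state stays local, and lets an idle worker steal from the most loaded queue. It is a minimal sketch under stated assumptions, not SAGA's implementation: the `Worker` and `AffinityRouter` names, the least-loaded placement rule, and the steal-from-tail rule are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One GPU worker with a local FIFO queue (hypothetical model)."""
    name: str
    queue: deque = field(default_factory=deque)


class AffinityRouter:
    """Keep a session on one worker; let idle workers steal pending work."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.home = {}  # session_id -> Worker

    def submit(self, session_id, request):
        # The first request of a session picks the least-loaded worker.
        # Later requests follow it so cached state can be reused locally.
        if session_id not in self.home:
            self.home[session_id] = min(self.workers, key=lambda w: len(w.queue))
        worker = self.home[session_id]
        worker.queue.append((session_id, request))
        return worker

    def next_for(self, worker):
        # Local work first, in arrival order.
        if worker.queue:
            return worker.queue.popleft()
        # Idle: steal the newest item from the most loaded worker.
        # Assumption: a real system would also weigh the cost of losing
        # cache locality for the stolen request.
        victim = max(self.workers, key=lambda w: len(w.queue))
        if victim is worker or not victim.queue:
            return None
        return victim.queue.pop()
```

Stealing from the tail leaves the victim's oldest requests in place, which is the usual work-stealing convention; whether SAGA does the same is not stated in the source.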

What carries the argument

Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries.
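To make the idea concrete, here is a minimal sketch of how a workflow graph could be used to guess whether an LLM step's KV cache will be needed again after intervening tool calls. The `Step` type, the `session` field, and the reuse rule (a later LLM step in the same session, reachable through tool steps only) are assumptions for illustration, not the paper's AEG definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One node of a hypothetical execution graph."""
    name: str
    kind: str          # "llm" or "tool"
    session: str = ""  # LLM steps in one session are assumed to share a prefix


def predicted_reuse(steps, edges):
    """Map each LLM step to True if its cache is predicted to be reused.

    Assumed rule: reuse is predicted when the next LLM step reachable
    through tool steps alone belongs to the same session.
    """
    by_name = {s.name: s for s in steps}
    succ = {s.name: [] for s in steps}
    for src, dst in edges:
        succ[src].append(dst)

    result = {}
    for s in steps:
        if s.kind != "llm":
            continue
        stack, seen, reused = list(succ[s.name]), set(), False
        while stack and not reused:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            node = by_name[name]
            if node.kind == "llm":
                reused = node.session == s.session
            else:
                stack.extend(succ[name])  # look past the tool call
        result[s.name] = reused
    return result


steps = [
    Step("plan", "llm", "s1"),
    Step("run_tests", "tool"),
    Step("fix", "llm", "s1"),
    Step("summarize", "llm", "s2"),
]
edges = [("plan", "run_tests"), ("run_tests", "fix"), ("fix", "summarize")]
print(predicted_reuse(steps, edges))
```

In this toy graph only `plan` is predicted to have its cache reused, because `fix` continues the same session once the tool call returns.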

If this is right

  • Workflow-level scheduling achieves cache reuse within 1.31x of Bélády's optimal offline policy.
  • The same mechanisms raise GPU memory utilization by 1.22x while preserving 99.2 percent SLO attainment under multi-tenant load.
  • Interactive compound-AI deployments accept a roughly 30 percent drop in peak throughput in exchange for the latency improvement.
  • Session-affinity batching with work stealing maintains global load balance without sacrificing per-workflow cache locality.
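For readers unfamiliar with the offline baseline in the first bullet, the sketch below counts cache misses on a small access trace under LRU and under Bélády's policy (evict the entry whose next use is farthest in the future). Using miss counts as the comparison metric and session IDs as cache keys are assumptions; the source does not say which quantity the 1.31x figure is measured on.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for an LRU cache holding `capacity` entries (capacity >= 1)."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


def belady_misses(trace, capacity):
    """Count misses for Belady's offline-optimal policy (capacity >= 1)."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of its next use
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
        cache[key] = next_use[i]
    return misses


trace = ["s1", "s2", "s3", "s1", "s2", "s3"]
print(lru_misses(trace, 2), belady_misses(trace, 2))  # 6 4
```

On this cyclic trace LRU misses every access while the offline policy misses four times, a 1.5x gap in miss count.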

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same program-level abstraction could be applied to other stateful inference pipelines that cross model or tool boundaries.
  • Existing per-request fairness metrics may systematically undervalue long-running agent tasks; replacing them with task-completion metrics changes which jobs receive priority.
  • Production clusters serving many short interactive agents would see different utilization curves than the latency-sensitive workloads studied here.
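One way to picture a task-completion fairness view is to compute each tenant's slowdown (observed task time divided by the time the task would take alone) and see how far tenants deviate from the average. This is only an illustrative sketch: the slowdown definition and the `max_deviation` measure are assumptions, and they are not the paper's Agent Fair Share metric.

```python
def slowdowns(tasks):
    """tasks: iterable of (tenant, observed_seconds, isolated_seconds).

    Returns the mean task slowdown per tenant.
    """
    per_tenant = {}
    for tenant, observed, isolated in tasks:
        per_tenant.setdefault(tenant, []).append(observed / isolated)
    return {t: sum(v) / len(v) for t, v in per_tenant.items()}


def max_deviation(per_tenant_slowdown):
    """Largest absolute gap between a tenant's slowdown and the mean."""
    values = list(per_tenant_slowdown.values())
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


tasks = [("a", 30.0, 10.0), ("a", 10.0, 10.0), ("b", 40.0, 10.0)]
per_tenant = slowdowns(tasks)
print(per_tenant, max_deviation(per_tenant))
```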

Load-bearing premise

The graphs built from the evaluated workflows correctly forecast which KV cache entries will be reused when tool calls cross execution boundaries.

What would settle it

Measure end-to-end task latency on a new set of agent workflows whose cache-reuse patterns deviate sharply from the graphs used in the 64-GPU experiments and check whether the 1.64x gain disappears.
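Such a check reduces to recomputing the geometric-mean speedup on the new workflows. A minimal sketch follows, assuming paired per-task completion times for the baseline and the candidate scheduler; the function name and input layout are hypothetical.

```python
import math


def geometric_mean_speedup(baseline_s, candidate_s):
    """Geometric mean of per-task ratios baseline / candidate.

    Both inputs are equal-length sequences of positive completion times
    in seconds, paired by task.
    """
    if len(baseline_s) != len(candidate_s) or not baseline_s:
        raise ValueError("need two non-empty sequences of equal length")
    log_sum = sum(math.log(b / c) for b, c in zip(baseline_s, candidate_s))
    return math.exp(log_sum / len(baseline_s))


# Two tasks: one four times faster, one unchanged.
print(geometric_mean_speedup([8.0, 2.0], [2.0, 2.0]))
```

A result near 1.0 on the new workflows would indicate that the gain does not carry over.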

Figures

Figures reproduced from arXiv: 2605.00528 by Dongxin Guo, Jikun Wu, Siu Ming Yiu.

Figure 1. Inefficiencies in serving agent workloads with …
Figure 1 (caption fragment). … cache is evicted during tool calls and must be regenerated, …
Figure 3. Concrete AEG for a SWE-bench coding agent. Nodes …
Figure 2. SAGA architecture. Layer 1 captures work…
original abstract

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SAGA, a distributed scheduler for AI agent inference that shifts from request-level to program-level (workflow-atomic) scheduling. It introduces Agent Execution Graphs to model workflow structure and predict KV cache reuse across tool-call boundaries (claimed within 1.31x of Bélády optimality), session-affinity batching with work stealing for co-location while preserving load balance, and Agent Fair Share as a task-completion-time fairness metric with bounded-deviation guarantees. On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, SAGA reports a 1.64x geometric-mean reduction in task completion time (p < 0.001) versus vLLM v0.15.1 with prefix caching and affinity routing, plus 1.22x higher GPU memory utilization and 99.2% SLO attainment under interference, at the quantified cost of ~30% lower peak throughput.

Significance. If the central empirical claims hold, the work would be significant for compound-AI serving by providing concrete evidence that workflow-aware scheduling can substantially reduce end-to-end latency for chained LLM calls while explicitly documenting the latency-throughput tradeoff. The use of real agent workloads (SWE-bench, WebArena) and statistical reporting (p < 0.001) strengthen the result relative to purely synthetic evaluations.

major comments (2)
  1. [Abstract] Abstract and evaluation sections: the 1.64x task-completion-time claim rests on Agent Execution Graphs accurately forecasting KV cache reuse across tool-call boundaries so that affinity batching and work-stealing can exploit it, yet the manuscript supplies no trace-level comparison of predicted versus observed KV hit rates on the same 64-GPU runs that produced the 1.64x number.
  2. [Evaluation] Evaluation: the claim that SWE-bench and WebArena tasks are representative of production compound-AI usage patterns that benefit from program-level scheduling is load-bearing for generalizability, but no workload characterization, sensitivity analysis, or comparison to additional agent traces is provided to support this assumption.
minor comments (1)
  1. Notation for Agent Fair Share and its bounded-deviation proof should be cross-referenced to the exact theorem statement for clarity.
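The trace-level comparison requested in the first major comment could be summarized with standard classification metrics over cache blocks. The sketch below assumes each block has a predicted and an observed reuse flag; the data layout and function name are hypothetical, and no such analysis appears in the source.

```python
def prediction_accuracy(predicted, observed):
    """Compare predicted vs. observed KV reuse per cache block.

    Both arguments map block_id -> bool (True means "reused").
    Returns precision, recall, and accuracy of the predictions.
    """
    tp = fp = fn = tn = 0
    for block, pred in predicted.items():
        obs = observed[block]
        if pred and obs:
            tp += 1
        elif pred and not obs:
            fp += 1  # kept a block that was never needed again
        elif not pred and obs:
            fn += 1  # evicted a block that had to be recomputed
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


predicted = {"b1": True, "b2": True, "b3": False, "b4": False}
observed = {"b1": True, "b2": False, "b3": True, "b4": False}
print(prediction_accuracy(predicted, observed))
```

Low recall would point to recomputation after tool calls, while low precision would point to memory held for state that is never reused.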

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional empirical validation would strengthen the manuscript. We respond to each major comment below and commit to revisions that directly address the concerns raised.

point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation sections: the 1.64x task-completion-time claim rests on Agent Execution Graphs accurately forecasting KV cache reuse across tool-call boundaries so that affinity batching and work-stealing can exploit it, yet the manuscript supplies no trace-level comparison of predicted versus observed KV hit rates on the same 64-GPU runs that produced the 1.64x number.

    Authors: We acknowledge this gap. The 1.31x optimality result for Agent Execution Graphs is based on offline trace analysis, but the manuscript does not present a direct predicted-versus-observed KV hit-rate comparison drawn from the identical 64-GPU production runs that produced the 1.64x task-completion-time figure. This weakens the causal linkage between prediction accuracy and the reported gains. In the revised manuscript we will add a dedicated subsection (and accompanying figure) that reports exactly this comparison using traces collected during the cluster experiments, including per-workload hit-rate accuracy, any observed deviations, and their effect on affinity decisions. revision: yes

  2. Referee: [Evaluation] Evaluation: the claim that SWE-bench and WebArena tasks are representative of production compound-AI usage patterns that benefit from program-level scheduling is load-bearing for generalizability, but no workload characterization, sensitivity analysis, or comparison to additional agent traces is provided to support this assumption.

    Authors: We agree that the representativeness claim would be more robust with explicit supporting analysis. The current manuscript relies on the established status of these benchmarks in the agent literature without providing workload statistics or sensitivity results. We will revise the evaluation section to include (1) a workload-characterization subsection reporting metrics such as workflow length distributions, tool-call frequencies, and KV-reuse opportunities, (2) a sensitivity analysis varying workflow complexity and reuse potential, and (3) brief comparisons against other publicly available agent traces to the extent they are accessible. These additions will be placed in the main evaluation or a new appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims are direct empirical measurements

full rationale

The paper reports measured improvements (1.64x task completion time, 1.22x memory utilization, 99.2% SLO attainment) on concrete 64-GPU runs with SWE-bench and WebArena workloads against an external vLLM baseline. The Agent Execution Graphs are described as coming within 1.31x of Bélády's optimal policy by capturing workflow structure, but this is presented as an observed outcome rather than a quantity defined by fitting or self-reference. The fairness metric is stated to have provable guarantees, yet no load-bearing derivation reduces to a self-citation chain, a fitted input renamed as a prediction, or an ansatz smuggled in via prior work. All central results remain falsifiable against external traces and baselines without internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the assumption that agent workflows exhibit predictable cross-call KV cache reuse patterns that can be captured by execution graphs, and that the chosen benchmarks represent typical interactive agent workloads. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption: Agent workflows have sufficient structure to allow accurate offline prediction of KV cache reuse across tool calls via execution graphs.
    This premise underpins the claim that the scheduler can achieve within 1.31x of Bélády's optimal policy.
invented entities (2)
  • Agent Execution Graphs (no independent evidence)
    purpose: Capture workflow structure to predict KV cache reuse across tool-call boundaries
    New data structure introduced to enable program-level scheduling decisions.
  • Agent Fair Share (no independent evidence)
    purpose: Task-completion-time fairness metric with provable bounded-deviation guarantees
    New fairness definition tailored to workflow completion rather than per-request metrics.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

    cs.OS · 2026-05 · unverdicted · novelty 6.0

    MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.
