SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Pith reviewed 2026-05-09 19:10 UTC · model grok-4.3
The pith
Scheduling AI agent workflows as single units reduces task completion time by 1.64x on GPU clusters
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that shifting from request-level to workflow-atomic scheduling for AI agent inference lets the system predict KV cache reuse across tool calls using Agent Execution Graphs, co-locate correlated requests via affinity batching and work stealing, and share the cluster fairly via Agent Fair Share; together these reduce task completion time by 1.64x (geometric mean) over a strong request-level vLLM baseline on real workloads.
What carries the argument
Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, together with session-affinity batching and the Agent Fair Share fairness metric.
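The page contains no code, but the mechanism is concrete enough to sketch. Below is one minimal way an Agent Execution Graph could be represented, with workflow steps as nodes and a conservative prediction of cross-step KV reuse; all names (AEGNode, expected_reuse_tokens) and the shared-prefix rule are illustrative assumptions, not SAGA's actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Agent Execution Graph (AEG); node and field
# names are illustrative, not SAGA's actual data structures.

@dataclass
class AEGNode:
    node_id: str
    kind: str                                   # "llm_call" or "tool_call"
    prompt_tokens: int = 0                      # expected prompt length
    children: list["AEGNode"] = field(default_factory=list)

def expected_reuse_tokens(parent: AEGNode, child: AEGNode) -> int:
    """Predict how many of the parent's KV-cache tokens a child LLM call
    reuses. Conservative rule: at most the shared prompt prefix; tool
    calls carry no KV state of their own."""
    if parent.kind != "llm_call" or child.kind != "llm_call":
        return 0
    return min(parent.prompt_tokens, child.prompt_tokens)

# Example: the plan -> edit chain of a coding agent.
plan = AEGNode("plan", "llm_call", prompt_tokens=4096)
edit = AEGNode("edit", "llm_call", prompt_tokens=6144)
plan.children.append(edit)
print(expected_reuse_tokens(plan, edit))  # 4096 tokens predicted reusable
```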
If this is right
- Preserving KV cache state across chained calls within a workflow reduces overall task latency by 3-8x compared to discarding it (a back-of-envelope illustration follows this list).
- Co-locating correlated requests improves GPU memory utilization by 1.22x.
- Task-completion-time fairness ensures bounded deviation from ideal shares even in multi-tenant settings.
- 99.2% SLO attainment is achieved under interference while maintaining load balance through work stealing.
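To see how the first bullet's 3-8x range can arise, here is a back-of-envelope latency model; the per-token prefill cost, decode time, and step count are assumptions chosen for illustration, not measurements from the paper.

```python
# Back-of-envelope latency model for one chained agent workflow. All
# constants below are assumptions for illustration, not paper data.

STEPS = 20                    # chained LLM calls per task
CONTEXT_TOKENS = 8000         # shared prefix carried across steps
NEW_TOKENS = 500              # fresh tokens appended per step
PREFILL_MS_PER_TOKEN = 0.05   # assumed prefill cost
DECODE_MS = 100               # assumed decode time per step

def step_latency(recompute_prefix: bool) -> float:
    prefix_cost = CONTEXT_TOKENS * PREFILL_MS_PER_TOKEN if recompute_prefix else 0.0
    return prefix_cost + NEW_TOKENS * PREFILL_MS_PER_TOKEN + DECODE_MS

discard = STEPS * step_latency(True)    # KV cache dropped between steps
reuse = STEPS * step_latency(False)     # KV cache preserved within workflow
print(f"discard: {discard/1000:.1f}s  reuse: {reuse/1000:.1f}s  "
      f"ratio: {discard/reuse:.1f}x")   # ~4.2x, inside the 3-8x range
```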
Where Pith is reading between the lines
- This method may generalize to other multi-step AI pipelines where state reuse across steps is valuable.
- If agent workflows include more unpredictable elements, online graph updates could be needed to maintain accuracy.
- The latency-throughput tradeoff highlights a design space for schedulers balancing interactive use versus maximum efficiency.
Load-bearing premise
Agent workflows have enough predictable structure that execution graphs can accurately forecast which KV cache entries will be reused across tool-call boundaries.
What would settle it
Measure the actual KV cache reuse rates and task completion times on a set of agent tasks with highly variable or unpredictable tool call sequences; if the reuse predictions are poor and the 1.64x improvement does not appear, the scheduling benefit does not hold.
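A minimal harness for this test might look like the sketch below; the paired-run layout and the t-test on log-ratios are one plausible protocol, not the paper's reported methodology.

```python
import math
from statistics import mean

from scipy import stats  # assumed available, used only for the significance test

def geomean_speedup(baseline_s: list[float], saga_s: list[float]) -> float:
    """Geometric-mean task-completion-time speedup over paired runs."""
    log_ratios = [math.log(b / s) for b, s in zip(baseline_s, saga_s)]
    return math.exp(mean(log_ratios))

def reuse_rate(predicted: set[str], actually_reused: set[str]) -> float:
    """Fraction of KV blocks predicted reusable that were in fact reused."""
    return len(predicted & actually_reused) / max(len(predicted), 1)

# Illustrative paired timings (seconds per task); real runs go here.
baseline = [120.0, 95.0, 240.0, 60.0]
saga = [70.0, 61.0, 150.0, 38.0]
log_ratios = [math.log(b / s) for b, s in zip(baseline, saga)]
t, p = stats.ttest_1samp(log_ratios, 0.0)  # H0: no speedup
print(f"geomean speedup {geomean_speedup(baseline, saga):.2f}x, p = {p:.3g}")
```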
original abstract
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
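Mechanism (2) combines two standard ideas, affinity routing and work stealing. The sketch below shows the shape such a policy could take, stealing whole workflows so KV co-location survives rebalancing; the queues, hashing rule, and steal policy are assumptions for illustration, not SAGA's implementation.

```python
from collections import deque

# Sketch of session-affinity routing with workflow-granularity work
# stealing; all structures here are illustrative assumptions.

N_WORKERS = 4
queues = [deque() for _ in range(N_WORKERS)]

def route(session_id: str, request: object) -> None:
    """Affinity: every request of a workflow lands on one worker, so the
    workflow's KV cache stays resident on that worker's GPUs."""
    queues[hash(session_id) % N_WORKERS].append((session_id, request))

def steal(idle_worker: int) -> str | None:
    """An idle worker steals one whole workflow (every queued request of
    a session) from the most loaded queue, preserving co-location."""
    victim = max(range(N_WORKERS), key=lambda w: len(queues[w]))
    if victim == idle_worker or not queues[victim]:
        return None
    session_id = queues[victim][0][0]
    moved = [item for item in queues[victim] if item[0] == session_id]
    for item in moved:
        queues[victim].remove(item)
        queues[idle_worker].append(item)
    return session_id
```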
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SAGA, a distributed GPU-cluster scheduler for compound AI agent workloads. It argues that request-level scheduling is mismatched to workflows consisting of tens to hundreds of chained LLM calls and introduces workflow-atomic scheduling via three mechanisms: (1) Agent Execution Graphs that predict KV-cache reuse across tool-call boundaries via static analysis or trace-based profiling (claimed to reach within 1.31× of Bélády optimality), (2) session-affinity batching augmented by work stealing to co-locate correlated requests while preserving load balance, and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, the paper reports a 1.64× geometric-mean reduction in task completion time (p < 0.001) versus vLLM v0.15.1 with prefix caching and affinity routing, together with 1.22× higher GPU memory utilization and 99.2% SLO attainment under multi-tenant interference, at the cost of approximately 30% lower peak throughput.
Significance. If the reported performance numbers and the underlying KV-reuse predictions prove reproducible, the work would establish that program-level rather than request-level scheduling can materially improve end-to-end latency for latency-sensitive agent deployments while still providing fairness guarantees. The explicit quantification of the throughput-latency trade-off and the provision of bounded-deviation fairness are positive features that could influence the design of future serving systems for compound AI.
major comments (2)
- [Abstract] The central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper's own equations, the omitted methodology is load-bearing for the soundness of the primary result.
- [Abstract] (and the section on Agent Execution Graphs, if present) The performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents, yet no description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability. If prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
minor comments (2)
- [Abstract] The phrase "Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees" is introduced without even a one-sentence definition or a reference to the theorem that establishes the bound (a generic formalization is sketched after this list).
- [Abstract] The 30% throughput reduction is stated as a quantified cost but is not accompanied by absolute throughput numbers or the precise operating point (batch size, SLO target) at which the comparison was made.
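For orientation on the Agent Fair Share comment, a bounded-deviation guarantee of the kind the abstract gestures at typically has the following shape; this is a generic formalization, not the paper's definition or theorem.

```latex
% Generic shape of a bounded-deviation fairness guarantee (assumed form,
% not the paper's theorem). $T_i$ is tenant $i$'s achieved task completion
% time; $T_i^{\mathrm{fair}}$ its completion time under an ideal fair share.
\[
  \bigl| T_i - T_i^{\mathrm{fair}} \bigr| \;\le\; \varepsilon \, T_i^{\mathrm{fair}}
  \qquad \text{for every tenant } i,
\]
% with $\varepsilon$ bounded in terms of workload and scheduler parameters.
```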
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the full manuscript and indicating revisions made to improve the presentation of our results.
point-by-point responses
- Referee: [Abstract] The central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper's own equations, the omitted methodology is load-bearing for the soundness of the primary result.
Authors: We agree that the abstract's brevity leaves the primary claims without sufficient supporting context on methodology. The full manuscript (Section 5) provides the workload characterization (SWE-bench coding agents and WebArena browser tasks), cluster details (64 GPUs), number of runs (repeated independent executions for geometric mean and p-value computation), and confound controls (e.g., network and utilization variations). To make the abstract more self-contained while respecting length limits, we have revised it to include a concise methodology summary: 'evaluated on a 64-GPU cluster with SWE-bench and WebArena workloads over repeated runs with statistical significance testing (p < 0.001)'. This directly addresses the soundness concern for the reported measurements. revision: yes
- Referee: [Abstract] (and the section on Agent Execution Graphs, if present) The performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents, yet no description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability. If prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
Authors: The manuscript's Section 3 already describes Agent Execution Graph construction as a hybrid of static analysis of agent workflow code and limited trace-based profiling for KV-cache reuse across tool-call boundaries. Branching and state changes are handled by modeling likely execution paths from traces with conservative reuse estimates. We acknowledge the abstract omits these details and the need for explicit accuracy validation. In the revision, we have expanded Section 3 with empirical accuracy measurements under runtime variability for the evaluated agents (confirming the 1.31× Bélády factor holds in practice) and added a brief reference in the abstract. If accuracy degraded substantially below this level, gains would indeed reduce to request-level scheduling, but the added measurements demonstrate robustness for the reported workloads. revision: partial
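The rebuttal's description (hybrid static-plus-trace construction, with conservative reuse estimates at branches) can be made concrete with a small sketch; the frequency-based branch probabilities and the 0.8 confidence threshold below are assumptions about the approach, not the paper's algorithm.

```python
from collections import defaultdict

# Illustrative trace-based estimate of branch probabilities plus a
# conservative KV-reuse forecast; the frequency counts and the 0.8
# threshold are assumptions, not the paper's algorithm.

def branch_probs(traces: list[list[str]]) -> dict[tuple[str, str], float]:
    """Estimate P(next step | current step) from observed step sequences."""
    counts: dict[tuple[str, str], int] = defaultdict(int)
    outgoing: dict[str, int] = defaultdict(int)
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            counts[(cur, nxt)] += 1
            outgoing[cur] += 1
    return {edge: c / outgoing[edge[0]] for edge, c in counts.items()}

def conservative_reuse(prob: float, reusable_tokens: int,
                       threshold: float = 0.8) -> int:
    """Count only reuse the scheduler can rely on: below the confidence
    threshold, assume zero so a misprediction cannot hurt placement."""
    return reusable_tokens if prob >= threshold else 0

traces = [["plan", "edit", "test"], ["plan", "edit", "test"],
          ["plan", "search", "edit"]]
probs = branch_probs(traces)
print(f"P(edit | plan) = {probs[('plan', 'edit')]:.2f}")   # 0.67
print(conservative_reuse(probs[("plan", "edit")], 4096))   # 0: below 0.8
```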
Circularity Check
No circularity; key results are direct empirical measurements
full rationale
The paper presents its central claims—the 1.64x task-completion-time reduction (geometric mean), 1.22x memory utilization improvement, 99.2% SLO attainment, and Agent Execution Graphs achieving within 1.31x of Bélády optimality—as outcomes of experiments on SWE-bench and WebArena benchmarks. These are not derived quantities obtained by fitting parameters inside the paper's own equations or by renaming inputs as predictions. The fairness metric is described as having 'provable bounded-deviation guarantees,' but no self-referential reduction or self-citation chain is exhibited in the provided text that would make the guarantees equivalent to the inputs by construction. The derivation chain remains self-contained against external benchmarks and does not match any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AI agent workflows possess sufficient structure to allow construction of execution graphs that predict KV cache reuse across tool-call boundaries
invented entities (2)
- Agent Execution Graphs: no independent evidence
- Agent Fair Share: no independent evidence