SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Pith reviewed 2026-05-09 19:10 UTC · model grok-4.3
The pith
Scheduling AI agent workflows as single units reduces task completion time by 1.64x on GPU clusters
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that shifting from request-level to workflow-atomic scheduling for AI agent inference lets the system predict KV cache reuse across tool calls using Agent Execution Graphs, co-locate correlated requests via affinity batching and work stealing, and share the cluster fairly via Agent Fair Share; together these reduce task completion time by 1.64x (geometric mean) over a strong request-level vLLM baseline on real workloads.
What carries the argument
Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, together with session-affinity batching and the Agent Fair Share fairness metric.
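The page contains no code, but the mechanism is concrete enough to sketch. Below is one minimal way an Agent Execution Graph could be represented, with workflow steps as nodes and a conservative prediction of cross-step KV reuse; all names (AEGNode, expected_reuse_tokens) and the shared-prefix rule are illustrative assumptions, not SAGA's actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an Agent Execution Graph (AEG); node and field
# names are illustrative, not SAGA's actual data structures.

@dataclass
class AEGNode:
    node_id: str
    kind: str                                   # "llm_call" or "tool_call"
    prompt_tokens: int = 0                      # expected prompt length
    children: list["AEGNode"] = field(default_factory=list)

def expected_reuse_tokens(parent: AEGNode, child: AEGNode) -> int:
    """Predict how many of the parent's KV-cache tokens a child LLM call
    reuses. Conservative rule: at most the shared prompt prefix; tool
    calls carry no KV state of their own."""
    if parent.kind != "llm_call" or child.kind != "llm_call":
        return 0
    return min(parent.prompt_tokens, child.prompt_tokens)

# Example: the plan -> edit chain of a coding agent.
plan = AEGNode("plan", "llm_call", prompt_tokens=4096)
edit = AEGNode("edit", "llm_call", prompt_tokens=6144)
plan.children.append(edit)
print(expected_reuse_tokens(plan, edit))  # 4096 tokens predicted reusable
```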
If this is right
- Preserving KV cache state across chained calls within a workflow reduces overall task latency by 3-8x compared to discarding it (a back-of-envelope illustration follows this list).
- Co-locating correlated requests improves GPU memory utilization by 1.22x.
- Task-completion-time fairness ensures bounded deviation from ideal shares even in multi-tenant settings.
- 99.2% SLO attainment is achieved under interference while maintaining load balance through work stealing.
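To see how the first bullet's 3-8x range can arise, here is a back-of-envelope latency model; the per-token prefill cost, decode time, and step count are assumptions chosen for illustration, not measurements from the paper.

```python
# Back-of-envelope latency model for one chained agent workflow. All
# constants below are assumptions for illustration, not paper data.

STEPS = 20                    # chained LLM calls per task
CONTEXT_TOKENS = 8000         # shared prefix carried across steps
NEW_TOKENS = 500              # fresh tokens appended per step
PREFILL_MS_PER_TOKEN = 0.05   # assumed prefill cost
DECODE_MS = 100               # assumed decode time per step

def step_latency(recompute_prefix: bool) -> float:
    prefix_cost = CONTEXT_TOKENS * PREFILL_MS_PER_TOKEN if recompute_prefix else 0.0
    return prefix_cost + NEW_TOKENS * PREFILL_MS_PER_TOKEN + DECODE_MS

discard = STEPS * step_latency(True)    # KV cache dropped between steps
reuse = STEPS * step_latency(False)     # KV cache preserved within workflow
print(f"discard: {discard/1000:.1f}s  reuse: {reuse/1000:.1f}s  "
      f"ratio: {discard/reuse:.1f}x")   # ~4.2x, inside the 3-8x range
```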
Where Pith is reading between the lines
- This method may generalize to other multi-step AI pipelines where state reuse across steps is valuable.
- If agent workflows include more unpredictable elements, online graph updates could be needed to maintain accuracy.
- The latency-throughput tradeoff highlights a design space for schedulers balancing interactive use versus maximum efficiency.
Load-bearing premise
Agent workflows have enough predictable structure that execution graphs can accurately forecast which KV cache entries will be reused across tool-call boundaries.
What would settle it
Measure the actual KV cache reuse rates and task completion times on a set of agent tasks with highly variable or unpredictable tool call sequences; if the reuse predictions are poor and the 1.64x improvement does not appear, the scheduling benefit does not hold.
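A minimal harness for this test might look like the sketch below; the paired-run layout and the t-test on log-ratios are one plausible protocol, not the paper's reported methodology.

```python
import math
from statistics import mean

from scipy import stats  # assumed available, used only for the significance test

def geomean_speedup(baseline_s: list[float], saga_s: list[float]) -> float:
    """Geometric-mean task-completion-time speedup over paired runs."""
    log_ratios = [math.log(b / s) for b, s in zip(baseline_s, saga_s)]
    return math.exp(mean(log_ratios))

def reuse_rate(predicted: set[str], actually_reused: set[str]) -> float:
    """Fraction of KV blocks predicted reusable that were in fact reused."""
    return len(predicted & actually_reused) / max(len(predicted), 1)

# Illustrative paired timings (seconds per task); real runs go here.
baseline = [120.0, 95.0, 240.0, 60.0]
saga = [70.0, 61.0, 150.0, 38.0]
log_ratios = [math.log(b / s) for b, s in zip(baseline, saga)]
t, p = stats.ttest_1samp(log_ratios, 0.0)  # H0: no speedup
print(f"geomean speedup {geomean_speedup(baseline, saga):.2f}x, p = {p:.3g}")
```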
original abstract
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of Bélády's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
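Mechanism (2) combines two standard ideas, affinity routing and work stealing. The sketch below shows the shape such a policy could take, stealing whole workflows so KV co-location survives rebalancing; the queues, hashing rule, and steal policy are assumptions for illustration, not SAGA's implementation.

```python
from collections import deque

# Sketch of session-affinity routing with workflow-granularity work
# stealing; all structures here are illustrative assumptions.

N_WORKERS = 4
queues = [deque() for _ in range(N_WORKERS)]

def route(session_id: str, request: object) -> None:
    """Affinity: every request of a workflow lands on one worker, so the
    workflow's KV cache stays resident on that worker's GPUs."""
    queues[hash(session_id) % N_WORKERS].append((session_id, request))

def steal(idle_worker: int) -> str | None:
    """An idle worker steals one whole workflow (every queued request of
    a session) from the most loaded queue, preserving co-location."""
    victim = max(range(N_WORKERS), key=lambda w: len(queues[w]))
    if victim == idle_worker or not queues[victim]:
        return None
    session_id = queues[victim][0][0]
    moved = [item for item in queues[victim] if item[0] == session_id]
    for item in moved:
        queues[victim].remove(item)
        queues[idle_worker].append(item)
    return session_id
```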
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SAGA, a distributed GPU-cluster scheduler for compound AI agent workloads. It argues that request-level scheduling is mismatched to workflows consisting of tens to hundreds of chained LLM calls and introduces workflow-atomic scheduling via three mechanisms: (1) Agent Execution Graphs that predict KV-cache reuse across tool-call boundaries via static analysis or trace-based profiling (claimed to reach within 1.31× of Bélády optimality), (2) session-affinity batching augmented by work stealing to co-locate correlated requests while preserving load balance, and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, the paper reports a 1.64× geometric-mean reduction in task completion time (p < 0.001) versus vLLM v0.15.1 with prefix caching and affinity routing, together with 1.22× higher GPU memory utilization and 99.2% SLO attainment under multi-tenant interference, at the cost of approximately 30% lower peak throughput.
Significance. If the reported performance numbers and the underlying KV-reuse predictions prove reproducible, the work would establish that program-level rather than request-level scheduling can materially improve end-to-end latency for latency-sensitive agent deployments while still providing fairness guarantees. The explicit quantification of the throughput-latency trade-off and the provision of bounded-deviation fairness are positive features that could influence the design of future serving systems for compound AI.
major comments (2)
- [Abstract] The central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper's own equations, the omitted methodology is load-bearing for the soundness of the primary result.
- [Abstract] (and the section on Agent Execution Graphs, if present) The performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents, yet no description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability. If prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
minor comments (2)
- [Abstract] The phrase "Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees" is introduced without even a one-sentence definition or a reference to the theorem that establishes the bound (a generic formalization is sketched after this list).
- [Abstract] The 30% throughput reduction is stated as a quantified cost but is not accompanied by absolute throughput numbers or the precise operating point (batch size, SLO target) at which the comparison was made.
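For orientation on the Agent Fair Share comment, a bounded-deviation guarantee of the kind the abstract gestures at typically has the following shape; this is a generic formalization, not the paper's definition or theorem.

```latex
% Generic shape of a bounded-deviation fairness guarantee (assumed form,
% not the paper's theorem). $T_i$ is tenant $i$'s achieved task completion
% time; $T_i^{\mathrm{fair}}$ its completion time under an ideal fair share.
\[
  \bigl| T_i - T_i^{\mathrm{fair}} \bigr| \;\le\; \varepsilon \, T_i^{\mathrm{fair}}
  \qquad \text{for every tenant } i,
\]
% with $\varepsilon$ bounded in terms of workload and scheduler parameters.
```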
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the full manuscript and indicating revisions made to improve the presentation of our results.
point-by-point responses
- Referee: [Abstract] The central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper's own equations, the omitted methodology is load-bearing for the soundness of the primary result.
Authors: We agree that the abstract's brevity leaves the primary claims without sufficient supporting context on methodology. The full manuscript (Section 5) provides the workload characterization (SWE-bench coding agents and WebArena browser tasks), cluster details (64 GPUs), number of runs (repeated independent executions for geometric mean and p-value computation), and confound controls (e.g., network and utilization variations). To make the abstract more self-contained while respecting length limits, we have revised it to include a concise methodology summary: 'evaluated on a 64-GPU cluster with SWE-bench and WebArena workloads over repeated runs with statistical significance testing (p < 0.001)'. This directly addresses the soundness concern for the reported measurements. revision: yes
- Referee: [Abstract] (and the section on Agent Execution Graphs, if present) The performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents, yet no description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability. If prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
Authors: The manuscript's Section 3 already describes Agent Execution Graph construction as a hybrid of static analysis of agent workflow code and limited trace-based profiling for KV-cache reuse across tool-call boundaries. Branching and state changes are handled by modeling likely execution paths from traces with conservative reuse estimates. We acknowledge the abstract omits these details and the need for explicit accuracy validation. In the revision, we have expanded Section 3 with empirical accuracy measurements under runtime variability for the evaluated agents (confirming the 1.31× Bélády factor holds in practice) and added a brief reference in the abstract. If accuracy degraded substantially below this level, gains would indeed reduce to request-level scheduling, but the added measurements demonstrate robustness for the reported workloads. revision: partial
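The rebuttal's description (hybrid static-plus-trace construction, with conservative reuse estimates at branches) can be made concrete with a small sketch; the frequency-based branch probabilities and the 0.8 confidence threshold below are assumptions about the approach, not the paper's algorithm.

```python
from collections import defaultdict

# Illustrative trace-based estimate of branch probabilities plus a
# conservative KV-reuse forecast; the frequency counts and the 0.8
# threshold are assumptions, not the paper's algorithm.

def branch_probs(traces: list[list[str]]) -> dict[tuple[str, str], float]:
    """Estimate P(next step | current step) from observed step sequences."""
    counts: dict[tuple[str, str], int] = defaultdict(int)
    outgoing: dict[str, int] = defaultdict(int)
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            counts[(cur, nxt)] += 1
            outgoing[cur] += 1
    return {edge: c / outgoing[edge[0]] for edge, c in counts.items()}

def conservative_reuse(prob: float, reusable_tokens: int,
                       threshold: float = 0.8) -> int:
    """Count only reuse the scheduler can rely on: below the confidence
    threshold, assume zero so a misprediction cannot hurt placement."""
    return reusable_tokens if prob >= threshold else 0

traces = [["plan", "edit", "test"], ["plan", "edit", "test"],
          ["plan", "search", "edit"]]
probs = branch_probs(traces)
print(f"P(edit | plan) = {probs[('plan', 'edit')]:.2f}")   # 0.67
print(conservative_reuse(probs[("plan", "edit")], 4096))   # 0: below 0.8
```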
Circularity Check
No circularity; key results are direct empirical measurements
full rationale
The paper presents its central claims—the 1.64x task-completion-time reduction (geometric mean), 1.22x memory utilization improvement, 99.2% SLO attainment, and Agent Execution Graphs achieving within 1.31x of Bélády optimality—as outcomes of experiments on SWE-bench and WebArena benchmarks. These are not derived quantities obtained by fitting parameters inside the paper's own equations or by renaming inputs as predictions. The fairness metric is described as having 'provable bounded-deviation guarantees,' but no self-referential reduction or self-citation chain is exhibited in the provided text that would make the guarantees equivalent to the inputs by construction. The derivation chain remains self-contained against external benchmarks and does not match any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AI agent workflows possess sufficient structure to allow construction of execution graphs that predict KV cache reuse across tool-call boundaries
invented entities (2)
- Agent Execution Graphs: no independent evidence
- Agent Fair Share: no independent evidence