pith. sign in

arxiv: 2606.17081 · v1 · pith:YELPLNFFnew · submitted 2026-06-11 · 💻 cs.AR · cs.AI· cs.DC· cs.GT· cs.PF

The Price of Anarchy in Disaggregated Inference

Pith reviewed 2026-06-27 04:49 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.GTcs.PF
keywords price of anarchydisaggregated inferenceGPU servingKV cachecongestion gameadaptive routingNVIDIA Dynamo
0
0 comments X

The pith

Disaggregated inference turns GPU pools into competing agents whose selfish routing raises the price of anarchy once saturation hits, but an adaptive controller reduces it by 3.1 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models disaggregated inference as three coupled games between prefill and decode GPU pools. It shows that below saturation the price of anarchy stays bounded, while at saturation superlinear latency and cache effects drive an empirical estimator upward. An adaptive controller detects the transition and switches from cache-affinity to load balancing, cutting the estimator from 66.4 to 21.5 on a 70B model at 13 percent throughput cost. This matters because disaggregation is becoming common in large-model serving, and uncontrolled selfishness could waste hardware capacity.

Core claim

Saturation induces regime transitions that shift the payoff structure of the resource, caching, and congestion games, causing the empirical PoA-hat to rise sharply; an adaptive routing controller that switches strategies at the saturation knee reduces PoA-hat 3.1 times on the 70B 1P/5D topology while incurring only a 13 percent throughput penalty.

What carries the argument

The empirical PoA-hat estimator, which measures the ratio of worst-case to optimal latency under selfish routing in the congestion game with positive KV-cache externalities.

If this is right

  • Below saturation, selfish caching and routing produce bounded inefficiency.
  • At saturation, the same behavior produces superlinear latency growth.
  • Real-time detection of the knee allows switching routing policies without new instabilities.
  • The same three-regime structure appears on both 340B and 70B models with the same transition point at C=128.
  • Adaptive control improves the saturated operating point on both 1P/5D and 1P/2D topologies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar game structures may appear in other disaggregated systems such as training or multi-tenant clusters.
  • If saturation detection proves noisy, the controller could be replaced by a continuous optimization layer.
  • The 13 percent throughput cost might be reduced by hybrid routing that blends both strategies.
  • Extending the model to include energy or monetary costs could change the payoff matrices and PoA values.

Load-bearing premise

The empirical PoA-hat estimator accurately reflects the true price of anarchy and that saturation transitions can be detected in real time without introducing new instabilities.

What would settle it

Running the adaptive controller on the 70B 1P/5D topology and observing no drop in measured latency ratio or an increase in throughput loss beyond 13 percent would falsify the claim that the controller reliably improves the operating point.

Figures

Figures reproduced from arXiv: 2606.17081 by Athos Georgiou (NCA).

Figure 1
Figure 1. Figure 1: The three games recast as a mechanism-design table. Each row names the players, the Dynamo mechanism that arbitrates between them, and the resources being contested. Reading down the middle column recovers the standard Dynamo architecture—Planner, KVBM, Smart Router—but each component now carries its game-theoretic role as an explicit caption. Saturation is what binds the three rows: when one row’s cost fu… view at source ↗
Figure 2
Figure 2. Figure 2: Dynamo architecture for disaggregated inference. Components are colored by the game they implement, matching [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The three coupled games in disaggregated inference, colored to match Figure [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Asymmetric P/D saturation: TTFT P99 versus ITL P99 for the 340B (solid) and 70B [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Estimated Price of Anarchy versus concurrency (log-log scale) for the 340B 1P/2D (solid), [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: PoAd heatmaps for the 4 × 4 parameter sweep, shown on a shared color scale (PoA 14.6 → 42.5 spans the full intensity range). (a) 340B below saturation (C = 64): all 16 configurations yield PoAd = 18.7 ± 0.10, appearing uniformly light. (b) 340B at saturation (C = 128): PoAd = 34.3 ± 3.7, almost every cell deeply tinted (1.6× spread; moderate, unstructured variation). (c) 70B 1P/2D at saturation (C = 128): … view at source ↗
Figure 7
Figure 7. Figure 7: PoA comparison under load spike (mean of 3 iterations each). Below saturation and [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗
read the original abstract

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to deliver the first game-theoretic analysis of disaggregated inference (prefill/decode separation on distinct GPU pools) by modeling it as three coupled games: a two-player resource game, a selfish KV-cache game, and a routing congestion game. It identifies saturation-induced regime transitions that increase an empirical PoA estimator (PoA-hat, defined in §6.4), designs a real-time adaptive routing controller that switches strategies at the detected knee (C=128), and reports that the controller reduces PoA-hat by 3.1× (66.4→21.5) on the 70B 1P/5D topology at 13% throughput cost, with consistent three-regime structure observed on both Nemotron-4-340B and Llama-3.1-70B.

Significance. If the empirical PoA-hat estimator is shown to track the true game-theoretic Price of Anarchy and the saturation detector proves robust, the work supplies a principled basis for routing controllers in production disaggregated serving systems. The consistency of the C=128 knee across two models and the analytical treatment of the P/D resource game (§9.2) are strengths; the empirical validation of the caching and routing games adds concrete grounding.

major comments (2)
  1. [§6.4] §6.4: PoA-hat is defined as an empirical ratio computed from the same experimental runs that later demonstrate the controller gains; the paper supplies no independent theoretical computation of the true PoA (supremum over equilibria divided by OPT) nor cross-validation of the estimator against an external baseline. Consequently the headline 3.1× reduction reported in §8.5 on the 70B 1P/5D topology rests on an unverified proxy.
  2. [§8.5] §8.5 and §9.2: The adaptive controller relies on real-time saturation detection to switch routing parameters, yet the manuscript contains no sensitivity analysis of detector error rates, no examination of transient instabilities introduced by the regime switch, and no quantification of how approximation in the “optimal” routing baseline propagates into the measured PoA-hat ratio. These omissions directly affect the claim that the controller reliably moves the system to a better operating point.
minor comments (2)
  1. [Abstract] The abstract states that the P/D resource game is treated analytically in §9.2 while the other two games are only empirically validated; a brief statement of the analytical result (e.g., closed-form equilibrium or bound) would help readers assess the completeness of the game-theoretic contribution.
  2. Notation for the three games and the saturation threshold C is introduced without a consolidated table; a short nomenclature table would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the empirical nature of our PoA-hat estimator while agreeing to strengthen the validation of the adaptive controller.

read point-by-point responses
  1. Referee: [§6.4] §6.4: PoA-hat is defined as an empirical ratio computed from the same experimental runs that later demonstrate the controller gains; the paper supplies no independent theoretical computation of the true PoA (supremum over equilibria divided by OPT) nor cross-validation of the estimator against an external baseline. Consequently the headline 3.1× reduction reported in §8.5 on the 70B 1P/5D topology rests on an unverified proxy.

    Authors: We agree that PoA-hat is computed from the same runs used to evaluate the controller and that no closed-form theoretical PoA is provided. Section 6.4 explicitly defines PoA-hat as an empirical ratio (worst observed performance over best observed) under the three modeled games, chosen because the continuous strategy spaces and coupled dynamics preclude exact computation of the supremum over all equilibria. The 3.1× reduction is therefore an empirical demonstration of the controller's effect on this practical proxy, supported by the identical three-regime structure and C=128 knee observed on two different models. We will revise §6.4 and the discussion in §8.5 to state explicitly that PoA-hat is an estimator, explain the intractability of the theoretical PoA, and note the absence of external cross-validation. revision: partial

  2. Referee: [§8.5] §8.5 and §9.2: The adaptive controller relies on real-time saturation detection to switch routing parameters, yet the manuscript contains no sensitivity analysis of detector error rates, no examination of transient instabilities introduced by the regime switch, and no quantification of how approximation in the “optimal” routing baseline propagates into the measured PoA-hat ratio. These omissions directly affect the claim that the controller reliably moves the system to a better operating point.

    Authors: The detector threshold (C=128) was selected from the observed knee that was consistent across both Nemotron-4-340B and Llama-3.1-70B. We acknowledge the lack of sensitivity, transient, and baseline-approximation analysis. We will add a new subsection to the revised §8.5 containing (i) Monte-Carlo evaluation of detector error rates under load variation, (ii) trace-driven simulation of latency transients at regime switches, and (iii) bounds on how the routing-baseline approximation affects the reported PoA-hat ratios. These additions will quantify the robustness of the claimed operating-point improvement. revision: yes

standing simulated objections not resolved
  • Independent theoretical computation of the exact Price of Anarchy (supremum over equilibria divided by OPT) for the coupled continuous-state games, which remains intractable given the model complexity.

Circularity Check

0 steps flagged

No significant circularity; empirical PoA-hat estimator is measured outcome, not input to derivation.

full rationale

The paper derives a game-theoretic model of three coupled games from first principles, analytically treats the P/D resource allocation game, and designs an adaptive controller from the resulting regime analysis. PoA-hat is introduced as a post-hoc empirical estimator (Sec 6.4) whose values are computed from experimental runs after the controller is applied; the reported reductions (e.g., 66.4 to 21.5) are therefore measured effects rather than quantities that were fitted or defined into the model. No self-citations, uniqueness theorems, or ansatzes are invoked to close the central claim. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on standard game-theory assumptions about selfish agents and introduces an empirical estimator whose numerical values are obtained from the reported experiments; the saturation knee point is identified from data.

free parameters (1)
  • saturation knee point C = 128
    Identified as the first post-knee grid point in experiments on both models
axioms (1)
  • domain assumption Prefill and decode pools behave as selfish agents maximizing individual performance metrics
    Core modeling premise for the resource allocation and caching games
invented entities (1)
  • PoA-hat empirical estimator no independent evidence
    purpose: To quantify price of anarchy in the congestion game under saturation
    Defined in Section 6.4 and used to report controller improvements

pith-pipeline@v0.9.1-grok · 5904 in / 1394 out tokens · 56810 ms · 2026-06-27T04:49:49.697044+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Gulavani, Ramachandran Ramjee, and Alexey Tumanov

    Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for LLM inference. InProceedings of Machine Learning and Systems (MLSys), 2024

  2. [2]

    Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011

    Sebastian Aland, Dominic Dumrauf, Martin Gairing, Burkhard Monien, and Florian Schopp- mann. Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011

  3. [3]

    The price of routing unsplittable flow

    Baruch Awerbuch, Yossi Azar, and Amir Epstein. The price of routing unsplittable flow. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 57–66, 2005

  4. [4]

    Settling the complexity of two-player Nash equilibrium

    Xi Chen and Xiaotie Deng. Settling the complexity of two-player Nash equilibrium. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 261–272, 2006

  5. [5]

    The price of anarchy of finite congestion games

    George Christodoulou and Elias Koutsoupias. The price of anarchy of finite congestion games. InProceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 67–73, 2005

  6. [6]

    Selfish caching in distributed systems: a game-theoretic analysis

    Byung-Gon Chun, Kamalika Chaudhuri, Hoeteck Wee, Marco Barreno, Christos H Papadim- itriou, and John Kubiatowicz. Selfish caching in distributed systems: a game-theoretic analysis. InProceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 21–30, 2004

  7. [7]

    When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020

    Riccardo Colini-Baldeschi, Roberto Cominetti, Panayotis Mertikopoulos, and Marco Scarsini. When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020

  8. [8]

    Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023

    Roberto Cominetti, Marco Scarsini, Marc Schröder, and Nicolás E Stier-Moses. Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023. 35

  9. [9]

    The complexity of computing a Nash equilibrium

    Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. InProceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 71–78, 2006

  10. [10]

    Finding social optima in congestion games with positive externalities

    Bart de Keijzer and Guido Schäfer. Finding social optima in congestion games with positive externalities. InProceedings of the 20th European Symposium on Algorithms (ESA), volume 7501 ofLNCS, pages 395–406. Springer, 2012

  11. [11]

    Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007

    Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007

  12. [12]

    Rasoul Etesami

    Fatemeh Fardno and S. Rasoul Etesami. A game-theoretic framework for distributed load balancing: Static and dynamic game models.arXiv preprint arXiv:2501.15324, 2025. To appear in Automatica, 2026

  13. [13]

    Incentives in dominant resource fair allocation under dynamic demands

    Giannis Fikioris, Rachit Agarwal, and Éva Tardos. Incentives in dominant resource fair allocation under dynamic demands. InSymposium on Algorithmic Game Theory (SAGT), 2024

  14. [14]

    The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023

    Jason Gaitonde and Éva Tardos. The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023

  15. [15]

    Dominant resource fairness: Fair allocation of multiple resource types

    Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. InProceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), pages 323–336, 2011

  16. [16]

    The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014

    Gail Gilboa-Freedman, Refael Hassin, and Yoav Kerner. The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014

  17. [17]

    KV Pareto: Systems-level optimization of KV cache and model compression for long context inference

    Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, and Elliott Delaye. KV Pareto: Systems-level optimization of KV cache and model compression for long context inference. arXiv preprint arXiv:2512.01953, 2025

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    The price of anarchy in an exponential multi-server

    Moshe Haviv and Tim Roughgarden. The price of anarchy in an exponential multi-server. Operations Research Letters, 35(4):421–426, 2007

  20. [20]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

  21. [21]

    Selfish caching games on directed graphs

    Qian Ma, Edmund Yeh, and Jianwei Huang. Selfish caching games on directed graphs. IEEE/ACM Transactions on Networking, 29(2):709–722, 2021

  22. [22]

    Themis: Fair and efficient GPU cluster scheduling

    Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. Themis: Fair and efficient GPU cluster scheduling. InProceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 289–304, 2020. 36

  23. [23]

    Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996

    Igal Milchtaich. Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996

  24. [24]

    Potential games.Games and Economic Behavior, 14(1): 124–143, 1996

    Dov Monderer and Lloyd S Shapley. Potential games.Games and Economic Behavior, 14(1): 124–143, 1996

  25. [25]

    Cambridge University Press, 2007

    Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V Vazirani.Algorithmic Game Theory. Cambridge University Press, 2007

  26. [26]

    NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025

    NVIDIA Corporation. NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025. Version 0.9.0, accessed June 2026

  27. [27]

    arXiv preprint arXiv:2311.18677 , year=

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024. arXiv:2311.18677

  28. [28]

    Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing

    Giovanni Perin, Gianluca Fighera, and Leonardo Badia. Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing. InIEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2019

  29. [29]

    Existence and uniqueness of equilibrium points for concave N-person games

    J Ben Rosen. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520–534, 1965

  30. [30]

    A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973

    Robert W Rosenthal. A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973

  31. [31]

    Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015

    Tim Roughgarden. Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015

  32. [32]

    How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002

    Tim Roughgarden and Éva Tardos. How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002

  33. [33]

    FlexGen: High-throughput generative inference of large language models with a single GPU

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

  34. [34]

    Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving

    Chao Wang, Pengfei Zuo, Zhangyu Chen, Yunkai Liang, Zhou Yu, and Ming-Chang Yang. Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving. arXiv preprint arXiv:2508.01989, 2025

  35. [35]

    Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952

    John Glen Wardrop. Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952

  36. [36]

    Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025

    Minrui Xu, Dusit Niyato, and Christopher G Brinton. Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025

  37. [37]

    AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026

    Tianhao Xu, Yiming Liu, et al. AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026. 37

  38. [38]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  39. [39]

    Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning

    Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023

  40. [40]

    DistServe: Disaggregating Prefill and Decoding for Goodput- optimized Large Language Model Serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 193–210, 2024. arXiv:2401.09670. 38