The Price of Anarchy in Disaggregated Inference

Athos Georgiou (NCA)

arxiv: 2606.17081 · v1 · pith:YELPLNFFnew · submitted 2026-06-11 · 💻 cs.AR · cs.AI· cs.DC· cs.GT· cs.PF

The Price of Anarchy in Disaggregated Inference

Athos Georgiou (NCA) This is my paper

Pith reviewed 2026-06-27 04:49 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.GTcs.PF

keywords price of anarchydisaggregated inferenceGPU servingKV cachecongestion gameadaptive routingNVIDIA Dynamo

0 comments

The pith

Disaggregated inference turns GPU pools into competing agents whose selfish routing raises the price of anarchy once saturation hits, but an adaptive controller reduces it by 3.1 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models disaggregated inference as three coupled games between prefill and decode GPU pools. It shows that below saturation the price of anarchy stays bounded, while at saturation superlinear latency and cache effects drive an empirical estimator upward. An adaptive controller detects the transition and switches from cache-affinity to load balancing, cutting the estimator from 66.4 to 21.5 on a 70B model at 13 percent throughput cost. This matters because disaggregation is becoming common in large-model serving, and uncontrolled selfishness could waste hardware capacity.

Core claim

Saturation induces regime transitions that shift the payoff structure of the resource, caching, and congestion games, causing the empirical PoA-hat to rise sharply; an adaptive routing controller that switches strategies at the saturation knee reduces PoA-hat 3.1 times on the 70B 1P/5D topology while incurring only a 13 percent throughput penalty.

What carries the argument

The empirical PoA-hat estimator, which measures the ratio of worst-case to optimal latency under selfish routing in the congestion game with positive KV-cache externalities.

If this is right

Below saturation, selfish caching and routing produce bounded inefficiency.
At saturation, the same behavior produces superlinear latency growth.
Real-time detection of the knee allows switching routing policies without new instabilities.
The same three-regime structure appears on both 340B and 70B models with the same transition point at C=128.
Adaptive control improves the saturated operating point on both 1P/5D and 1P/2D topologies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar game structures may appear in other disaggregated systems such as training or multi-tenant clusters.
If saturation detection proves noisy, the controller could be replaced by a continuous optimization layer.
The 13 percent throughput cost might be reduced by hybrid routing that blends both strategies.
Extending the model to include energy or monetary costs could change the payoff matrices and PoA values.

Load-bearing premise

The empirical PoA-hat estimator accurately reflects the true price of anarchy and that saturation transitions can be detected in real time without introducing new instabilities.

What would settle it

Running the adaptive controller on the 70B 1P/5D topology and observing no drop in measured latency ratio or an increase in throughput loss beyond 13 percent would falsify the claim that the controller reliably improves the operating point.

Figures

Figures reproduced from arXiv: 2606.17081 by Athos Georgiou (NCA).

**Figure 1.** Figure 1: The three games recast as a mechanism-design table. Each row names the players, the Dynamo mechanism that arbitrates between them, and the resources being contested. Reading down the middle column recovers the standard Dynamo architecture—Planner, KVBM, Smart Router—but each component now carries its game-theoretic role as an explicit caption. Saturation is what binds the three rows: when one row’s cost fu… view at source ↗

**Figure 2.** Figure 2: Dynamo architecture for disaggregated inference. Components are colored by the game they implement, matching [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The three coupled games in disaggregated inference, colored to match Figure [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Asymmetric P/D saturation: TTFT P99 versus ITL P99 for the 340B (solid) and 70B [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: Estimated Price of Anarchy versus concurrency (log-log scale) for the 340B 1P/2D (solid), [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: PoAd heatmaps for the 4 × 4 parameter sweep, shown on a shared color scale (PoA 14.6 → 42.5 spans the full intensity range). (a) 340B below saturation (C = 64): all 16 configurations yield PoAd = 18.7 ± 0.10, appearing uniformly light. (b) 340B at saturation (C = 128): PoAd = 34.3 ± 3.7, almost every cell deeply tinted (1.6× spread; moderate, unstructured variation). (c) 70B 1P/2D at saturation (C = 128): … view at source ↗

**Figure 7.** Figure 7: PoA comparison under load spike (mean of 3 iterations each). Below saturation and [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗

read the original abstract

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first game-theoretic treatment of disaggregated inference with three coupled games plus a saturation-triggered router, backed by hardware runs on two models that show the same regime knee and up to 3.1x drop in their empirical PoA-hat.

read the letter

The paper's main contribution is modeling disaggregated prefill/decode serving as a resource game, a KV-cache game, and a routing congestion game, then using saturation detection to switch between affinity and load-balancing policies. On a 3-node B200 cluster with Nemotron-4-340B and Llama-3.1-70B it reports the same three-regime structure and the same C=128 transition point, with the strongest result being a 3.1x PoA-hat reduction on the 70B 1P/5D topology at 13% throughput cost.

The hardware validation is the part that lands. Running real KV transfers over InfiniBand and seeing the pattern repeat across model scales gives the regime claim more weight than a simulation-only study would.

The soft spot is the PoA-hat estimator itself. It is computed from the same runs that demonstrate the controller, and the paper does not supply an independent calculation of the true game-theoretic price of anarchy (supremum over equilibria divided by social optimum) or a sensitivity check on how the saturation detector affects the ratio. If the detector misclassifies regimes or the baseline routing is itself approximated, the reported factor could be smaller. The resource game is handled analytically but the other two rest on empirical validation only.

This is for systems builders who already operate disaggregated inference at scale and need a way to reason about contention incentives. A reader working on production serving latency will find usable numbers and a concrete controller to test.

Send it to review. The modeling is new and the experiments are on real hardware, so referees can check the estimator definition and detector stability without starting from zero.

Referee Report

2 major / 2 minor

Summary. The paper claims to deliver the first game-theoretic analysis of disaggregated inference (prefill/decode separation on distinct GPU pools) by modeling it as three coupled games: a two-player resource game, a selfish KV-cache game, and a routing congestion game. It identifies saturation-induced regime transitions that increase an empirical PoA estimator (PoA-hat, defined in §6.4), designs a real-time adaptive routing controller that switches strategies at the detected knee (C=128), and reports that the controller reduces PoA-hat by 3.1× (66.4→21.5) on the 70B 1P/5D topology at 13% throughput cost, with consistent three-regime structure observed on both Nemotron-4-340B and Llama-3.1-70B.

Significance. If the empirical PoA-hat estimator is shown to track the true game-theoretic Price of Anarchy and the saturation detector proves robust, the work supplies a principled basis for routing controllers in production disaggregated serving systems. The consistency of the C=128 knee across two models and the analytical treatment of the P/D resource game (§9.2) are strengths; the empirical validation of the caching and routing games adds concrete grounding.

major comments (2)

[§6.4] §6.4: PoA-hat is defined as an empirical ratio computed from the same experimental runs that later demonstrate the controller gains; the paper supplies no independent theoretical computation of the true PoA (supremum over equilibria divided by OPT) nor cross-validation of the estimator against an external baseline. Consequently the headline 3.1× reduction reported in §8.5 on the 70B 1P/5D topology rests on an unverified proxy.
[§8.5] §8.5 and §9.2: The adaptive controller relies on real-time saturation detection to switch routing parameters, yet the manuscript contains no sensitivity analysis of detector error rates, no examination of transient instabilities introduced by the regime switch, and no quantification of how approximation in the “optimal” routing baseline propagates into the measured PoA-hat ratio. These omissions directly affect the claim that the controller reliably moves the system to a better operating point.

minor comments (2)

[Abstract] The abstract states that the P/D resource game is treated analytically in §9.2 while the other two games are only empirically validated; a brief statement of the analytical result (e.g., closed-form equilibrium or bound) would help readers assess the completeness of the game-theoretic contribution.
Notation for the three games and the saturation threshold C is introduced without a consolidated table; a short nomenclature table would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the empirical nature of our PoA-hat estimator while agreeing to strengthen the validation of the adaptive controller.

read point-by-point responses

Referee: [§6.4] §6.4: PoA-hat is defined as an empirical ratio computed from the same experimental runs that later demonstrate the controller gains; the paper supplies no independent theoretical computation of the true PoA (supremum over equilibria divided by OPT) nor cross-validation of the estimator against an external baseline. Consequently the headline 3.1× reduction reported in §8.5 on the 70B 1P/5D topology rests on an unverified proxy.

Authors: We agree that PoA-hat is computed from the same runs used to evaluate the controller and that no closed-form theoretical PoA is provided. Section 6.4 explicitly defines PoA-hat as an empirical ratio (worst observed performance over best observed) under the three modeled games, chosen because the continuous strategy spaces and coupled dynamics preclude exact computation of the supremum over all equilibria. The 3.1× reduction is therefore an empirical demonstration of the controller's effect on this practical proxy, supported by the identical three-regime structure and C=128 knee observed on two different models. We will revise §6.4 and the discussion in §8.5 to state explicitly that PoA-hat is an estimator, explain the intractability of the theoretical PoA, and note the absence of external cross-validation. revision: partial
Referee: [§8.5] §8.5 and §9.2: The adaptive controller relies on real-time saturation detection to switch routing parameters, yet the manuscript contains no sensitivity analysis of detector error rates, no examination of transient instabilities introduced by the regime switch, and no quantification of how approximation in the “optimal” routing baseline propagates into the measured PoA-hat ratio. These omissions directly affect the claim that the controller reliably moves the system to a better operating point.

Authors: The detector threshold (C=128) was selected from the observed knee that was consistent across both Nemotron-4-340B and Llama-3.1-70B. We acknowledge the lack of sensitivity, transient, and baseline-approximation analysis. We will add a new subsection to the revised §8.5 containing (i) Monte-Carlo evaluation of detector error rates under load variation, (ii) trace-driven simulation of latency transients at regime switches, and (iii) bounds on how the routing-baseline approximation affects the reported PoA-hat ratios. These additions will quantify the robustness of the claimed operating-point improvement. revision: yes

standing simulated objections not resolved

Independent theoretical computation of the exact Price of Anarchy (supremum over equilibria divided by OPT) for the coupled continuous-state games, which remains intractable given the model complexity.

Circularity Check

0 steps flagged

No significant circularity; empirical PoA-hat estimator is measured outcome, not input to derivation.

full rationale

The paper derives a game-theoretic model of three coupled games from first principles, analytically treats the P/D resource allocation game, and designs an adaptive controller from the resulting regime analysis. PoA-hat is introduced as a post-hoc empirical estimator (Sec 6.4) whose values are computed from experimental runs after the controller is applied; the reported reductions (e.g., 66.4 to 21.5) are therefore measured effects rather than quantities that were fitted or defined into the model. No self-citations, uniqueness theorems, or ansatzes are invoked to close the central claim. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on standard game-theory assumptions about selfish agents and introduces an empirical estimator whose numerical values are obtained from the reported experiments; the saturation knee point is identified from data.

free parameters (1)

saturation knee point C = 128
Identified as the first post-knee grid point in experiments on both models

axioms (1)

domain assumption Prefill and decode pools behave as selfish agents maximizing individual performance metrics
Core modeling premise for the resource allocation and caching games

invented entities (1)

PoA-hat empirical estimator no independent evidence
purpose: To quantify price of anarchy in the congestion game under saturation
Defined in Section 6.4 and used to report controller improvements

pith-pipeline@v0.9.1-grok · 5904 in / 1394 out tokens · 56810 ms · 2026-06-27T04:49:49.697044+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 8 canonical work pages · 1 internal anchor

[1]

Gulavani, Ramachandran Ramjee, and Alexey Tumanov

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for LLM inference. InProceedings of Machine Learning and Systems (MLSys), 2024

2024
[2]

Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011

Sebastian Aland, Dominic Dumrauf, Martin Gairing, Burkhard Monien, and Florian Schopp- mann. Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011

2011
[3]

The price of routing unsplittable flow

Baruch Awerbuch, Yossi Azar, and Amir Epstein. The price of routing unsplittable flow. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 57–66, 2005

2005
[4]

Settling the complexity of two-player Nash equilibrium

Xi Chen and Xiaotie Deng. Settling the complexity of two-player Nash equilibrium. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 261–272, 2006

2006
[5]

The price of anarchy of finite congestion games

George Christodoulou and Elias Koutsoupias. The price of anarchy of finite congestion games. InProceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 67–73, 2005

2005
[6]

Selfish caching in distributed systems: a game-theoretic analysis

Byung-Gon Chun, Kamalika Chaudhuri, Hoeteck Wee, Marco Barreno, Christos H Papadim- itriou, and John Kubiatowicz. Selfish caching in distributed systems: a game-theoretic analysis. InProceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 21–30, 2004

2004
[7]

When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020

Riccardo Colini-Baldeschi, Roberto Cominetti, Panayotis Mertikopoulos, and Marco Scarsini. When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020

2020
[8]

Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023

Roberto Cominetti, Marco Scarsini, Marc Schröder, and Nicolás E Stier-Moses. Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023. 35

2023
[9]

The complexity of computing a Nash equilibrium

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. InProceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 71–78, 2006

2006
[10]

Finding social optima in congestion games with positive externalities

Bart de Keijzer and Guido Schäfer. Finding social optima in congestion games with positive externalities. InProceedings of the 20th European Symposium on Algorithms (ESA), volume 7501 ofLNCS, pages 395–406. Springer, 2012

2012
[11]

Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007

Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007

2007
[12]

Rasoul Etesami

Fatemeh Fardno and S. Rasoul Etesami. A game-theoretic framework for distributed load balancing: Static and dynamic game models.arXiv preprint arXiv:2501.15324, 2025. To appear in Automatica, 2026

work page arXiv 2025
[13]

Incentives in dominant resource fair allocation under dynamic demands

Giannis Fikioris, Rachit Agarwal, and Éva Tardos. Incentives in dominant resource fair allocation under dynamic demands. InSymposium on Algorithmic Game Theory (SAGT), 2024

2024
[14]

The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023

Jason Gaitonde and Éva Tardos. The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023

2023
[15]

Dominant resource fairness: Fair allocation of multiple resource types

Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. InProceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), pages 323–336, 2011

2011
[16]

The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014

Gail Gilboa-Freedman, Refael Hassin, and Yoav Kerner. The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014

2014
[17]

KV Pareto: Systems-level optimization of KV cache and model compression for long context inference

Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, and Elliott Delaye. KV Pareto: Systems-level optimization of KV cache and model compression for long context inference. arXiv preprint arXiv:2512.01953, 2025

work page arXiv 2025
[18]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

The price of anarchy in an exponential multi-server

Moshe Haviv and Tim Roughgarden. The price of anarchy in an exponential multi-server. Operations Research Letters, 35(4):421–426, 2007

2007
[20]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

2023
[21]

Selfish caching games on directed graphs

Qian Ma, Edmund Yeh, and Jianwei Huang. Selfish caching games on directed graphs. IEEE/ACM Transactions on Networking, 29(2):709–722, 2021

2021
[22]

Themis: Fair and efficient GPU cluster scheduling

Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. Themis: Fair and efficient GPU cluster scheduling. InProceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 289–304, 2020. 36

2020
[23]

Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996

Igal Milchtaich. Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996

1996
[24]

Potential games.Games and Economic Behavior, 14(1): 124–143, 1996

Dov Monderer and Lloyd S Shapley. Potential games.Games and Economic Behavior, 14(1): 124–143, 1996

1996
[25]

Cambridge University Press, 2007

Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V Vazirani.Algorithmic Game Theory. Cambridge University Press, 2007

2007
[26]

NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025

NVIDIA Corporation. NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025. Version 0.9.0, accessed June 2026

2025
[27]

arXiv preprint arXiv:2311.18677 , year=

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024. arXiv:2311.18677

work page arXiv 2024
[28]

Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing

Giovanni Perin, Gianluca Fighera, and Leonardo Badia. Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing. InIEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2019

2019
[29]

Existence and uniqueness of equilibrium points for concave N-person games

J Ben Rosen. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520–534, 1965

1965
[30]

A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973

Robert W Rosenthal. A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973

1973
[31]

Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015

Tim Roughgarden. Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015

2015
[32]

How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002

Tim Roughgarden and Éva Tardos. How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002

2002
[33]

FlexGen: High-throughput generative inference of large language models with a single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

2023
[34]

Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving

Chao Wang, Pengfei Zuo, Zhangyu Chen, Yunkai Liang, Zhou Yu, and Ming-Chang Yang. Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving. arXiv preprint arXiv:2508.01989, 2025

work page arXiv 2025
[35]

Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952

John Glen Wardrop. Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952

1952
[36]

Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025

Minrui Xu, Dusit Niyato, and Christopher G Brinton. Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025

work page arXiv 2025
[37]

AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026

Tianhao Xu, Yiming Liu, et al. AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026. 37

work page arXiv 2026
[38]

Gonzalez, Clark Barrett, and Ying Sheng

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[39]

Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning

Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023

2023
[40]

DistServe: Disaggregating Prefill and Decoding for Goodput- optimized Large Language Model Serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 193–210, 2024. arXiv:2401.09670. 38

work page arXiv 2024

[1] [1]

Gulavani, Ramachandran Ramjee, and Alexey Tumanov

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for LLM inference. InProceedings of Machine Learning and Systems (MLSys), 2024

2024

[2] [2]

Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011

Sebastian Aland, Dominic Dumrauf, Martin Gairing, Burkhard Monien, and Florian Schopp- mann. Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011

2011

[3] [3]

The price of routing unsplittable flow

Baruch Awerbuch, Yossi Azar, and Amir Epstein. The price of routing unsplittable flow. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 57–66, 2005

2005

[4] [4]

Settling the complexity of two-player Nash equilibrium

Xi Chen and Xiaotie Deng. Settling the complexity of two-player Nash equilibrium. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 261–272, 2006

2006

[5] [5]

The price of anarchy of finite congestion games

George Christodoulou and Elias Koutsoupias. The price of anarchy of finite congestion games. InProceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 67–73, 2005

2005

[6] [6]

Selfish caching in distributed systems: a game-theoretic analysis

Byung-Gon Chun, Kamalika Chaudhuri, Hoeteck Wee, Marco Barreno, Christos H Papadim- itriou, and John Kubiatowicz. Selfish caching in distributed systems: a game-theoretic analysis. InProceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 21–30, 2004

2004

[7] [7]

When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020

Riccardo Colini-Baldeschi, Roberto Cominetti, Panayotis Mertikopoulos, and Marco Scarsini. When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020

2020

[8] [8]

Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023

Roberto Cominetti, Marco Scarsini, Marc Schröder, and Nicolás E Stier-Moses. Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023. 35

2023

[9] [9]

The complexity of computing a Nash equilibrium

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. InProceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 71–78, 2006

2006

[10] [10]

Finding social optima in congestion games with positive externalities

Bart de Keijzer and Guido Schäfer. Finding social optima in congestion games with positive externalities. InProceedings of the 20th European Symposium on Algorithms (ESA), volume 7501 ofLNCS, pages 395–406. Springer, 2012

2012

[11] [11]

Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007

Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007

2007

[12] [12]

Rasoul Etesami

Fatemeh Fardno and S. Rasoul Etesami. A game-theoretic framework for distributed load balancing: Static and dynamic game models.arXiv preprint arXiv:2501.15324, 2025. To appear in Automatica, 2026

work page arXiv 2025

[13] [13]

Incentives in dominant resource fair allocation under dynamic demands

Giannis Fikioris, Rachit Agarwal, and Éva Tardos. Incentives in dominant resource fair allocation under dynamic demands. InSymposium on Algorithmic Game Theory (SAGT), 2024

2024

[14] [14]

The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023

Jason Gaitonde and Éva Tardos. The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023

2023

[15] [15]

Dominant resource fairness: Fair allocation of multiple resource types

Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. InProceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), pages 323–336, 2011

2011

[16] [16]

The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014

Gail Gilboa-Freedman, Refael Hassin, and Yoav Kerner. The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014

2014

[17] [17]

KV Pareto: Systems-level optimization of KV cache and model compression for long context inference

Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, and Elliott Delaye. KV Pareto: Systems-level optimization of KV cache and model compression for long context inference. arXiv preprint arXiv:2512.01953, 2025

work page arXiv 2025

[18] [18]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

The price of anarchy in an exponential multi-server

Moshe Haviv and Tim Roughgarden. The price of anarchy in an exponential multi-server. Operations Research Letters, 35(4):421–426, 2007

2007

[20] [20]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

2023

[21] [21]

Selfish caching games on directed graphs

Qian Ma, Edmund Yeh, and Jianwei Huang. Selfish caching games on directed graphs. IEEE/ACM Transactions on Networking, 29(2):709–722, 2021

2021

[22] [22]

Themis: Fair and efficient GPU cluster scheduling

Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. Themis: Fair and efficient GPU cluster scheduling. InProceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 289–304, 2020. 36

2020

[23] [23]

Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996

Igal Milchtaich. Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996

1996

[24] [24]

Potential games.Games and Economic Behavior, 14(1): 124–143, 1996

Dov Monderer and Lloyd S Shapley. Potential games.Games and Economic Behavior, 14(1): 124–143, 1996

1996

[25] [25]

Cambridge University Press, 2007

Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V Vazirani.Algorithmic Game Theory. Cambridge University Press, 2007

2007

[26] [26]

NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025

NVIDIA Corporation. NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025. Version 0.9.0, accessed June 2026

2025

[27] [27]

arXiv preprint arXiv:2311.18677 , year=

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024. arXiv:2311.18677

work page arXiv 2024

[28] [28]

Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing

Giovanni Perin, Gianluca Fighera, and Leonardo Badia. Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing. InIEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2019

2019

[29] [29]

Existence and uniqueness of equilibrium points for concave N-person games

J Ben Rosen. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520–534, 1965

1965

[30] [30]

A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973

Robert W Rosenthal. A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973

1973

[31] [31]

Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015

Tim Roughgarden. Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015

2015

[32] [32]

How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002

Tim Roughgarden and Éva Tardos. How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002

2002

[33] [33]

FlexGen: High-throughput generative inference of large language models with a single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

2023

[34] [34]

Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving

Chao Wang, Pengfei Zuo, Zhangyu Chen, Yunkai Liang, Zhou Yu, and Ming-Chang Yang. Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving. arXiv preprint arXiv:2508.01989, 2025

work page arXiv 2025

[35] [35]

Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952

John Glen Wardrop. Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952

1952

[36] [36]

Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025

Minrui Xu, Dusit Niyato, and Christopher G Brinton. Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025

work page arXiv 2025

[37] [37]

AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026

Tianhao Xu, Yiming Liu, et al. AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026. 37

work page arXiv 2026

[38] [38]

Gonzalez, Clark Barrett, and Ying Sheng

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[39] [39]

Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning

Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023

2023

[40] [40]

DistServe: Disaggregating Prefill and Decoding for Goodput- optimized Large Language Model Serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 193–210, 2024. arXiv:2401.09670. 38

work page arXiv 2024