pith. machine review for the scientific record.

arxiv: 2604.10907 · v1 · submitted 2026-04-13 · 💻 cs.NI · cs.DC


RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving


Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.NI cs.DC
keywords multi-model LLM serving · joint resource allocation · latency-aware routing · GPU cluster · output quality · latency SLO · dual-price optimization

The pith

Jointly tuning GPU shares and routing fractions across models raises output quality by up to 87 percent while meeting a fixed latency target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that routing each prompt to the right model among several deployed LLMs cannot be separated from the question of how much GPU memory and compute is given to each model. Because models share the same hardware, the latency of any one model depends on both its own allocation and on how many requests the router sends to it. RouterWise therefore enumerates feasible ways to partition the GPUs, profiles each partition to learn its load-dependent latency curves, and then solves a dual-price routing problem to maximize a quality score subject to the latency SLO. A reader should care because the same total hardware can deliver dramatically different quality depending on how the resources are divided before routing begins.
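The pipeline described above, enumerating feasible partitions, consulting profiled latency curves, and keeping the best setup-plus-routing pair, can be sketched in a few lines. Everything in this sketch is illustrative: the latency curve, the per-model quality scores, and the greedy routing stand-in are assumptions for the example, not the paper's actual models or solver.

```python
# Toy sketch of the joint search loop: enumerate GPU partitions, solve
# routing per partition, keep the best quality that meets the SLO.
from itertools import product

TOTAL_GPUS = 8
SLO_MS = 500.0

def profiled_latency(gpus, load):
    """Stand-in latency model: more GPUs -> lower latency, more load ->
    higher latency. In the paper this curve comes from system profiling."""
    return 50.0 + 400.0 * load / max(gpus, 1)

def best_routing_quality(partition, qualities, total_load=1.0):
    """Stand-in for the dual-price routing solver: here, greedily route
    each slice of load to the highest-quality model still under the SLO."""
    remaining = total_load
    loads = [0.0] * len(partition)
    quality = 0.0
    step = 0.01
    while remaining > 1e-9:
        # models whose latency stays under the SLO after one more slice
        ok = [m for m, g in enumerate(partition)
              if g > 0 and profiled_latency(g, loads[m] + step) <= SLO_MS]
        if not ok:
            return None  # this partition is infeasible at this load
        m = max(ok, key=lambda i: qualities[i])
        loads[m] += step
        quality += step * qualities[m]
        remaining -= step
    return quality

qualities = [0.9, 0.7, 0.5]  # per-model quality scores (illustrative)
best = None
for partition in product(range(TOTAL_GPUS + 1), repeat=3):
    if sum(partition) != TOTAL_GPUS:
        continue
    q = best_routing_quality(partition, qualities)
    if q is not None and (best is None or q > best[0]):
        best = (q, partition)
print(best)
```

The point the paper makes is visible even in this toy: partitions that starve the high-quality model are feasible yet leave quality on the table, so the outer search over setups matters as much as the inner routing solve.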

Core claim

The central claim is that, even when the total GPU cluster is held fixed, different feasible partitions of those GPUs among a set of models produce routing policies whose highest achievable quality scores differ by as much as 87 percent, all while satisfying the same end-to-end latency service-level objective. RouterWise obtains this result by first building a latency model for every candidate deployment setup through system profiling, then using a dual-price formulation to compute the routing fractions that maximize quality for that fixed setup, and finally selecting the setup-plus-routing pair with the best quality.

What carries the argument

Setup-specific latency models obtained from profiling, used inside a dual-price optimization that allocates routing fractions to maximize quality under a latency constraint.
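A minimal sketch of what a dual-price routing step could look like for one fixed setup: a scalar price on latency is bisected until the SLO holds, and traffic is routed by price-adjusted quality. The latency curves, the bisection scheme, and the greedy allocation are all illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of dual-price routing for one fixed setup: the price `lam`
# penalizes latency, and we raise it until the latency SLO is satisfied.

def latency(model, load):
    """Profiled, load-dependent latency curve for this fixed setup (stub)."""
    base, slope = [(80.0, 300.0), (40.0, 150.0)][model]
    return base + slope * load

QUALITY = [0.9, 0.6]  # per-model quality scores (illustrative)
SLO_MS = 250.0

def route(lam, total_load=1.0, step=0.01):
    """For a given dual price, greedily send each slice of traffic to the
    model with the best price-adjusted score q_m - lam * latency_m."""
    loads = [0.0, 0.0]
    t = 0.0
    while t < total_load - 1e-9:
        m = max(range(2), key=lambda i: QUALITY[i] - lam * latency(i, loads[i]))
        loads[m] += step
        t += step
    return loads

# Bisect on the price: a larger lam shifts traffic toward the faster model.
lo, hi = 0.0, 1.0
for _ in range(50):
    lam = (lo + hi) / 2
    loads = route(lam)
    worst = max(latency(m, loads[m]) for m in range(2) if loads[m] > 0)
    if worst <= SLO_MS:
        hi = lam   # feasible: try a smaller price (more quality)
    else:
        lo = lam
loads = route(hi)
score = sum(QUALITY[m] * loads[m] for m in range(2))
```

At price zero all traffic chases the best model and violates the SLO; at a high price traffic spreads toward faster models; the bisection finds the smallest price at which the latency constraint binds, which is the intuition behind a dual-price formulation.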

Load-bearing premise

The latency curves measured during offline profiling still correctly predict end-to-end latency once the chosen routing fractions start sending a steady load to each model under its allocated resources.

What would settle it

Profile three or more models on a fixed GPU cluster, run RouterWise to select a setup and routing policy, then measure actual quality and latency on a live workload; if the measured quality lies within a few percent of what a hand-chosen allocation that ignores the joint search achieves, the claimed performance gap disappears.

Figures

Figures reproduced from arXiv: 2604.10907 by Adel N. Toosi, Christopher Leckie, Hossein Hosseini Kasnavieh.

Figure 1: Resource allocation strategies in multi-model LLM serving. Models […]
Figure 2: P95 TTFT versus GPU thread percentage under different traffic loads.
Figure 3: Distribution of predicted routing scores for different models on Router […]
Figure 4: Average TTFT as a function of input load under different tensor parallelism levels and maximum GPU compute shares […]
Figure 5: Score-latency scatter plots of retained setups under two settings of […]
Figure 6: Scalability of the setup space as the number of GPUs increases. Each […]
original abstract

Multi-model LLM routing has emerged as an effective approach for reducing serving cost and latency while maintaining output quality by assigning each prompt to an appropriate model. However, prior routing methods typically assume that each model has a fixed latency. In real deployments, this assumption is inaccurate: multiple models often share limited GPU resources, and a model's latency depends strongly on both its allocated resources and the request load induced by the routing policy. Consequently, routing and resource allocation are tightly coupled. In this work, we study joint resource allocation and routing for latency-aware multi-model LLM serving in GPU clusters. Given a set of deployed models and a latency service-level objective (SLO), we seek a system setup and routing policy that maximize overall output quality while satisfying the latency target. We formalize this problem as a constrained joint optimization over deployment setup and routing fractions, and propose RouterWise, which combines a dual-price formulation for score-maximizing routing with setup-specific latency models derived from system profiling. RouterWise searches over feasible system setups and, for each fixed setup, computes the best routing policy under the latency target. Our results show that even on the same GPU cluster, achievable output-quality score can vary by up to 87% across retained setups, highlighting that resource allocation is a key determinant of routing performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RouterWise, a system for jointly optimizing resource allocation (GPU setups) and routing fractions in multi-model LLM serving to maximize aggregate output quality subject to a latency SLO. It searches over feasible deployments, derives per-setup latency models via profiling, and applies a dual-price formulation to solve for quality-maximizing routing policies under the latency constraint for each setup. The central empirical claim is that, even on identical GPU hardware, achievable output-quality scores vary by up to 87% across retained setups, demonstrating that resource allocation is a key determinant of routing performance.

Significance. If the profiled latency models accurately predict end-to-end behavior under routing-induced loads, the result is significant: it shows that treating model latencies as fixed (a common assumption in prior routing work) can miss large quality gains achievable by co-optimizing allocation and routing. The dual-price plus profiling approach offers a practical, extensible method for this joint problem. Credit is due for the formal constrained-optimization framing and the concrete demonstration of quality variation across setups.

major comments (2)
  1. [Latency Modeling and Experimental Evaluation] Latency model validation (profiling section and experimental evaluation): The 87% quality-variation claim rests on correctly classifying setups as feasible or infeasible under the solved routing fractions. The manuscript derives latency models from system profiling but provides no cross-validation or end-to-end measurements showing prediction accuracy once the routing policy actually loads each model (including cross-model interference, batching dynamics, and queuing). If profiling occurs under synthetic or isolated conditions, some setups may be misclassified, directly affecting the reported spread.
  2. [Results] Experimental details supporting the 87% result (results section and tables): The headline variation is stated without reporting the total number of setups searched, the exact quality-scoring procedure, number of trials per setup, error bars, or statistical tests. This information is required to evaluate whether the spread is robust and whether resource allocation is indeed the dominant factor rather than an artifact of the evaluation methodology.
minor comments (2)
  1. [Abstract] Abstract: 'Retained setups' is used without defining the retention filter or how many candidate setups are discarded before the 87% comparison.
  2. [RouterWise Formulation] Notation: The dual-price update rule and convergence criterion for the routing subproblem would benefit from an explicit algorithm box or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional validation and experimental details as outlined.

point-by-point responses
  1. Referee: [Latency Modeling and Experimental Evaluation] Latency model validation (profiling section and experimental evaluation): The 87% quality-variation claim rests on correctly classifying setups as feasible or infeasible under the solved routing fractions. The manuscript derives latency models from system profiling but provides no cross-validation or end-to-end measurements showing prediction accuracy once the routing policy actually loads each model (including cross-model interference, batching dynamics, and queuing). If profiling occurs under synthetic or isolated conditions, some setups may be misclassified, directly affecting the reported spread.

    Authors: We agree that explicit cross-validation of the latency models under routing-induced loads is important to confirm the accuracy of feasibility classifications. While our profiling incorporated concurrent executions to capture interference and batching, we did not report direct comparisons between predicted and measured end-to-end latencies for the optimized policies. In the revised manuscript, we will add a validation subsection presenting end-to-end measurements for representative setups under the solved routing fractions, including quantification of prediction errors and confirmation that the reported quality variation is not affected by misclassifications. revision: yes

  2. Referee: [Results] Experimental details supporting the 87% result (results section and tables): The headline variation is stated without reporting the total number of setups searched, the exact quality-scoring procedure, number of trials per setup, error bars, or statistical tests. This information is required to evaluate whether the spread is robust and whether resource allocation is indeed the dominant factor rather than an artifact of the evaluation methodology.

    Authors: We agree that these experimental details are necessary to substantiate the robustness of the 87% quality variation. The revised results section will report the total number of setups searched and retained, the precise quality-scoring procedure and aggregation method, the number of trials per setup, error bars on all quality metrics, and statistical tests (such as significance testing across setups) to demonstrate that the observed spread is not an artifact of the evaluation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external profiling and independent optimization

full rationale

The paper's central result (up to 87% quality variation across setups) is obtained by profiling latency models for each candidate resource allocation, then solving a constrained optimization over routing fractions to maximize quality subject to the SLO under those models, and finally comparing the optimized qualities. This chain uses measured inputs and a standard optimization procedure rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The latency models are constructed from system profiling data independent of the final routing solution, and the variation metric is computed post-optimization. No equation or step reduces the claimed performance spread to a quantity defined by the paper's own fitted outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on profiled latency models being predictive and on the dual-price method correctly solving the routing subproblem for each fixed setup.

free parameters (1)
  • latency model coefficients
    Parameters of the setup-specific latency models are derived from profiling measurements and therefore fitted to observed data.
axioms (1)
  • domain assumption: a model's latency is a function of its allocated GPU resources and the request load induced by the routing policy.
    Explicitly stated as the reason prior fixed-latency assumptions fail.
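As a concrete reading of the "latency model coefficients" entry, here is how setup-specific coefficients might be fitted from profiling measurements. The linear form l(load) = a + b·load and the numbers are assumptions for the sketch; the paper's actual functional form is not specified here.

```python
# Illustrative least-squares fit of a setup-specific latency model from
# profiling points (the linear form is an assumption, not the paper's).

def fit_latency_model(loads, latencies):
    """Ordinary least squares for l(load) = a + b * load, in closed form."""
    n = len(loads)
    mx = sum(loads) / n
    my = sum(latencies) / n
    b = sum((x - mx) * (y - my) for x, y in zip(loads, latencies)) \
        / sum((x - mx) ** 2 for x in loads)
    a = my - b * mx
    return a, b

# Hypothetical profiling measurements: request load vs. observed latency (ms).
loads = [0.1, 0.3, 0.5, 0.7, 0.9]
lat = [62.0, 85.0, 110.0, 133.0, 158.0]
a, b = fit_latency_model(loads, lat)
predict = lambda load: a + b * load
```

These fitted coefficients are exactly the kind of free parameter the ledger flags: they are estimated from offline measurements, and the whole argument depends on `predict` remaining accurate once the routing policy imposes its own steady load.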

pith-pipeline@v0.9.0 · 5542 in / 1211 out tokens · 30362 ms · 2026-05-10T16:33:24.658836+00:00 · methodology

