RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3
The pith
On a fixed GPU cluster, the choice of GPU partition and routing fractions can swing the best achievable output quality by up to 87 percent under the same latency target, making joint tuning of the two essential.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that, even when the total GPU cluster is held fixed, different feasible partitions of those GPUs among a set of models produce routing policies whose highest achievable quality scores differ by as much as 87 percent, all while satisfying the same end-to-end latency service-level objective. RouterWise obtains this result by first building a latency model for every candidate deployment setup through system profiling, then using a dual-price formulation to compute the routing fractions that maximize quality for that fixed setup, and finally selecting the setup-plus-routing pair with the best quality.
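Read as a two-level program, the claim has roughly the following shape; the notation below is illustrative rather than the paper's: s ranges over feasible setups, f_m is the fraction of traffic routed to model m, q_m is that model's quality score, and L̂_s is the setup's profiled latency model.

```latex
% Illustrative formalization (notation assumed, not taken from the paper)
\max_{s \in \mathcal{S}} \;
\max_{f \ge 0,\; \sum_m f_m = 1} \;
\sum_m f_m \, q_m
\quad \text{subject to} \quad
\hat{L}_s(f) \le L_{\mathrm{SLO}}
```

The inner maximization is where the dual-price machinery lives: attaching a price λ to the SLO constraint turns routing into ranking models by quality minus priced marginal latency.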
What carries the argument
Setup-specific latency models obtained from profiling, used inside a dual-price optimization that allocates routing fractions to maximize quality under a latency constraint.
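A minimal sketch of how such a dual-price routing subproblem can be solved for one fixed setup, assuming a per-model quality vector and a profiled latency function; every name here (quality, latency_model, solve_routing) is illustrative, not the paper's actual interface:

```python
import numpy as np

def solve_routing(quality, latency_model, slo, lr=0.05, iters=2000):
    """Dual ascent on the latency constraint for one fixed setup (sketch).

    quality:       array of per-model quality scores q_m (illustrative).
    latency_model: f -> predicted end-to-end latency under routing
                   fractions f, from offline profiling (assumed interface).
    slo:           latency target the routing policy must satisfy.
    """
    n = len(quality)
    lam = 0.0                    # dual price on the latency SLO
    f = np.full(n, 1.0 / n)      # start from uniform routing
    for _ in range(iters):
        # Price-adjusted score: reward quality, charge each model's
        # marginal latency at the current point (finite difference).
        grad_latency = np.array([
            (latency_model(f + 1e-4 * e) - latency_model(f)) / 1e-4
            for e in np.eye(n)
        ])
        scores = quality - lam * grad_latency
        # Shift traffic toward the best price-adjusted models, smoothed
        # so the fractions move gradually (softmax step).
        target = np.exp(scores - scores.max())
        target /= target.sum()
        f = 0.9 * f + 0.1 * target
        # Raise the price if the SLO is violated, lower it otherwise.
        lam = max(0.0, lam + lr * (latency_model(f) - slo))
    return f, lam
```

The outer search then keeps, among all profiled setups, the one whose solved fractions achieve the highest aggregate quality while the predicted latency meets the SLO.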
Load-bearing premise
The latency curves measured during offline profiling still correctly predict end-to-end latency once the chosen routing fractions start sending a steady load to each model under its allocated resources.
What would settle it
Profile three or more models on a fixed GPU cluster, run RouterWise to select a setup and routing policy, then measure actual quality and latency in a live workload; if the measured quality lies within a few percent of the quality obtained from a hand-chosen allocation that ignores the joint search, the performance gap disappears.
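As a sketch, the verdict of that experiment reduces to a relative gap measured on the same live workload; the numbers and names below are hypothetical:

```python
def relative_quality_gap(joint_quality, baseline_quality):
    """Relative advantage of the jointly searched setup-plus-routing pair
    over a hand-chosen allocation, both measured live (hypothetical)."""
    return (joint_quality - baseline_quality) / baseline_quality

# Hypothetical measured scores; a gap within a few percent of zero
# would mean the joint search bought nothing in practice.
gap = relative_quality_gap(joint_quality=0.71, baseline_quality=0.69)
print(f"relative quality gap: {gap:.1%}")
```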
Original abstract
Multi-model LLM routing has emerged as an effective approach for reducing serving cost and latency while maintaining output quality by assigning each prompt to an appropriate model. However, prior routing methods typically assume that each model has a fixed latency. In real deployments, this assumption is inaccurate: multiple models often share limited GPU resources, and a model's latency depends strongly on both its allocated resources and the request load induced by the routing policy. Consequently, routing and resource allocation are tightly coupled. In this work, we study joint resource allocation and routing for latency-aware multi-model LLM serving in GPU clusters. Given a set of deployed models and a latency service-level objective (SLO), we seek a system setup and routing policy that maximize overall output quality while satisfying the latency target. We formalize this problem as a constrained joint optimization over deployment setup and routing fractions, and propose RouterWise, which combines a dual-price formulation for score-maximizing routing with setup-specific latency models derived from system profiling. RouterWise searches over feasible system setups and, for each fixed setup, computes the best routing policy under the latency target. Our results show that even on the same GPU cluster, achievable output-quality score can vary by up to 87% across retained setups, highlighting that resource allocation is a key determinant of routing performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RouterWise, a system for jointly optimizing resource allocation (GPU setups) and routing fractions in multi-model LLM serving to maximize aggregate output quality subject to a latency SLO. It searches over feasible deployments, derives per-setup latency models via profiling, and applies a dual-price formulation to solve for quality-maximizing routing policies under the latency constraint for each setup. The central empirical claim is that, even on identical GPU hardware, achievable output-quality scores vary by up to 87% across retained setups, demonstrating that resource allocation is a key determinant of routing performance.
Significance. If the profiled latency models accurately predict end-to-end behavior under routing-induced loads, the result is significant: it shows that treating model latencies as fixed (a common assumption in prior routing work) can miss large quality gains achievable by co-optimizing allocation and routing. The dual-price plus profiling approach offers a practical, extensible method for this joint problem. Credit is due for the formal constrained-optimization framing and the concrete demonstration of quality variation across setups.
major comments (2)
- [Latency Modeling and Experimental Evaluation] Latency model validation (profiling section and experimental evaluation): The 87% quality-variation claim rests on correctly classifying setups as feasible or infeasible under the solved routing fractions. The manuscript derives latency models from system profiling but provides no cross-validation or end-to-end measurements showing prediction accuracy once the routing policy actually loads each model (including cross-model interference, batching dynamics, and queuing). If profiling occurs under synthetic or isolated conditions, some setups may be misclassified, directly affecting the reported spread. (A hedged validation sketch follows this list.)
- [Results] Experimental details supporting the 87% result (results section and tables): The headline variation is stated without reporting the total number of setups searched, the exact quality-scoring procedure, number of trials per setup, error bars, or statistical tests. This information is required to evaluate whether the spread is robust and whether resource allocation is indeed the dominant factor rather than an artifact of the evaluation methodology.
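One way the cross-validation asked for in the first major comment could be run, as a minimal sketch; predict_latency and measure_latency are assumed interfaces, not anything the manuscript defines:

```python
import numpy as np

def validate_latency_models(predict_latency, measure_latency, slo,
                            setups, policies):
    """Compare profiled latency predictions against live measurements for
    each retained setup under its solved routing fractions (all assumed).

    predict_latency(setup, f): the setup-specific profiled latency model.
    measure_latency(setup, f): end-to-end latency under steady routed load,
                               including interference, batching, queuing.
    """
    rel_errors, misclassified = [], 0
    for setup, f in zip(setups, policies):
        pred = predict_latency(setup, f)
        meas = measure_latency(setup, f)
        rel_errors.append((meas - pred) / meas)
        # Feasibility flips when prediction and measurement straddle the SLO.
        if (pred <= slo) != (meas <= slo):
            misclassified += 1
    return float(np.mean(rel_errors)), misclassified
```

If misclassified is nonzero for setups near either end of the quality spread, the 87 percent figure is directly at risk.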
minor comments (2)
- [Abstract] Abstract: 'Retained setups' is used without defining the retention filter or how many candidate setups are discarded before the 87% comparison.
- [RouterWise Formulation] Notation: The dual-price update rule and convergence criterion for the routing subproblem would benefit from an explicit algorithm box or pseudocode.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional validation and experimental details as outlined.
point-by-point responses
-
Referee: [Latency Modeling and Experimental Evaluation] Latency model validation (profiling section and experimental evaluation): The 87% quality-variation claim rests on correctly classifying setups as feasible or infeasible under the solved routing fractions. The manuscript derives latency models from system profiling but provides no cross-validation or end-to-end measurements showing prediction accuracy once the routing policy actually loads each model (including cross-model interference, batching dynamics, and queuing). If profiling occurs under synthetic or isolated conditions, some setups may be misclassified, directly affecting the reported spread.
Authors: We agree that explicit cross-validation of the latency models under routing-induced loads is important to confirm the accuracy of feasibility classifications. While our profiling incorporated concurrent executions to capture interference and batching, we did not report direct comparisons between predicted and measured end-to-end latencies for the optimized policies. In the revised manuscript, we will add a validation subsection presenting end-to-end measurements for representative setups under the solved routing fractions, including quantification of prediction errors and confirmation that the reported quality variation is not affected by misclassifications. revision: yes
-
Referee: [Results] Experimental details supporting the 87% result (results section and tables): The headline variation is stated without reporting the total number of setups searched, the exact quality-scoring procedure, number of trials per setup, error bars, or statistical tests. This information is required to evaluate whether the spread is robust and whether resource allocation is indeed the dominant factor rather than an artifact of the evaluation methodology.
Authors: We agree that these experimental details are necessary to substantiate the robustness of the 87% quality variation. The revised results section will report the total number of setups searched and retained, the precise quality-scoring procedure and aggregation method, the number of trials per setup, error bars on all quality metrics, and statistical tests (such as significance testing across setups) to demonstrate that the observed spread is not an artifact of the evaluation. revision: yes
Circularity Check
No significant circularity; derivation relies on external profiling and independent optimization
full rationale
The paper's central result (up to 87% quality variation across setups) is obtained by profiling latency models for each candidate resource allocation, then solving a constrained optimization over routing fractions to maximize quality subject to the SLO under those models, and finally comparing the optimized qualities. This chain uses measured inputs and a standard optimization procedure rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The latency models are constructed from system profiling data independent of the final routing solution, and the variation metric is computed post-optimization. No equation or step reduces the claimed performance spread to a quantity defined by the paper's own fitted outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- latency model coefficients
axioms (1)
- domain assumption: a model's latency is a function of its allocated GPU resources and the request load induced by the routing policy
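For concreteness, one queueing-style functional form consistent with this assumption (our illustration; the paper does not commit to it) lets per-model latency grow with routed load and fall with allocated GPUs:

```latex
% Illustrative only: g_m = GPUs allocated to model m,
% \rho_m = request rate induced by routing fraction f_m,
% \mu(g_m) = profiled service rate, \ell_0(g_m) = unloaded latency.
L_m(g_m, \rho_m) = \frac{\ell_0(g_m)}{1 - \rho_m / \mu(g_m)},
\qquad \rho_m < \mu(g_m)
```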