The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers

Hanjie Chen; Jiarong Xing; Qiyue Zhang; Shenrun Zhang; Yifan Lu; Zhibo Yu; Zhuang Wang

arxiv: 2606.07587 · v1 · pith:CSQKNOZCnew · submitted 2026-05-27 · 💻 cs.LG


Pith reviewed 2026-06-29 14:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM routing · model selection · predictability bottleneck · routing plateau · query classification · accuracy limits · LLM serving · fine-tuning


The pith

LLM routers reach similar accuracy because they learn global performance trends rather than query-specific signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tests of 21 routing methods across five benchmarks show they converge to nearly identical accuracy well below an oracle router. The shared limit stems from routers capturing only broad patterns in which models perform well on average, instead of the fine details needed to pick the right model for each query. As a result the methods overlap on easy cases but all miss the hard queries that demand instance-level decisions. Larger training sets, stronger query encoders, and end-to-end fine-tuning each raise accuracy past the observed range. This pattern explains why routing has delivered smaller efficiency gains than expected for LLM services.

Core claim

The routing plateau is largely caused by a predictability bottleneck: current routers mainly learn global averaged model-performance trends rather than fine-grained query-specific routing signals. As a result, they solve overlapping easy queries but collectively fail on hard queries that require instance-specific routing decisions.
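
To make the gap between a global-trend router and the oracle concrete, here is a minimal sketch on a toy correctness matrix. The matrix values and the function names (`oracle_accuracy`, `global_best_accuracy`, `routed_accuracy`) are illustrative assumptions, not taken from the paper.

```python
# Toy correctness matrix (made-up values): rows are queries, columns are
# candidate models; 1 means the model answers that query correctly.
CORRECT = [
    [1, 1, 0],  # easy: several models succeed
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],  # hard: only one model succeeds, and not the best on average
    [0, 1, 0],  # hard
    [0, 0, 0],  # no model succeeds
]


def oracle_accuracy(correct):
    """Fraction of queries that at least one model answers correctly."""
    return sum(1 for row in correct if any(row)) / len(correct)


def global_best_accuracy(correct):
    """Index and accuracy of the single model with the best average."""
    n_models = len(correct[0])
    totals = [sum(row[m] for row in correct) for m in range(n_models)]
    best = max(range(n_models), key=lambda m: totals[m])
    return best, totals[best] / len(correct)


def routed_accuracy(correct, choices):
    """Accuracy when query i is sent to model choices[i]."""
    return sum(row[c] for row, c in zip(correct, choices)) / len(correct)


if __name__ == "__main__":
    print("oracle:", oracle_accuracy(CORRECT))
    print("global best (model, accuracy):", global_best_accuracy(CORRECT))
    print("per-query routing:", routed_accuracy(CORRECT, [0, 0, 0, 2, 1, 0]))
```

In this toy setting, a router that only tracks which model is best on average cannot exceed the single-model accuracy. Closing the gap to the oracle requires the right per-query choice on the rows marked hard.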

What carries the argument

The predictability bottleneck: routers capture only averaged model-performance trends across queries instead of instance-specific signals.

If this is right

Larger training datasets raise routing accuracy beyond the plateau.
Stronger encoders raise routing accuracy beyond the plateau.
End-to-end fine-tuning raises routing accuracy beyond the plateau.
Current routers collectively fail on hard queries that need instance-specific decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Routers could improve by adding explicit signals for query difficulty to target the cases where global trends are insufficient.
Training on more diverse or synthetic hard queries might reduce the overlap in solved cases across methods.
In practice the plateau implies that routing will continue to yield only modest cost savings on complex or varied query streams unless the bottleneck is directly addressed.

Load-bearing premise

That similar performance across the 21 methods on the five benchmarks reflects a fundamental limit in predictable signals rather than shared benchmark or implementation constraints.

What would settle it

A router using explicit query-specific performance features that reaches accuracy clearly above the narrow range observed for the 21 methods on the same benchmarks would falsify the bottleneck explanation.

Figures

Figures reproduced from arXiv: 2606.07587 by Hanjie Chen, Jiarong Xing, Qiyue Zhang, Shenrun Zhang, Yifan Lu, Zhibo Yu, Zhuang Wang.

**Figure 1.** Routing accuracy vs. normalized inference cost. All routers from heterogeneous method families collapse …

**Figure 2.** Per-experiment routing accuracy at λ=0. Each point is one (method, encoder, architecture, seed) cell. The red band marks the top-decile plateau; the green dashed line marks oracle accuracy.

**Figure 3.** Test-accuracy standard deviation at λ=0, computed per method and benchmark. Each blue dot is one routing method; black ticks show the mean across methods. Most values are below 0.5%. … each method’s λ=0 probabilities to select routes post hoc for each query. The same near-tie pattern persists: at λ=0.5, the top-5 methods differ by only ∼0.2%, which is below the seed-noise envelope ε_noise. The identity of th…

**Figure 4.** True per-LLM correctness mass (left) vs. routers’ …

**Figure 5.** Per-method gain in test routing accuracy …

**Figure 6.** RouterBench: selection frequency on hard vs. all queries for nine additional routers.

**Figure 7.** CARROT/SPROUT: selection frequency on hard vs. all queries for nine additional routers.

**Figure 8.** R2-Bench: selection frequency on hard vs. all queries for nine additional routers.

**Figure 9.** Nine-by-30k: selection frequency on hard vs. all queries for nine additional routers.
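
The captions refer to a parameter λ without defining it. A common convention in cost-aware routing, assumed here rather than confirmed by the paper, is to score each model by its predicted success probability minus λ times its cost, so that λ=0 selects purely on predicted accuracy. The function name `select_route` and all numbers below are hypothetical.

```python
def select_route(success_probs, costs, lam):
    """Pick the model index that maximizes p_m - lam * cost_m.

    Assumed scoring rule for illustration; the paper may define the
    trade-off differently.
    """
    scores = [p - lam * c for p, c in zip(success_probs, costs)]
    return max(range(len(scores)), key=lambda m: scores[m])


if __name__ == "__main__":
    probs = [0.60, 0.70, 0.75]   # router's predicted success per model
    costs = [0.10, 0.50, 1.00]   # normalized inference cost per model
    print(select_route(probs, costs, lam=0.0))  # accuracy only
    print(select_route(probs, costs, lam=0.5))  # cost-sensitive
```

Under this assumed rule, one set of predicted probabilities can be reused post hoc at several values of λ without retraining the router.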

Original abstract

LLM routing has become a popular approach to improve the cost-quality trade-off of LLM services by dynamically selecting a model for each query. Recent work has explored a broad range of routing methods, including clustering-based routers, learned classifiers, pairwise ranking, and confidence-based approaches. Our extensive study of 21 routing methods across five benchmarks reveals a consistent phenomenon that we call the routing plateau: many methods, including kNN, achieve very similar accuracy and converge to a narrow performance range that remains far below the oracle router. Our investigation shows that the plateau is largely caused by a predictability bottleneck: current routers mainly learn global averaged model-performance trends rather than fine-grained query-specific routing signals. As a result, they solve overlapping easy queries but collectively fail on hard queries that require instance-specific routing decisions. We further study how to move beyond the plateau and find that larger training datasets, stronger encoders, and end-to-end fine-tuning can further improve routing accuracy. These findings characterize the common limits of current routing methods and provide insights and actionable directions for the community to build more effective routing systems.
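
The abstract names kNN as one of the methods that lands inside the plateau. The sketch below shows one simple way such a router could work: find the k training queries closest to the new query in embedding space, average each model's correctness over those neighbors, and pick the model with the highest estimate. The embeddings, labels, and the function `knn_route` are toy assumptions, not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_route(query_vec, train_vecs, train_correct, k=3):
    """Route to the model with the best mean correctness among k neighbors."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    n_models = len(train_correct[0])
    estimates = [
        sum(train_correct[i][m] for i in neighbors) / len(neighbors)
        for m in range(n_models)
    ]
    best = max(range(n_models), key=lambda m: estimates[m])
    return best, estimates


if __name__ == "__main__":
    # Toy 2-d "embeddings" and per-model correctness labels (made up).
    train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    train_correct = [[1, 0], [1, 0], [0, 1], [0, 1]]
    print(knn_route([1.0, 0.05], train_vecs, train_correct, k=2))
    print(knn_route([0.05, 1.0], train_vecs, train_correct, k=2))
```

Because the estimate is an average over neighbors, a router of this kind tends to reproduce regional performance trends. If a hard query sits among neighbors where a different model usually wins, the average will point away from the model that query actually needs.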

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper documents a consistent performance plateau across 21 routers on five benchmarks and links it to routers learning global model averages rather than query-specific signals.


The core observation is that diverse routing methods, from simple kNN to learned classifiers and rankers, land in a narrow accuracy band well below the oracle on the tested benchmarks. The authors frame this as a predictability bottleneck where routers pick up broad trends in which models tend to do well overall instead of fine-grained signals about individual queries.

The scale of the comparison is the main strength. Running 21 methods across five benchmarks gives a clear picture of how current approaches overlap on easy cases and miss the hard ones together. The practical suggestions (scaling up training data, using stronger encoders, and trying end-to-end fine-tuning) are concrete and testable without requiring new theory.

The causal claim is the softer part. The stress-test note is fair: if the benchmarks have queries where model performances are inherently correlated, the similarity across routers could appear even if instance-level signals exist. The abstract does not describe controls that would show router decisions tracking average accuracies more than query embeddings after accounting for benchmark statistics. That leaves the bottleneck explanation as a reasonable hypothesis rather than a fully isolated result.

This work is aimed at people who deploy or tune LLM routers in production and want a realistic map of where incremental method tweaks stop helping. The empirical breadth makes it worth referee time even if the interpretation needs more evidence to hold up. I would send it for review.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical study of 21 LLM routing methods across five benchmarks, documenting a consistent 'routing plateau' in which diverse approaches (kNN, classifiers, rankers, etc.) converge to similar accuracy levels well below the oracle router. The central claim is that this plateau arises from a predictability bottleneck: routers primarily capture global averaged model-performance trends rather than fine-grained, query-specific signals, causing them to solve overlapping easy queries while failing on hard instances. The authors further examine mitigations, reporting gains from larger training sets, stronger encoders, and end-to-end fine-tuning.

Significance. If substantiated, the work usefully characterizes shared limits of current routing techniques and supplies concrete directions for improvement. The breadth of the comparison (21 methods) is a positive feature that highlights the robustness of the observed plateau. The empirical nature of the study, however, requires tighter statistical controls and direct evidence for the proposed causal mechanism to elevate its impact.

major comments (2)

[Abstract / investigation section] Abstract and the investigation of the plateau: the attribution of the narrow performance band to a predictability bottleneck rests on the convergence of 21 methods, yet the manuscript provides no direct analysis showing that router decisions correlate more strongly with per-model average accuracy than with query embeddings or other instance-level features after controlling for benchmark statistics. Without this separation, the causal link remains under-supported and could instead reflect correlated model accuracies inherent to the five benchmarks.
[Results section] Results describing the 21 methods and five benchmarks: the abstract and main results report consistent performance across methods but supply no details on statistical significance testing, error bars, data splits, or controls for implementation variance. This omission weakens the claim that the plateau is a fundamental phenomenon rather than an artifact of experimental setup.
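
A crude first step toward the separation requested in the first major comment could look like the sketch below. It measures how often a router simply picks the model with the best average accuracy, and how it fares on queries where that model fails but another succeeds. This is a simplified proxy with made-up data and a hypothetical function name (`global_trend_diagnostic`), not the controlled analysis the comment asks for.

```python
def global_trend_diagnostic(choices, correct):
    """Summarize how closely router choices follow the global-best model.

    choices[i] is the model index the router picked for query i;
    correct[i][m] is 1 if model m answers query i correctly.
    """
    n_models = len(correct[0])
    totals = [sum(row[m] for row in correct) for m in range(n_models)]
    best = max(range(n_models), key=lambda m: totals[m])

    # Share of decisions that coincide with the global-best model.
    agreement = sum(1 for c in choices if c == best) / len(choices)

    # "Hard" here means: the global-best model fails, yet some model succeeds.
    hard = [i for i, row in enumerate(correct) if row[best] == 0 and any(row)]
    hard_accuracy = (
        sum(correct[i][choices[i]] for i in hard) / len(hard) if hard else None
    )
    return {
        "global_best": best,
        "agreement": agreement,
        "n_hard": len(hard),
        "hard_accuracy": hard_accuracy,
    }


if __name__ == "__main__":
    toy_correct = [
        [1, 1, 0], [1, 0, 0], [1, 1, 1],
        [0, 0, 1], [0, 1, 0], [0, 0, 0],
    ]
    toy_choices = [0, 0, 0, 0, 1, 0]  # mostly follows model 0
    print(global_trend_diagnostic(toy_choices, toy_correct))
```

High agreement combined with low accuracy on the hard subset would be consistent with the bottleneck reading. It would not rule out benchmark-induced correlation between models, which still needs the controls the referee describes.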

minor comments (1)

[Benchmark description] The manuscript would benefit from explicit discussion of how the five benchmarks were chosen and whether their query distributions may induce correlated model performances by construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the causal evidence and experimental rigor, which we will address in revision. Below we respond point-by-point to the major comments.


Referee: [Abstract / investigation section] Abstract and the investigation of the plateau: the attribution of the narrow performance band to a predictability bottleneck rests on the convergence of 21 methods, yet the manuscript provides no direct analysis showing that router decisions correlate more strongly with per-model average accuracy than with query embeddings or other instance-level features after controlling for benchmark statistics. Without this separation, the causal link remains under-supported and could instead reflect correlated model accuracies inherent to the five benchmarks.

Authors: We agree that a direct correlation analysis would provide stronger support for the proposed mechanism. While the convergence of 21 diverse methods (classifiers, rankers, kNN, etc.) already suggests the bottleneck is not an artifact of any single approach, we will add in the revised manuscript an explicit analysis comparing router decision correlations with global per-model accuracy averages versus instance-level features (e.g., query embeddings), after controlling for benchmark-level statistics. This will help separate the predictability bottleneck from benchmark-inherent correlations. revision: yes
Referee: [Results section] Results describing the 21 methods and five benchmarks: the abstract and main results report consistent performance across methods but supply no details on statistical significance testing, error bars, data splits, or controls for implementation variance. This omission weakens the claim that the plateau is a fundamental phenomenon rather than an artifact of experimental setup.

Authors: We acknowledge the importance of these details for establishing robustness. In the revised version we will expand the results section to include: (i) error bars and standard deviations from multiple independent runs with different random seeds, (ii) statistical significance tests (e.g., paired t-tests across methods), (iii) explicit descriptions of train/test splits and benchmark preprocessing, and (iv) controls for implementation variance such as hyperparameter sensitivity checks. These additions will better substantiate that the plateau is not an experimental artifact. revision: yes
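
The seed-level checks promised in items (i) and (ii) can be sketched with the standard library alone. The accuracies below are invented for illustration, and `paired_t_statistic` is a hypothetical helper that computes the usual paired t statistic over per-seed differences.

```python
import math
import statistics


def seed_noise(accuracies):
    """Sample standard deviation of one method's accuracy across seeds."""
    return statistics.stdev(accuracies)


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic for two methods evaluated on the same seeds."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    if sd_diff == 0:
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Invented per-seed test accuracies for two routing methods.
    method_a = [0.70, 0.72, 0.71, 0.69, 0.73]
    method_b = [0.69, 0.72, 0.70, 0.70, 0.72]
    print("seed std A:", seed_noise(method_a))
    print("seed std B:", seed_noise(method_b))
    print("paired t:", paired_t_statistic(method_a, method_b))
```

With five seeds the statistic has four degrees of freedom, so the two-sided 5% critical value is about 2.776. A gap between two methods whose statistic falls well below that cannot be distinguished from seed noise at this sample size.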

Circularity Check

0 steps flagged

Empirical comparison study with no derivations or self-referential reductions


The paper reports direct measurements of routing accuracy across 21 methods on five benchmarks, observing convergence to a narrow performance band. The central claim attributes this to a predictability bottleneck (routers learning global trends over instance-specific signals) as an interpretive inference from those measurements. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the similarity result is not constructed from the paper's own inputs by definition. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical benchmarking study. No free parameters, invented entities, or non-standard axioms are introduced in the abstract; the analysis rests on standard assumptions that the chosen benchmarks and methods are representative.

axioms (1)

Domain assumption: The five benchmarks are representative of real-world query distributions for LLM routing.
The plateau conclusion depends on these benchmarks capturing the relevant hard queries.



Reference graph

Works this paper leans on

80 extracted references · 42 canonical work pages · 17 internal anchors

[1]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. InNAACL-HLT, 2019

2019
[2]

On the Cross-lingual Transferability of Monolingual Representations

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the Cross-lingual Transferability of Monolingual Representations. InACL, 2020

2020
[3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models, 2021. URLhttps://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2018. URL https: //arxiv.org/abs/1611.09268

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Semantic Parsing on Freebase from Question- Answer Pairs

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question- Answer Pairs. InEMNLP, 2013

2013
[6]

No Answer Needed: Predicting LLM Answer Accuracy from QQuestion-Only Linear Probes.arXiv preprint arXiv:2509.10625, 2025

Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, and Lorenzo Pacchiardi. No Answer Needed: Predicting LLM Answer Accuracy from QQuestion-Only Linear Probes.arXiv preprint arXiv:2509.10625, 2025

work page arXiv 2025
[7]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating Large Language Models Trained on Code, 2021. URLhttps://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Kwok, and Yu Zhang

Shuhao Chen, Weisen Jiang, Baijiong Lin, James T. Kwok, and Yu Zhang. RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models, 2024. URL https://arxiv.org/abs/ 2409.19886

work page arXiv 2024
[9]

FinQA: A Dataset of Numerical Reasoning over Financial Data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. FinQA: A Dataset of Numerical Reasoning over Financial Data. InEMNLP, 2021

2021
[10]

QuAC: Question Answering in Context

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question Answering in Context. InEMNLP, 2018

2018
[11]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. InNAACL-HLT, 2019. 9 The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM RoutersA PREPRINT

2019
[12]

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.Transactions of the Association for Computational Linguistics, 8:454–470, 2020

Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.Transactions of the Association for Computational Linguistics, 8:454–470, 2020

2020
[13]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018. URL https://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, 2021. URLhttps://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V . S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, 2024. URL https://arxiv.org/abs/2404.14618

work page arXiv 2024
[16]

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, V olkan Cirik, and Kyunghyun Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine, 2017. URL https://arxiv.org/abs/ 1704.05179

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

GraphRouter: A Graph-based Router for LLM Selections, 2025

Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A Graph-based Router for LLM Selections, 2025. URL https://arxiv.org/abs/2410.03834

work page arXiv 2025
[18]

Angelopoulos, and Ion Stoica

Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, and Ion Stoica. Prompt-to-Leaderboard, 2025. URLhttps://arxiv.org/abs/2502.14855

work page arXiv 2025
[19]

Aligning AI With Shared Human Values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI With Shared Human Values. InInternational Conference on Learning Representations (ICLR), 2021

2021
[20]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. InInternational Conference on Learning Representations (ICLR), 2021

2021
[21]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. InNeurIPS Datasets and Benchmarks Track, 2021

2021
[22]

RouterBench: A Benchmark for Multi-LLM Routing System

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A Benchmark for Multi-LLM Routing System, 2024. URL https://arxiv.org/abs/2403.12031

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, 2024. URLhttps://arxiv.org/abs/2403.07974

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences, 11(14), 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences, 11(14), 2021

2021
[25]

Universal Model Routing for Efficient LLM Inference, 2025

Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, and Sanjiv Kumar. Universal Model Routing for Efficient LLM Inference, 2025. URLhttps://arxiv.org/abs/2502.08773

work page arXiv 2025
[26]

QASC: A Dataset for Question Answering via Sentence Composition

Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. QASC: A Dataset for Question Answering via Sentence Composition. InAAAI, 2020

2020
[27]

The NarrativeQA Reading Comprehension Challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018

Tomáš Koˇciský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Ed- ward Grefenstette. The NarrativeQA Reading Comprehension Challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018

2018
[28]

Natural Questions: A Benchmark for Question Answering Research.Transactions of the Association for Computational Linguistics, 7:452–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A Benchmark for Question Answering Research.Transacti...

2019
[29]

When Routing Collapses: On the Degenerate Convergence of LLM Routers, 2026

Guannan Lai and Han-Jia Ye. When Routing Collapses: On the Degenerate Convergence of LLM Routers, 2026. URLhttps://arxiv.org/abs/2602.03478

work page arXiv 2026
[30]

MLQA: Evaluating Cross-lingual Extractive Question Answering

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. MLQA: Evaluating Cross-lingual Extractive Question Answering. InACL, 2020. 10 The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM RoutersA PREPRINT

2020
[31]

LLMRouterBench: A massive benchmark and unified framework for llm routing.arXiv preprint arXiv:2601.07206, 2026

Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, and Shuyue Hu. LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing, 2026. URLhttps://arxiv.org/abs/2601.07206

work page arXiv 2026
[32]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, WWW ’10, page 661–670. ACM, April 2010. doi: 10.1145/1772690.1772758. URL http://dx.doi.org/10.1145/1772690. 1772758

work page doi:10.1145/1772690.1772758 2010
[33]

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers,

Yang Li. Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers,
[34]

URLhttps://arxiv.org/abs/2505.12601

work page internal anchor Pith review Pith/arXiv arXiv
[35]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. InACL, 2022

2022
[36]

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. InACL, 2017

2017
[37]

OptLLM: Optimal Assignment of Queries to Large Language Models, 2024

Yueyue Liu, Hongyu Zhang, Yuantian Miao, Van-Hoang Le, and Zhiqiang Li. OptLLM: Optimal Assignment of Queries to Large Language Models, 2024. URLhttps://arxiv.org/abs/2405.15130

work page arXiv 2024
[38]

Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, 2023

Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models, 2023. URL https://arxiv.org/abs/ 2311.08692

work page arXiv 2023
[39]

RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers, 2025

Yifan Lu, Rixin Liu, Jiayi Yuan, Xingqi Cui, Shenrun Zhang, Hongyi Liu, and Jiarong Xing. RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers, 2025. URL https://arxiv.org/abs/2510. 00202

2025
[40]

A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. InACL, 2020

2020
[41]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. InEMNLP, 2018

2018
[42]

RoRF: Routing on Random Forests

Not Diamond. RoRF: Routing on Random Forests. https://github.com/Not-Diamond/RoRF, 2025. Ac- cessed: 2026-04-25

2025
[43]

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to Route LLMs with Preference Data, 2025. URL https: //arxiv.org/abs/2406.18665

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. InConference on Health, Inference, and Learning (CHIL), 2022

2022
[45]

Are NLP Models really able to Solve Simple Math Word Problems? InNAACL-HLT, 2021

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP Models really able to Solve Simple Math Word Problems? InNAACL-HLT, 2021

2021
[46]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 Technical Report, 2024. URLhttps://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. InEMNLP, 2016

2016
[48]

CoQA: A Conversational Question Answering Challenge

Siva Reddy, Danqi Chen, and Christopher D Manning. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019

2019
[49]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Quizbowl: The Case for Incremental Question Answering, 2019

Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. Quizbowl: The Case for Incremental Question Answering, 2019. URLhttps://arxiv.org/abs/1904.04792

work page arXiv 2019
[51]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. InAAAI, 2020

2020
[52]

Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling

Marija Sakota, Maxime Peyrard, and Robert West. Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling. InProceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM ’24, page 606–615. ACM, March 2024. doi: 10.1145/3616855.3635825. URL http: //dx.doi.org/10.1145/3616855.3635825. 11 The Routing Plateau: Unde...

work page doi:10.1145/3616855.3635825 2024
[53]

Social IQa: Commonsense Reasoning about Social Interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense Reasoning about Social Interactions. InEMNLP-IJCNLP, 2019

2019
[54]

Carrot: A cost aware rate optimal router, 2025

Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira, Prattyush Mangal, Mírian Silva, Onkar Bhardwaj, Mikhail Yurochkin, and Subha Maity. CARROT: A Cost Aware Rate Optimal Router, 2025. URL https://arxiv.org/abs/2502.03261

work page arXiv 2025
[55]

MPNet: Masked and Permuted Pre-training for Language Understanding, 2020

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and Permuted Pre-training for Language Understanding, 2020. URLhttps://arxiv.org/abs/2004.09297

work page arXiv 2020
[56]

IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory

Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, and Runze Wu. IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory, 2025. URL https://arxiv.org/abs/2506.01048

2025
[57]

CP-Router: An Uncertainty-Aware Router Between LLM and LRM

Jiayuan Su, Fulin Lin, Zhaopeng Feng, Han Zheng, Teng Wang, Zhenyu Xiao, Xinlong Zhao, Zuozhu Liu, Lu Cheng, and Hongwei Wang. CP-Router: An Uncertainty-Aware Router Between LLM and LRM, 2025. URL https://arxiv.org/abs/2505.19970

2025
[58]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In NAACL-HLT, 2019

2019
[59]

NewsQA: A Machine Comprehension Dataset

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A Machine Comprehension Dataset. In 2nd Workshop on Representation Learning for NLP (RepL4NLP), 2017

2017
[60]

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(138), 2015

2015
[61]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In NeurIPS, 2019

2019
[62]

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, 2020. URL https://arxiv.org/abs/2002.10957

2020
[63]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, 2024. URL https://arxiv.org/abs/2406.01574

2024
[64]

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, 20...

2024
[65]

Crowdsourcing Multiple Choice Science Questions

Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing Multiple Choice Science Questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 2017

2017
[66]

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics, 6:287–302, 2018

2018
[67]

MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees

Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, and Hans-Arno Jacobsen. MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees, 2025. URL https://arxiv.org/abs/2505.19947

2025
[68]

Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, and Mohit Bansal. Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns. arXiv preprint arXiv:2509.24988, 2025

2025
[69]

C-Pack: Packed Resources For General Chinese Embeddings

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack: Packed Resources For General Chinese Embeddings, 2023. URL https://arxiv.org/abs/2309.07597

2023
[70]

R2-Router: A New Paradigm for LLM Routing with Reasoning

Jiaqi Xue, Qian Lou, Jiarong Xing, and Heng Huang. R2-Router: A New Paradigm for LLM Routing with Reasoning, 2026. URL https://arxiv.org/abs/2602.02823

2026
[71]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP, 2018.

2018
[72]

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, et al. ChatMusician: Understanding and Generating Music Intrinsically with LLM, 2024. URL https://arxiv.org/abs/2402.16153

2024
[73]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish Your Sentence? In ACL, 2019

2019
[74]

Model Spider: Learning to Rank Pre-Trained Models Efficiently

Yi-Kai Zhang, Ting-Ji Huang, Yao-Xiang Ding, De-Chuan Zhan, and Han-Jia Ye. Model Spider: Learning to Rank Pre-Trained Models Efficiently, 2023. URL https://arxiv.org/abs/2306.03900

2023
[75]

Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, and Shuyue Hu. Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing. In Proceedings of the 2025 The Seventh International Conference on Distributed Artificial Intelligence, DAI ’25, page 122–129. ACM, November 2025. doi: 10.1145/3772429.3772445. URL http...

2025
[76]

Eagle: Efficient Training-Free Router for Multi-LLM Inference

Zesen Zhao, Shuowei Jin, and Z. Morley Mao. Eagle: Efficient Training-Free Router for Multi-LLM Inference. URL https://arxiv.org/abs/2409.15518
[78]

When does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset

Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. When does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In ICAIL, 2021

2021
[79]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, 2023. URL https://arxiv.org/abs/2304.06364

2023
[80]

EmbedLLM: Learning Compact Representations of Large Language Models

Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, and Kannan Ramchandran. EmbedLLM: Learning Compact Representations of Large Language Models, 2024. URL https://arxiv.org/abs/2410.02223.

2024

A Broader Impact

LLM routing is a cost-quality optimizati...
