Switchcraft: AI Model Router for Agentic Tool Calling
Pith reviewed 2026-05-11 00:59 UTC · model grok-4.3
The pith
Switchcraft routes each tool call to the cheapest model that can handle it correctly, matching the best single model's accuracy at 84% lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Switchcraft is a model router optimized for agentic tool calling that selects the lowest-cost model subject to correctness. A DistilBERT classifier is trained on five function-calling benchmarks and deployed inline under a latency budget. It achieves 82.9 percent accuracy, matching or exceeding the strongest individual model, while reducing inference cost by 84 percent and saving over $3,600 per million queries. The evaluation also shows that larger models do not always outperform smaller ones on tool-use tasks, and that nominally cheaper models can produce higher total cost through token-intensive reasoning.
What carries the argument
A DistilBERT classifier, trained on five function-calling benchmarks, that predicts which models will handle each inline tool-calling query correctly and selects the cheapest of them.
If this is right
- Larger models do not consistently outperform smaller ones on tool-use tasks.
- Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
- Cost-aware agentic AI deployment becomes feasible without sacrificing correctness.
- Savings exceed $3,600 per million queries while accuracy stays at or above 82.9 percent.
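The two headline cost figures can be cross-checked with simple arithmetic, under the assumption (not stated explicitly in the source) that the 84 percent reduction and the $3,600 savings are measured against the same baseline:

```python
# Back-of-envelope consistency check. Assumption (not stated in the source):
# the 84% reduction and the $3,600-per-million-queries savings share one baseline.
savings_per_million = 3600.0   # dollars saved per million queries (a lower bound)
reduction = 0.84               # fractional cost reduction

baseline = savings_per_million / reduction   # implied baseline spend per 1M queries
routed = baseline - savings_per_million      # implied routed spend per 1M queries
print(round(baseline, 2), round(routed, 2))  # ~4285.71 ~685.71
```

Under that assumption the implied baseline is roughly $4,286 per million queries and the routed spend roughly $686; since the reported savings are "over" $3,600, both implied figures are lower bounds.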
Where Pith is reading between the lines
- The same routing logic could be tested on agent tasks that mix tool calls with open-ended reasoning.
- Performance may change if production queries shift in style or complexity from the benchmark set.
- Pairing the router with prompt-length controls could produce additional cost reductions beyond the reported 84 percent.
Load-bearing premise
The classifier trained on the five benchmarks will keep selecting correct low-cost models on new real-world agent queries without large accuracy drops or hidden cost increases.
What would settle it
Measure accuracy and total token cost when Switchcraft routes a fresh collection of diverse, real-world tool-calling queries outside the original five benchmarks.
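Total token cost, the quantity such a test would need to track, is simple per-query accounting; the per-token prices and token counts below are invented to illustrate how a nominally cheaper model can still cost more once token-intensive reasoning is priced in:

```python
# Illustrative only: the prices and token counts are invented, not the paper's.
def query_cost(prompt_tokens, completion_tokens, in_price, out_price):
    """Total dollar cost of one query: prompt and completion tokens priced separately."""
    return prompt_tokens * in_price + completion_tokens * out_price

# A nominally cheaper model that "thinks out loud" for 2,000 completion tokens
cheap = query_cost(500, 2000, in_price=0.10e-6, out_price=0.40e-6)
# A pricier model that answers with a terse 80-token tool call
pricey = query_cost(500, 80, in_price=1.00e-6, out_price=4.00e-6)
print(cheap > pricey)  # the "cheap" model costs more on this query
```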
Original abstract
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Switchcraft, a DistilBERT-based model router for agentic tool calling. It trains a classifier on five function-calling benchmarks to select the lowest-cost model subject to correctness under a latency budget, reporting 82.9% accuracy that matches or exceeds the best individual model while reducing inference cost by 84% (over $3,600 savings per million queries). The work additionally observes that larger models do not consistently outperform smaller ones on tool-use tasks and that nominally cheaper models can incur higher total token cost due to longer reasoning traces.
Significance. If the benchmark results prove robust and reproducible, the approach offers a practical, inline mechanism for cost-efficient agentic AI deployment without correctness loss. The observation that model scale does not reliably predict tool-calling performance is a useful empirical contribution for the field.
major comments (3)
- [Abstract] Abstract: the headline claims of 82.9% accuracy and 84% cost reduction are stated without any details on training/validation data splits, error bars, statistical significance, or the precise protocol used to measure total token cost (including cases where cheaper models produce longer traces). These omissions directly undermine assessment of the central accuracy and savings assertions.
- [Evaluation framework] Evaluation framework: all accuracy and cost figures are measured exclusively on the same five benchmarks used to train the DistilBERT classifier. No results are reported on out-of-distribution agent queries, production logs, or tasks with different tool schemas or reasoning depths, leaving the generalization assumption untested despite the abstract's own note on variable token costs.
- [Methods] Methods: the classifier training procedure, label construction (how 'correct' model selections are defined), cross-validation strategy, and latency-budget deployment mechanics are insufficiently specified to support reproduction or verification of the reported performance numbers.
minor comments (1)
- [Abstract] Abstract: the claim of being 'the first (to the best of our knowledge)' router optimized for agentic tool calling would benefit from a short related-work paragraph contrasting it with existing chat-completion routers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility, add necessary details, and acknowledge limitations.
Point-by-point responses
Referee: [Abstract] Abstract: the headline claims of 82.9% accuracy and 84% cost reduction are stated without any details on training/validation data splits, error bars, statistical significance, or the precise protocol used to measure total token cost (including cases where cheaper models produce longer traces). These omissions directly undermine assessment of the central accuracy and savings assertions.
Authors: We agree that the abstract omits key details. In the revision we will expand the methods and results sections (and update the abstract where space permits) to specify the 80/20 train/test split on the combined benchmarks, report standard deviations and 5-fold cross-validation error bars, include statistical significance tests for the accuracy and cost comparisons, and describe the total token cost protocol that sums prompt plus completion tokens while explicitly accounting for longer reasoning traces produced by smaller models. revision: yes
Referee: [Evaluation framework] Evaluation framework: all accuracy and cost figures are measured exclusively on the same five benchmarks used to train the DistilBERT classifier. No results are reported on out-of-distribution agent queries, production logs, or tasks with different tool schemas or reasoning depths, leaving the generalization assumption untested despite the abstract's own note on variable token costs.
Authors: This is a valid limitation. All current numbers are in-distribution. We will add a dedicated Limitations section that states this explicitly and discusses risks of distribution shift. We will also report a new small-scale experiment on held-out queries with altered tool schemas to provide preliminary generalization evidence. Production logs are unavailable to us, so we cannot add those results. revision: partial
Referee: [Methods] Methods: the classifier training procedure, label construction (how 'correct' model selections are defined), cross-validation strategy, and latency-budget deployment mechanics are insufficiently specified to support reproduction or verification of the reported performance numbers.
Authors: We acknowledge the methods section is underspecified. The revision will detail: label construction (a model is labeled correct only if it emits the exact ground-truth tool call and arguments), DistilBERT fine-tuning hyperparameters and procedure, 5-fold cross-validation protocol with per-fold results, and the latency-budget mechanism (models whose profiled latency exceeds the budget are filtered before cheapest-valid selection). Pseudocode for the full routing logic will be added to an appendix. revision: yes
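The exact-match labeling rule described above can be sketched in a few lines; the function name and sample calls are hypothetical, not drawn from the paper's benchmarks:

```python
# Hypothetical sketch of the exact-match labeling rule from the rebuttal: a
# model earns label 1 only if its emitted call matches the ground truth in
# both function name and arguments. Names and values here are invented.
def label(emitted, ground_truth):
    """1 if the emitted tool call exactly matches the ground truth, else 0."""
    return int(
        emitted["name"] == ground_truth["name"]
        and emitted["args"] == ground_truth["args"]
    )

gt = {"name": "get_weather", "args": {"city": "Vienna", "unit": "celsius"}}
print(label({"name": "get_weather", "args": {"city": "Vienna", "unit": "celsius"}}, gt))  # 1
print(label({"name": "get_weather", "args": {"city": "Vienna"}}, gt))                     # 0
```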
Circularity Check
No circularity: the reported accuracy and cost figures are measured empirically and do not reduce to the router's training inputs.
Full rationale
The paper trains a DistilBERT classifier on data from five function-calling benchmarks and reports measured accuracy (82.9%) plus cost reduction (84%) on the same benchmarks' evaluation splits. These quantities are computed directly from the classifier's output selections versus ground-truth correctness and per-model token costs; they do not reduce by construction to the fitted parameters themselves. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text that would make the headline result equivalent to its training inputs. The pipeline is a standard supervised-learning evaluation and therefore self-contained.
Available tool: calculate_sales_tax(purchase_amount, city, state)
User: "Calculate the amount of sales tax to be added on a purchase amount of $30.45 in…"
Response A (correct):
calculate_sales_tax(purchase_amount=30.45, city="Chicago", state="Illinois")
calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="California")
calculate_sales_tax(purchase_amount=11.23, city="Portland", state="Oregon")
Response B ✓ (also correct: reordered calls with abbreviated parameter values):
calculate_sales_tax(purchase_amount=11.23, city="Portland", state="OR")
calculate_sales_tax(purchase_amount=30.45, city="CHI", state="IL")
calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="CA")
Both responses are correct. Because these three calls are independent (no data flows between them), any ordering is valid. Additionally, semantically equivalent parameter values ("Illinois" vs. "IL", "Chicago" vs. "CHI") are equally acceptable.
Figure 4: Acceptable variation in an agentic…
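A checker implementing the acceptance criteria in the figure (order-insensitive over independent calls, tolerant of semantically equivalent values) might look like the sketch below; the alias table is an assumption, since the source does not say how equivalence between values like "Illinois" and "IL" is decided.

```python
# Sketch of an order- and alias-insensitive correctness check in the spirit of
# Figure 4. The ALIASES table is a stand-in: the source does not specify how
# semantic equivalence between parameter values is defined.
ALIASES = {"IL": "Illinois", "CHI": "Chicago", "CA": "California", "OR": "Oregon"}

def normalize(call):
    """Canonicalize one (name, args) call: resolve aliases, sort arguments."""
    name, args = call
    canon = {k: ALIASES.get(v, v) if isinstance(v, str) else v for k, v in args.items()}
    return (name, tuple(sorted(canon.items())))

def calls_match(response, gold):
    """Independent calls carry no ordering, so compare as unordered sets."""
    return {normalize(c) for c in response} == {normalize(c) for c in gold}

gold = [
    ("calculate_sales_tax", {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}),
    ("calculate_sales_tax", {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}),
    ("calculate_sales_tax", {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}),
]
response_b = [  # reordered and abbreviated, accepted as in the figure
    ("calculate_sales_tax", {"purchase_amount": 11.23, "city": "Portland", "state": "OR"}),
    ("calculate_sales_tax", {"purchase_amount": 30.45, "city": "CHI", "state": "IL"}),
    ("calculate_sales_tax", {"purchase_amount": 52.33, "city": "Sacramento", "state": "CA"}),
]
print(calls_match(response_b, gold))  # True
```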
Routing procedure:
- Apply a threshold (θ = 0.5) to each sigmoid output to determine which models are predicted to be "reliable" for this query.
- Among the models above threshold, select the cheapest one (cost-aware tie-breaking using profiled per-query costs).
- If no model exceeds the threshold, fall back to the model with the highest predicted probability (argmax).
Probability distributions. Figure 10 shows the distribution of correctness probabilities per model. The distributions are strongly bimodal: the vast majority of model–query pairs have probability near 0 (always incorrect) or 1 (always correct), with…
Frozen encoder. Our multi-label router fine-tunes all 66M DistilBERT parameters end-to-end, allowing the encoder to learn task-specific representations for agentic function-calling queries. The MIRT router uses frozen embeddings from a general-purpose pre-trained model, which may not capture the fine-grained distinctions (e.g., JSON structure validity, to…
Vanilla tokenization. The MIRT router uses simple text concatenation rather than our compressed token-packing strategy (Section 3.1), which prioritises the most recent user turn and tool signatures within the 512-token budget. The ablation in Appendix K shows that token packing alone contributes 1.66 pp of accuracy; this accounts for most of the observed gap.
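The routing steps above fit in a few lines of Python; the model names, sigmoid outputs, and profiled per-query costs below are invented for illustration:

```python
# Sketch of the described routing steps. Model names, predicted correctness
# probabilities, and per-query costs are invented, not the paper's.
THETA = 0.5  # reliability threshold applied to each sigmoid output

def route(probs, costs, theta=THETA):
    """Cheapest model predicted reliable; argmax fallback if none clears theta."""
    reliable = [m for m, p in probs.items() if p > theta]
    if reliable:
        # cost-aware selection among reliable models, using profiled costs
        return min(reliable, key=lambda m: costs[m])
    # no model exceeds the threshold: fall back to the highest probability
    return max(probs, key=probs.get)

costs = {"tiny": 0.0003, "mid": 0.0025, "big": 0.0180}
print(route({"tiny": 0.96, "mid": 0.99, "big": 0.98}, costs))  # "tiny": cheapest reliable
print(route({"tiny": 0.10, "mid": 0.45, "big": 0.30}, costs))  # "mid": argmax fallback
```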