Switchcraft: AI Model Router for Agentic Tool Calling
Pith reviewed 2026-05-11 00:59 UTC · model grok-4.3
The pith
Switchcraft routes each tool call to the cheapest model that can handle it correctly, matching the best single model's accuracy at 84% lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Switchcraft is a model router optimized for agentic tool calling that selects the lowest-cost model subject to correctness. A DistilBERT classifier is trained on five function-calling benchmarks and deployed inline under a latency budget. It achieves 82.9 percent accuracy, matching or exceeding the strongest individual model, while reducing inference cost by 84 percent and saving over $3,600 per million queries. The evaluation also shows that larger models do not always outperform smaller ones on tool-use tasks, and that nominally cheaper models can produce higher total cost through token-intensive reasoning.
What carries the argument
A DistilBERT classifier, trained on five function-calling benchmarks, that predicts which models will handle each inline tool-calling query correctly and selects the cheapest of them.
If this is right
- Larger models do not consistently outperform smaller ones on tool-use tasks.
- Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
- Cost-aware agentic AI deployment becomes feasible without sacrificing correctness.
- Savings exceed $3,600 per million queries while accuracy stays at or above 82.9 percent.
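The two headline cost figures can be cross-checked with simple arithmetic, under the assumption (not stated explicitly in the source) that the 84 percent reduction and the $3,600 savings are measured against the same baseline:

```python
# Back-of-envelope consistency check. Assumption (not stated in the source):
# the 84% reduction and the $3,600-per-million-queries savings share one baseline.
savings_per_million = 3600.0   # dollars saved per million queries (a lower bound)
reduction = 0.84               # fractional cost reduction

baseline = savings_per_million / reduction   # implied baseline spend per 1M queries
routed = baseline - savings_per_million      # implied routed spend per 1M queries
print(round(baseline, 2), round(routed, 2))  # ~4285.71 ~685.71
```

Under that assumption the implied baseline is roughly $4,286 per million queries and the routed spend roughly $686; since the reported savings are "over" $3,600, both implied figures are lower bounds.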
Where Pith is reading between the lines
- The same routing logic could be tested on agent tasks that mix tool calls with open-ended reasoning.
- Performance may change if production queries shift in style or complexity from the benchmark set.
- Pairing the router with prompt-length controls could produce additional cost reductions beyond the reported 84 percent.
Load-bearing premise
The classifier trained on the five benchmarks will keep selecting correct low-cost models on new real-world agent queries without large accuracy drops or hidden cost increases.
What would settle it
Measure accuracy and total token cost when Switchcraft routes a fresh collection of diverse, real-world tool-calling queries outside the original five benchmarks.
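Total token cost, the quantity such a test would need to track, is simple per-query accounting; the per-token prices and token counts below are invented to illustrate how a nominally cheaper model can still cost more once token-intensive reasoning is priced in:

```python
# Illustrative only: the prices and token counts are invented, not the paper's.
def query_cost(prompt_tokens, completion_tokens, in_price, out_price):
    """Total dollar cost of one query: prompt and completion tokens priced separately."""
    return prompt_tokens * in_price + completion_tokens * out_price

# A nominally cheaper model that "thinks out loud" for 2,000 completion tokens
cheap = query_cost(500, 2000, in_price=0.10e-6, out_price=0.40e-6)
# A pricier model that answers with a terse 80-token tool call
pricey = query_cost(500, 80, in_price=1.00e-6, out_price=4.00e-6)
print(cheap > pricey)  # the "cheap" model costs more on this query
```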
Original abstract
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Switchcraft, a DistilBERT-based model router for agentic tool calling. It trains a classifier on five function-calling benchmarks to select the lowest-cost model subject to correctness under a latency budget, reporting 82.9% accuracy that matches or exceeds the best individual model while reducing inference cost by 84% (over $3,600 savings per million queries). The work additionally observes that larger models do not consistently outperform smaller ones on tool-use tasks and that nominally cheaper models can incur higher total token cost due to longer reasoning traces.
Significance. If the benchmark results prove robust and reproducible, the approach offers a practical, inline mechanism for cost-efficient agentic AI deployment without correctness loss. The observation that model scale does not reliably predict tool-calling performance is a useful empirical contribution for the field.
major comments (3)
- [Abstract] Abstract: the headline claims of 82.9% accuracy and 84% cost reduction are stated without any details on training/validation data splits, error bars, statistical significance, or the precise protocol used to measure total token cost (including cases where cheaper models produce longer traces). These omissions directly undermine assessment of the central accuracy and savings assertions.
- [Evaluation framework] Evaluation framework: all accuracy and cost figures are measured exclusively on the same five benchmarks used to train the DistilBERT classifier. No results are reported on out-of-distribution agent queries, production logs, or tasks with different tool schemas or reasoning depths, leaving the generalization assumption untested despite the abstract's own note on variable token costs.
- [Methods] Methods: the classifier training procedure, label construction (how 'correct' model selections are defined), cross-validation strategy, and latency-budget deployment mechanics are insufficiently specified to support reproduction or verification of the reported performance numbers.
minor comments (1)
- [Abstract] Abstract: the claim of being 'the first (to the best of our knowledge)' router optimized for agentic tool calling would benefit from a short related-work paragraph contrasting it with existing chat-completion routers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility, add necessary details, and acknowledge limitations.
Point-by-point responses
Referee: [Abstract] Abstract: the headline claims of 82.9% accuracy and 84% cost reduction are stated without any details on training/validation data splits, error bars, statistical significance, or the precise protocol used to measure total token cost (including cases where cheaper models produce longer traces). These omissions directly undermine assessment of the central accuracy and savings assertions.
Authors: We agree that the abstract omits key details. In the revision we will expand the methods and results sections (and update the abstract where space permits) to specify the 80/20 train/test split on the combined benchmarks, report standard deviations and 5-fold cross-validation error bars, include statistical significance tests for the accuracy and cost comparisons, and describe the total token cost protocol that sums prompt plus completion tokens while explicitly accounting for longer reasoning traces produced by smaller models. revision: yes
Referee: [Evaluation framework] Evaluation framework: all accuracy and cost figures are measured exclusively on the same five benchmarks used to train the DistilBERT classifier. No results are reported on out-of-distribution agent queries, production logs, or tasks with different tool schemas or reasoning depths, leaving the generalization assumption untested despite the abstract's own note on variable token costs.
Authors: This is a valid limitation. All current numbers are in-distribution. We will add a dedicated Limitations section that states this explicitly and discusses risks of distribution shift. We will also report a new small-scale experiment on held-out queries with altered tool schemas to provide preliminary generalization evidence. Production logs are unavailable to us, so we cannot add those results. revision: partial
Referee: [Methods] Methods: the classifier training procedure, label construction (how 'correct' model selections are defined), cross-validation strategy, and latency-budget deployment mechanics are insufficiently specified to support reproduction or verification of the reported performance numbers.
Authors: We acknowledge the methods section is underspecified. The revision will detail: label construction (a model is labeled correct only if it emits the exact ground-truth tool call and arguments), DistilBERT fine-tuning hyperparameters and procedure, 5-fold cross-validation protocol with per-fold results, and the latency-budget mechanism (models whose profiled latency exceeds the budget are filtered before cheapest-valid selection). Pseudocode for the full routing logic will be added to an appendix. revision: yes
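The exact-match labeling rule described above can be sketched in a few lines; the function name and sample calls are hypothetical, not drawn from the paper's benchmarks:

```python
# Hypothetical sketch of the exact-match labeling rule from the rebuttal: a
# model earns label 1 only if its emitted call matches the ground truth in
# both function name and arguments. Names and values here are invented.
def label(emitted, ground_truth):
    """1 if the emitted tool call exactly matches the ground truth, else 0."""
    return int(
        emitted["name"] == ground_truth["name"]
        and emitted["args"] == ground_truth["args"]
    )

gt = {"name": "get_weather", "args": {"city": "Vienna", "unit": "celsius"}}
print(label({"name": "get_weather", "args": {"city": "Vienna", "unit": "celsius"}}, gt))  # 1
print(label({"name": "get_weather", "args": {"city": "Vienna"}}, gt))                     # 0
```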
Circularity Check
No circularity: the reported accuracy and cost figures are measured empirically and do not reduce to the router's training inputs.
Full rationale
The paper trains a DistilBERT classifier on data from five function-calling benchmarks and reports measured accuracy (82.9%) plus cost reduction (84%) on the same benchmarks' evaluation splits. These quantities are computed directly from the classifier's output selections versus ground-truth correctness and per-model token costs; they do not reduce by construction to the fitted parameters themselves. No equations, self-citations, uniqueness theorems, or ansatzes appear in the provided text that would make the headline result equivalent to its training inputs. The pipeline is a standard supervised-learning evaluation and therefore self-contained.
Available tool: calculate_sales_tax(purchase_amount, city, state)
User: "Calculate the amount of sales tax to be added on a purchase amount of $30.45 in…"
Response A (correct):
calculate_sales_tax(purchase_amount=30.45, city="Chicago", state="Illinois")
calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="California")
calculate_sales_tax(purchase_amount=11.23, city="Portland", state="Oregon")
Response B ✓ (also correct: reordered calls with abbreviated parameter values):
calculate_sales_tax(purchase_amount=11.23, city="Portland", state="OR")
calculate_sales_tax(purchase_amount=30.45, city="CHI", state="IL")
calculate_sales_tax(purchase_amount=52.33, city="Sacramento", state="CA")
Both responses are correct. Because these three calls are independent (no data flows between them), any ordering is valid. Additionally, semantically equivalent parameter values ("Illinois" vs. "IL", "Chicago" vs. "CHI") are equally acceptable.
Figure 4: Acceptable variation in an agentic…
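A checker implementing the acceptance criteria in the figure (order-insensitive over independent calls, tolerant of semantically equivalent values) might look like the sketch below; the alias table is an assumption, since the source does not say how equivalence between values like "Illinois" and "IL" is decided.

```python
# Sketch of an order- and alias-insensitive correctness check in the spirit of
# Figure 4. The ALIASES table is a stand-in: the source does not specify how
# semantic equivalence between parameter values is defined.
ALIASES = {"IL": "Illinois", "CHI": "Chicago", "CA": "California", "OR": "Oregon"}

def normalize(call):
    """Canonicalize one (name, args) call: resolve aliases, sort arguments."""
    name, args = call
    canon = {k: ALIASES.get(v, v) if isinstance(v, str) else v for k, v in args.items()}
    return (name, tuple(sorted(canon.items())))

def calls_match(response, gold):
    """Independent calls carry no ordering, so compare as unordered sets."""
    return {normalize(c) for c in response} == {normalize(c) for c in gold}

gold = [
    ("calculate_sales_tax", {"purchase_amount": 30.45, "city": "Chicago", "state": "Illinois"}),
    ("calculate_sales_tax", {"purchase_amount": 52.33, "city": "Sacramento", "state": "California"}),
    ("calculate_sales_tax", {"purchase_amount": 11.23, "city": "Portland", "state": "Oregon"}),
]
response_b = [  # reordered and abbreviated, accepted as in the figure
    ("calculate_sales_tax", {"purchase_amount": 11.23, "city": "Portland", "state": "OR"}),
    ("calculate_sales_tax", {"purchase_amount": 30.45, "city": "CHI", "state": "IL"}),
    ("calculate_sales_tax", {"purchase_amount": 52.33, "city": "Sacramento", "state": "CA"}),
]
print(calls_match(response_b, gold))  # True
```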
Routing procedure:
- Apply a threshold (θ = 0.5) to each sigmoid output to determine which models are predicted to be "reliable" for this query.
- Among the models above threshold, select the cheapest one (cost-aware tie-breaking using profiled per-query costs).
- If no model exceeds the threshold, fall back to the model with the highest predicted probability (argmax).
Probability distributions. Figure 10 shows the distribution of correctness probabilities per model. The distributions are strongly bimodal: the vast majority of model–query pairs have probability near 0 (always incorrect) or 1 (always correct), with…
Frozen encoder. Our multi-label router fine-tunes all 66M DistilBERT parameters end-to-end, allowing the encoder to learn task-specific representations for agentic function-calling queries. The MIRT router uses frozen embeddings from a general-purpose pre-trained model, which may not capture the fine-grained distinctions (e.g., JSON structure validity, to…
Vanilla tokenization. The MIRT router uses simple text concatenation rather than our compressed token-packing strategy (Section 3.1), which prioritises the most recent user turn and tool signatures within the 512-token budget. The ablation in Appendix K shows that token packing alone contributes 1.66 pp of accuracy; this accounts for most of the observed gap.
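The routing steps above fit in a few lines of Python; the model names, sigmoid outputs, and profiled per-query costs below are invented for illustration:

```python
# Sketch of the described routing steps. Model names, predicted correctness
# probabilities, and per-query costs are invented, not the paper's.
THETA = 0.5  # reliability threshold applied to each sigmoid output

def route(probs, costs, theta=THETA):
    """Cheapest model predicted reliable; argmax fallback if none clears theta."""
    reliable = [m for m, p in probs.items() if p > theta]
    if reliable:
        # cost-aware selection among reliable models, using profiled costs
        return min(reliable, key=lambda m: costs[m])
    # no model exceeds the threshold: fall back to the highest probability
    return max(probs, key=probs.get)

costs = {"tiny": 0.0003, "mid": 0.0025, "big": 0.0180}
print(route({"tiny": 0.96, "mid": 0.99, "big": 0.98}, costs))  # "tiny": cheapest reliable
print(route({"tiny": 0.10, "mid": 0.45, "big": 0.30}, costs))  # "mid": argmax fallback
```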