pith. machine review for the scientific record.

arxiv: 2403.12031 · v2 · submitted 2024-03-18 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

RouterBench: A Benchmark for Multi-LLM Routing System


Pith reviewed 2026-05-16 10:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords LLM routing · benchmark · multi-LLM systems · evaluation framework · routing dataset · inference outcomes · cost-performance trade-off · model selection

The pith

RouterBench supplies a benchmark and over 405k inference results to evaluate systems that route queries across multiple LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RouterBench as an evaluation framework to measure how well routing systems select the right LLM for a given input while balancing quality and cost. It releases a dataset of more than 405,000 inference outcomes from representative models to enable consistent testing and development of routing strategies. The work also outlines a theoretical framework for routing and compares several existing approaches inside the new benchmark. Without such a standard, progress on hybrid LLM serving has been hard to track. The authors argue this setup will support more economical and capable deployments by letting routers exploit the complementary strengths of different models.
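The value of a static outcome table is that routers can be evaluated offline by lookup rather than by re-querying models. A minimal sketch of this idea, assuming a simplified schema (prompt id, model, correctness, cost) that is illustrative only, not RouterBench's actual data format:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One precomputed inference result (hypothetical schema)."""
    prompt_id: str
    model: str
    correct: bool   # quality signal recorded at collection time
    cost: float     # dollar cost of this single inference


def evaluate_router(route, outcomes):
    """Score route(prompt_id) -> model name by table lookup, without re-running any model."""
    table = {(o.prompt_id, o.model): o for o in outcomes}
    prompts = sorted({o.prompt_id for o in outcomes})
    picked = [table[(p, route(p))] for p in prompts]
    accuracy = sum(o.correct for o in picked) / len(picked)
    total_cost = sum(o.cost for o in picked)
    return accuracy, total_cost


outcomes = [
    Outcome("q1", "small", True, 0.001), Outcome("q1", "large", True, 0.02),
    Outcome("q2", "small", False, 0.001), Outcome("q2", "large", True, 0.02),
]
acc, cost = evaluate_router(lambda p: "small", outcomes)  # acc 0.5, cost 0.002
```

Because every router sees the same frozen table, differences in accuracy and cost are attributable to the routing policy alone.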

Core claim

RouterBench is a novel evaluation framework together with a dataset of over 405k inference outcomes from representative LLMs that allows systematic assessment of LLM routing systems. The authors further supply a theoretical framework for routing and deliver a comparative analysis of various routing approaches, highlighting their potential and limitations.

What carries the argument

The RouterBench evaluation framework and its accompanying dataset of inference outcomes, which together standardize measurement of routing decisions across tasks.

If this is right

  • Routing algorithms can now be compared under identical conditions and metrics.
  • Researchers can train and validate new routers directly on the released inference outcomes.
  • Production systems can adopt routers that demonstrably improve performance per dollar.
  • The theoretical framework supplies a common language for designing and analyzing future routers.
  • The benchmark establishes a baseline that later papers can use to quantify incremental gains.
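The first bullet can be made concrete: once every (prompt, model) outcome is fixed, any policy, including the oracle upper bound, is scored on exactly the same table. A sketch under an assumed toy schema, not RouterBench's real data or API:

```python
# outcomes[prompt][model] = (correct, cost); toy numbers, illustrative only.
outcomes = {
    "q1": {"small": (True, 0.001), "large": (True, 0.02)},
    "q2": {"small": (False, 0.001), "large": (True, 0.02)},
    "q3": {"small": (False, 0.001), "large": (False, 0.02)},
}


def score(policy):
    """Accuracy and total cost of a routing policy over the shared table."""
    picks = [outcomes[p][policy(p, outcomes[p])] for p in outcomes]
    accuracy = sum(correct for correct, _ in picks) / len(picks)
    total_cost = round(sum(cost for _, cost in picks), 6)
    return accuracy, total_cost


def oracle(prompt, row):
    """Cheapest correct model if one exists, else the cheapest model overall."""
    correct_models = [m for m, (ok, _) in row.items() if ok]
    pool = correct_models or list(row)
    return min(pool, key=lambda m: row[m][1])


print(score(lambda p, row: "small"))  # accuracy 1/3, cost 0.003
print(score(lambda p, row: "large"))  # accuracy 2/3, cost 0.06
print(score(oracle))                  # accuracy 2/3, cost 0.022
```

In this toy table the oracle matches the large model's accuracy at roughly a third of its cost, which is the kind of gap a learned router tries to close.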

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Industry teams may begin treating routing as a first-class component rather than an afterthought in LLM serving stacks.
  • The dataset could be extended with newer models or multi-modal tasks to keep the benchmark relevant over time.
  • Routing research might shift from hand-crafted heuristics toward learned policies trained on the provided outcomes.
  • Adoption of such benchmarks could reduce redundant experimentation across different research groups.

Load-bearing premise

The selected tasks, models, and recorded outcomes sufficiently represent real-world usage patterns and future models so that results on the benchmark generalize.

What would settle it

A routing method that achieves strong results on RouterBench yet produces worse accuracy-cost trade-offs when deployed on a fresh collection of production tasks or newer LLMs would falsify the benchmark's claimed utility.
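That falsification test is operational: run the same set of routers on the benchmark and on a fresh task suite, then check whether the two rankings agree. A hedged sketch with made-up scores and a textbook Spearman formula (no tied scores assumed):

```python
def spearman(xs, ys):
    """Spearman rank correlation via the d^2 formula, assuming no ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical accuracy-per-dollar scores for four routers.
benchmark_scores = [0.81, 0.74, 0.69, 0.60]  # measured on RouterBench
fresh_scores = [0.52, 0.70, 0.66, 0.40]      # measured on fresh production tasks
rho = spearman(benchmark_scores, fresh_scores)  # 0.4 here: weak rank transfer
```

A rho near 1 would support the benchmark's claimed utility; a low or negative rho on genuinely fresh tasks would be the falsifying evidence described above.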

read the original abstract

As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through RouterBench, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at https://github.com/withmartian/routerbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RouterBench, a benchmark and evaluation framework for multi-LLM routing systems, releases a dataset of over 405k inference outcomes from representative LLMs, proposes a theoretical framework for routing, and provides a comparative analysis of routing approaches to support development of cost-effective LLM serving strategies.

Significance. If the dataset collection methodology and representativeness claims hold after detailed validation, RouterBench could establish a much-needed standard for evaluating LLM routers, accelerating research on hybrid model serving that balances accuracy and cost. The public release of code and data at the provided GitHub link is a clear strength for reproducibility.

major comments (2)
  1. [Abstract and Dataset Description] The central claim that the 405k-outcome dataset supports development of general routing strategies rests on unstated assumptions about task/model selection and outcome distributions; no methodology, validation steps, or statistical controls for representativeness are described, preventing assessment of whether the benchmark generalizes beyond the snapshot of current LLMs.
  2. [Evaluation Framework and Comparative Analysis] No experiments test robustness to post-cutoff models or shifted task distributions, which directly undermines the claim that RouterBench will remain useful for future routing strategies; the comparative results may overfit to the fixed accuracy/cost profiles in this static collection.
minor comments (2)
  1. [Theoretical Framework] The theoretical framework section uses several routing-specific terms without explicit definitions or references to prior work on multi-model selection, which could reduce accessibility.
  2. [Results] Figure captions and axis labels in the results plots should explicitly state the number of models and tasks included to allow readers to assess scale.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript introducing RouterBench. We appreciate the referee's recognition of the benchmark's potential value and the public release of code and data. We address each major comment below, outlining planned revisions where appropriate to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and Dataset Description] The central claim that the 405k-outcome dataset supports development of general routing strategies rests on unstated assumptions about task/model selection and outcome distributions; no methodology, validation steps, or statistical controls for representativeness are described, preventing assessment of whether the benchmark generalizes beyond the snapshot of current LLMs.

    Authors: We agree that the dataset section would benefit from greater explicitness on these points. In the revised manuscript, we will expand the dataset description with a new subsection detailing the task and model selection criteria, the inference sampling procedure, outcome distributions, and any statistical validation or controls applied to support representativeness claims. This will enable readers to more rigorously assess generalizability beyond the current snapshot. revision: yes

  2. Referee: [Evaluation Framework and Comparative Analysis] No experiments test robustness to post-cutoff models or shifted task distributions, which directly undermines the claim that RouterBench will remain useful for future routing strategies; the comparative results may overfit to the fixed accuracy/cost profiles in this static collection.

    Authors: We will add experiments simulating shifted task distributions (e.g., via subset cross-validation and controlled perturbations of the existing data) to the evaluation section to demonstrate robustness of the comparative results. For post-cutoff models, direct experiments are not feasible as such models do not yet exist; we will add an explicit limitations discussion noting this and recommending periodic benchmark updates as new models become available, thereby clarifying the scope of current claims about long-term utility. revision: partial

standing simulated objections not resolved
  • Direct experimental testing of robustness to post-cutoff models is not possible, as these models do not currently exist.

Circularity Check

0 steps flagged

No circularity: benchmark rests on new empirical data collection

full rationale

The paper's core contribution is the release of RouterBench plus a static 405k inference dataset collected from existing LLMs on chosen tasks. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. The proposed theoretical framework is described at a high level without equations that reduce to the dataset by construction. Representativeness for future models is a generalizability concern, not a circularity issue. The work is self-contained as an empirical benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmark contribution that relies on collected inference data rather than new theoretical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5537 in / 1064 out tokens · 46254 ms · 2026-05-16T10:43:52.460809+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

    cs.IT 2026-05 unverdicted novelty 7.0

    CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.

  2. Efficient Ensemble Selection from Binary and Pairwise Feedback

    cs.GT 2026-05 unverdicted novelty 7.0

    The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...

  3. RouteProfile: Elucidating the Design Space of LLM Profiles for Routing

    cs.NI 2026-04 unverdicted novelty 7.0

    RouteProfile organizes LLM profile design into organizational form, representation type, aggregation depth, and learning configuration, with evaluations showing structured profiles outperform flat ones and aid general...

  4. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  5. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  6. GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.

  7. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  8. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

    cs.AI 2026-05 unverdicted novelty 6.0

    RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

  9. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  10. CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 6.0

    CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

  11. Privacy-Preserving LLMs Routing

    cs.CR 2026-04 unverdicted novelty 6.0

    PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.

  12. Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

  13. RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.

  14. Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible

    cs.CR 2026-02 conditional novelty 6.0

    An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.

  15. RouteLLM: Learning to Route LLMs with Preference Data

    cs.LG 2024-06 unverdicted novelty 6.0

    Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.

  16. Agentic AI Systems Should Be Designed as Marginal Token Allocators

    cs.AI 2026-05 unverdicted novelty 5.0

    Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.

  17. Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Tri-Spirit decomposes autonomous AI into planning, reasoning, and execution layers on heterogeneous hardware, yielding 75.6% lower latency, 71.1% less energy, and 77.6% offline task completion in 2000-task simulations.

  18. AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

    cs.LG 2026-04 unverdicted novelty 5.0

    AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · cited by 18 Pith papers · 16 internal anchors
