pith. machine review for the scientific record.

arxiv: 2406.18665 · v4 · submitted 2024-06-26 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean theorem

RouteLLM: Learning to Route LLMs with Preference Data

Authors on Pith no claims yet

Pith reviewed 2026-05-11 23:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM routing · preference data · cost optimization · model selection · transfer learning · inference efficiency · dynamic routing

The pith

Routers trained on human preference data can switch between strong and weak LLMs to cut costs by more than half while preserving answer quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces router models that decide at inference time whether to send a query to a powerful but expensive LLM or to a cheaper weaker one. These routers are trained using human preference data that labels which model's response is better for a given prompt, together with data augmentation to improve robustness. Across standard benchmarks the routers deliver more than twofold cost reductions compared with always using the strong model, yet response quality remains essentially unchanged. The same routers also continue to work well when the underlying strong and weak models are replaced with entirely different ones at test time.

Core claim

Efficient router models trained on human preference data and augmentation techniques can dynamically select between a stronger and a weaker LLM for each query, achieving over twofold cost reductions on widely used benchmarks without compromising response quality and retaining performance even when the strong-weak model pair changes at test time.

What carries the argument

A lightweight router model that learns to predict which of two LLMs will produce the preferred response for a given query, trained directly on human preference labels.
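A minimal sketch of what such a router could look like, assuming a toy feature set and hand-made preference labels. The `features` function, the example queries, and the 0.5 threshold are all invented for illustration; the paper's routers use far richer learned query representations.

```python
import math

# Toy stand-in for a preference-trained router: a logistic model over
# crude query features predicts P(strong model's answer is preferred).
# Features, data, and threshold are illustrative, not the paper's.

def features(query: str) -> list[float]:
    # Bias term, length, and question marks as rough difficulty proxies.
    return [1.0, len(query) / 100.0, float(query.count("?"))]

def train_router(pairs, epochs=200, lr=0.5):
    """pairs: list of (query, label); label=1 if the strong model won."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for q, y in pairs:
            x = features(q)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]  # SGD step
    return w

def route(w, query, threshold=0.5):
    x = features(query)
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return "strong" if p >= threshold else "weak"

# Toy preference data: long analytical queries favored the strong model.
prefs = [
    ("Prove the bound on the mixing time of this Markov chain and justify every step of the argument in full detail?", 1),
    ("Hi", 0),
    ("What's 2+2", 0),
    ("Explain why the spectral gap controls convergence, with a complete worked derivation shown step by step?", 1),
]
w = train_router(prefs)
```

Lowering `threshold` trades cost for quality: more queries escalate to the strong model. The paper's routers make the same per-query decision from preference labels, just with stronger representations.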

If this is right

  • Inference cost can be reduced by more than 2x on standard benchmarks while answer quality stays essentially the same as the strong model's.
  • A single router trained on one model pair continues to deliver savings when the strong and weak models are swapped.
  • Hybrid LLM deployments become practical: always use the expensive model only when the router predicts it is necessary.
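The 2x headline follows from simple arithmetic once per-token prices and a routing rate are fixed. A back-of-envelope check with invented prices and an invented escalation rate (none of these numbers come from the paper):

```python
# Hypothetical prices and routing rate, chosen only to show the arithmetic.
strong_cost = 10.0    # $ per 1M tokens for the strong model (invented)
weak_cost = 0.5       # $ per 1M tokens for the weak model (invented)
frac_to_strong = 0.4  # suppose the router escalates 40% of queries

hybrid_cost = frac_to_strong * strong_cost + (1 - frac_to_strong) * weak_cost
savings = strong_cost / hybrid_cost  # vs. always calling the strong model
print(f"{savings:.2f}x cheaper")     # 10 / 4.3, about 2.33x
```

With a large price gap, even a modest escalation rate clears 2x; the hard part, and the paper's actual contribution, is keeping quality flat as `frac_to_strong` drops.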

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Routing layers could be inserted as standard middleware in LLM serving stacks to optimize spend automatically.
  • Preference data collected once might support routers across many future model pairs, reducing the need to retrain for every new model release.
  • The same approach could be extended to routing among more than two models or to deciding when to use tool-augmented or multi-step reasoning paths.

Load-bearing premise

Human preference labels collected for one pair of models will let the router generalize to new queries and to different strong-weak model pairs without substantial loss of quality or cost savings.

What would settle it

Replace the test-time strong and weak models with two LLMs never seen during router training and measure whether the router's quality-cost curve drops below the curve obtained by always using the strong model.
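One way to run that test, sketched with synthetic numbers: sweep the router's confidence threshold on an unseen model pair and trace the quality-cost curve against the always-strong baseline. The probabilities, scores, and prices below are stand-ins, not the paper's data.

```python
# Sweep the routing threshold to trace a quality-cost curve for a model
# pair the router never saw in training. All inputs here are synthetic.
def quality_cost_curve(probs, strong_scores, weak_scores,
                       strong_cost=10.0, weak_cost=0.5):
    """probs[i] = router's predicted P(strong preferred) for query i."""
    curve = []
    for t in [i / 10 for i in range(11)]:
        picks = [p >= t for p in probs]
        quality = sum(s if u else w for u, s, w
                      in zip(picks, strong_scores, weak_scores)) / len(probs)
        cost = sum(strong_cost if u else weak_cost for u in picks) / len(probs)
        curve.append((t, quality, cost))
    return curve

# t=0.0 recovers the always-strong baseline; the transfer claim holds if
# some intermediate threshold matches its quality at much lower cost.
curve = quality_cost_curve([0.9, 0.1, 0.8, 0.2],
                           strong_scores=[1, 1, 1, 1],
                           weak_scores=[1, 0, 1, 0])
```

If the curve for the unseen pair hugs the baseline's quality over a range of thresholds while cost falls, transfer holds; if it sags below, the router was exploiting pair-specific cues.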

read the original abstract

Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RouteLLM, a set of efficient router models trained on human preference data and data augmentation techniques to dynamically select between a stronger and weaker LLM at inference time. The central claims are that this yields >2x cost reductions on standard benchmarks without quality loss, and that the routers exhibit significant transfer learning by maintaining performance when the strong/weak model pair is changed at test time.

Significance. If the transfer-learning result holds under rigorous controls, the work would be a practical contribution to cost-efficient LLM serving, showing that preference-based routers can generalize beyond the training pair. The emphasis on human preference data and augmentation is a reasonable engineering choice, though the absence of parameter-free derivations or machine-checked proofs limits the theoretical weight.

major comments (2)
  1. [Abstract] Abstract: the headline transfer claim ('maintaining their performance even when the strong and weak models are changed at test time') is load-bearing for the generalization narrative, yet the manuscript provides no evidence that test pairs are completely disjoint from the training pair, nor any ablation that removes model-identity cues from the router features. Without these, the result could reflect pair-specific biases rather than intrinsic query difficulty.
  2. [Evaluation] Evaluation section (and associated tables/figures): the reported >2x cost reduction 'without compromising the quality' is presented without explicit baselines, statistical significance tests, data-split details, or controls for selection effects on the preference data. This weakens support for the central cost-quality tradeoff claim.
minor comments (2)
  1. [Abstract] Abstract: 'costs-by over 2 times' is missing a space and should read 'costs by over 2 times'.
  2. [Methods] Notation for router inputs/outputs and the precise definition of 'preference data' versus augmented labels should be clarified in the methods to avoid ambiguity when readers attempt to reproduce the transfer experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of our claims regarding transfer learning and evaluation results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline transfer claim ('maintaining their performance even when the strong and weak models are changed at test time') is load-bearing for the generalization narrative, yet the manuscript provides no evidence that test pairs are completely disjoint from the training pair, nor any ablation that removes model-identity cues from the router features. Without these, the result could reflect pair-specific biases rather than intrinsic query difficulty.

    Authors: We appreciate the referee's point on the need for rigorous controls in the transfer learning experiments. Our experiments do involve training routers on specific model pairs and evaluating on different pairs, which we describe in the Evaluation section. To address the concern about disjointness, we will explicitly list the training and test model pairs in a table to confirm they are disjoint. Additionally, we will add an ablation study where we remove or neutralize any potential model-identity information from the router's input features (e.g., by using only query text embeddings without model names), to demonstrate that the routing decisions are based on query difficulty rather than memorized pair-specific patterns. These changes will be included in the revised manuscript. revision: yes

  2. Referee: [Evaluation] Evaluation section (and associated tables/figures): the reported >2x cost reduction 'without compromising the quality' is presented without explicit baselines, statistical significance tests, data-split details, or controls for selection effects on the preference data. This weakens support for the central cost-quality tradeoff claim.

    Authors: Thank you for this feedback. We agree that providing more details on the experimental setup and controls will strengthen the paper. In the revision, we will: (1) add explicit baseline comparisons, including always routing to the strong model, always to the weak model, and a random router; (2) include statistical significance tests (e.g., using bootstrap resampling or t-tests) for the cost and quality metrics; (3) detail the data splits used for training the routers and evaluating on benchmarks, including how the human preference data was collected and augmented; and (4) discuss potential selection effects and how our data augmentation techniques help mitigate biases in the preference data. These additions will be made to the Evaluation section, tables, and figures. revision: yes
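The bootstrap the authors propose in point (2) is easy to specify concretely. A minimal sketch under assumed inputs (per-query quality scores for the router and the always-strong baseline; nothing here comes from the paper):

```python
import random

# Percentile-bootstrap confidence interval for the mean quality gap
# between the router's answers and the always-strong baseline.
def bootstrap_ci(router_scores, strong_scores, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(router_scores)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries
        diffs.append(sum(router_scores[i] - strong_scores[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi  # "no quality loss" needs this interval to reach 0 or above
```

If the 95% interval on the gap excludes values below a stated tolerance, the "without compromising quality" claim is supported at that tolerance; otherwise the benchmarks are too noisy to settle it.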

Circularity Check

0 steps flagged

No circularity: training uses external preference data; transfer claims are empirical observations, not derivations by construction.

full rationale

The paper trains router models on human preference data for specific model pairs and reports empirical results on benchmarks, including observed transfer to changed model pairs at test time. No equations or steps reduce the routing prediction to a fitted parameter or self-citation by construction; the central claims rest on external data collection and standard evaluation rather than internal redefinition or renaming of inputs. This is a standard supervised learning setup with no load-bearing self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard domain assumptions about LLM cost-quality trade-offs and the utility of preference data; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Stronger LLMs incur higher inference cost but produce higher-quality responses on average
    Stated explicitly as the core dilemma the routers are meant to resolve.
  • domain assumption Human preference judgments on model outputs can be used to supervise a router that generalizes to unseen queries
    Central premise of the training framework described in the abstract.

pith-pipeline@v0.9.0 · 5489 in / 1262 out tokens · 40519 ms · 2026-05-11T23:23:00.556350+00:00 · methodology

discussion (0)


Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

    cs.DC 2026-05 conditional novelty 7.0

    KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

  3. A Regime Theory of Controller Class Selection for LLM Action Decisions

    cs.AI 2026-05 unverdicted novelty 7.0

    A regime theory selects the optimal controller class for LLM action decisions from a nested lattice of four classes using three data-estimable bottlenecks, with a Bernstein-tight threshold and empirical matches on mul...

  4. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  5. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  6. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  7. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  8. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  9. Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems

    cs.AI 2026-05 conditional novelty 7.0

    The paper introduces route receipts as a portable runtime record of routing decisions to make adaptive AI systems more transparent and trustworthy.

  10. Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

    cs.AI 2026-04 unverdicted novelty 7.0

    Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

  11. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  12. GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.

  13. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  14. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

    cs.AI 2026-05 unverdicted novelty 6.0

    RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

  15. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  16. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  17. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  18. Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training

    cs.AI 2026-05 conditional novelty 6.0

    Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...

  19. AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  20. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  21. RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

    cs.CL 2026-04 unverdicted novelty 6.0

    RouteLMT learns to route MT requests to large or small LLMs by predicting marginal quality gain from small-model token representations, yielding a better quality-budget Pareto frontier than baselines.

  22. Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination

    cs.AI 2026-04 unverdicted novelty 6.0

    PSMAS reduces token use in LLM multi-agent systems by 27.3% on average via phase-based temporal scheduling and context compression, with task performance staying within 2.1 points of full activation.

  23. Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

  24. Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

  25. RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.

  26. Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals

    cs.SE 2026-04 unverdicted novelty 6.0

    Triage routes coding tasks to cost-effective LLM tiers based on code quality metrics to maintain verification quality at lower cost.

  27. Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

    cs.CL 2026-04 conditional novelty 6.0

    A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

  28. Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories

    cs.CY 2026-04 conditional novelty 6.0

    A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.

  29. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  30. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.

  31. Agentic AI Systems Should Be Designed as Marginal Token Allocators

    cs.AI 2026-05 unverdicted novelty 5.0

    Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.

  32. TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

    cs.CL 2026-04 unverdicted novelty 5.0

    TRACES tags reasoning steps to enable adaptive early stopping, cutting token use by 20-50% on MATH500, GSM8K, AIME, MMLU and GPQA with comparable accuracy.

  33. A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

    cs.DC 2026-04 unverdicted novelty 5.0

    A-IO adaptively orchestrates LLM inference on NPUs to address memory bottlenecks, model scaling paradoxes, and synchronization costs in speculative decoding.

  34. AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

    cs.LG 2026-04 unverdicted novelty 5.0

    AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...

  35. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  36. Qualixar OS: A Universal Operating System for AI Agent Orchestration

    cs.AI 2026-04 unverdicted novelty 4.0

    Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 35 Pith papers · 10 internal anchors

  1. [2]

    AutoMix: Automatically mixing language models

    Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. Automix: Automatically mixing language models, 2024. URL https://arxiv.org/abs/2310.12963

  2. [3]

    Llama 3.1 model card, 2024a

    AI@Meta. Llama 3.1 model card, 2024a. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md. Accessed: 2024-09-29

  3. [4]

    Introducing meta llama 3: The most capable openly available llm to date, 2024b

    AI@Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024b. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-05-21

  4. [5]

    Introducing the next generation of Claude

    Anthropic. Introducing the next generation of Claude, 2024. URL https://www.anthropic.com/news/claude-3-family. Accessed: 2024-05-22

  5. [6]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  6. [7]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4): 324–345, 1952

  7. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020

  8. [9]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  9. [10]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  10. [11]

    Chatbot Arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

  11. [12]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [13]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  13. [14]

    Hybrid LLM: Cost-efficient and quality-aware query routing

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=02f3mUtqnM

  14. [15]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024

  15. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020

  16. [17]

    RouterBench: A benchmark for multi-LLM routing system

    Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system, 2024. URL https://arxiv.org/abs/2403.12031

  17. [18]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  18. [19]

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion, 2023

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023

  19. [20]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

  20. [21]

    Matrix factorization techniques for recommender systems

    Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8): 30–37, 2009

  21. [22]

    Routing to the expert: Efficient reward-guided ensemble of large language models, 2023

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models, 2023. URL https://arxiv.org/abs/2311.08692

  22. [23]

    Martian router, 2024

    Martian. Martian router, 2024. URL https://withmartian.com/. Accessed: 2024-06-30

  23. [24]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  24. [25]

    Openai pricing, 2024

    OpenAI. Openai pricing, 2024. URL https://openai.com/api/pricing/. Accessed: 2024-06-30

  25. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744, 2022

  26. [27]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  27. [28]

    Together.ai pricing, 2024

    Together.AI. Together.ai pricing, 2024. URL https://www.together.ai/pricing. Accessed: 2024-06-30

  28. [29]

    The bigchaos solution to the netflix grand prize

    Andreas Töscher, Michael Jahrer, and Robert M. Bell. The bigchaos solution to the netflix grand prize. Netflix prize documentation, pp. 1–52, 2009

  29. [30]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  30. [31]

    Unifyai, 2024

    UnifyAI. Unifyai, 2024. URL https://unify.ai. Accessed: 2024-06-30

  31. [32]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  32. [33]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  33. [34]

    Bartscore: Evaluating generated text as text generation

    Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34: 27263–27277, 2021

  34. [35]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://ope...

  35. [36]

    Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023

    Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023