pith. machine review for the scientific record.

arxiv: 2406.18665 · v4 · submitted 2024-06-26 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean theorem

RouteLLM: Learning to Route LLMs with Preference Data

Authors on Pith no claims yet

Pith reviewed 2026-05-11 23:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM routing · preference data · cost optimization · model selection · transfer learning · inference efficiency · dynamic routing

The pith

Routers trained on human preference data can switch between strong and weak LLMs to cut costs by more than half while preserving answer quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces router models that decide at inference time whether to send a query to a powerful but expensive LLM or to a cheaper weaker one. These routers are trained using human preference data that labels which model's response is better for a given prompt, together with data augmentation to improve robustness. Across standard benchmarks the routers deliver more than twofold cost reductions compared with always using the strong model, yet response quality remains essentially unchanged. The same routers also continue to work well when the underlying strong and weak models are replaced with entirely different ones at test time.

Core claim

Efficient router models trained on human preference data and augmentation techniques can dynamically select between a stronger and a weaker LLM for each query, achieving over twofold cost reductions on widely used benchmarks without compromising response quality and retaining performance even when the strong-weak model pair changes at test time.

What carries the argument

A lightweight router model that learns to predict which of two LLMs will produce the preferred response for a given query, trained directly on human preference labels.
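A minimal sketch of what such a router could look like, assuming a toy feature set and hand-made preference labels. The `features` function, the example queries, and the 0.5 threshold are all invented for illustration; the paper's routers use far richer learned query representations.

```python
import math

# Toy stand-in for a preference-trained router: a logistic model over
# crude query features predicts P(strong model's answer is preferred).
# Features, data, and threshold are illustrative, not the paper's.

def features(query: str) -> list[float]:
    # Bias term, length, and question marks as rough difficulty proxies.
    return [1.0, len(query) / 100.0, float(query.count("?"))]

def train_router(pairs, epochs=200, lr=0.5):
    """pairs: list of (query, label); label=1 if the strong model won."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for q, y in pairs:
            x = features(q)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]  # SGD step
    return w

def route(w, query, threshold=0.5):
    x = features(query)
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return "strong" if p >= threshold else "weak"

# Toy preference data: long analytical queries favored the strong model.
prefs = [
    ("Prove the bound on the mixing time of this Markov chain and justify every step of the argument in full detail?", 1),
    ("Hi", 0),
    ("What's 2+2", 0),
    ("Explain why the spectral gap controls convergence, with a complete worked derivation shown step by step?", 1),
]
w = train_router(prefs)
```

Lowering `threshold` trades cost for quality: more queries escalate to the strong model. The paper's routers make the same per-query decision from preference labels, just with stronger representations.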

If this is right

  • Inference cost can be reduced by more than 2x on standard benchmarks while answer quality stays essentially the same as the strong model's.
  • A single router trained on one model pair continues to deliver savings when the strong and weak models are swapped.
  • Hybrid LLM deployments become practical: always use the expensive model only when the router predicts it is necessary.
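The 2x headline follows from simple arithmetic once per-token prices and a routing rate are fixed. A back-of-envelope check with invented prices and an invented escalation rate (none of these numbers come from the paper):

```python
# Hypothetical prices and routing rate, chosen only to show the arithmetic.
strong_cost = 10.0    # $ per 1M tokens for the strong model (invented)
weak_cost = 0.5       # $ per 1M tokens for the weak model (invented)
frac_to_strong = 0.4  # suppose the router escalates 40% of queries

hybrid_cost = frac_to_strong * strong_cost + (1 - frac_to_strong) * weak_cost
savings = strong_cost / hybrid_cost  # vs. always calling the strong model
print(f"{savings:.2f}x cheaper")     # 10 / 4.3, about 2.33x
```

With a large price gap, even a modest escalation rate clears 2x; the hard part, and the paper's actual contribution, is keeping quality flat as `frac_to_strong` drops.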

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Routing layers could be inserted as standard middleware in LLM serving stacks to optimize spend automatically.
  • Preference data collected once might support routers across many future model pairs, reducing the need to retrain for every new model release.
  • The same approach could be extended to routing among more than two models or to deciding when to use tool-augmented or multi-step reasoning paths.

Load-bearing premise

Human preference labels collected for one pair of models will let the router generalize to new queries and to different strong-weak model pairs without substantial loss of quality or cost savings.

What would settle it

Replace the test-time strong and weak models with two LLMs never seen during router training and measure whether the router's quality-cost curve drops below the curve obtained by always using the strong model.
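One way to run that test, sketched with synthetic numbers: sweep the router's confidence threshold on an unseen model pair and trace the quality-cost curve against the always-strong baseline. The probabilities, scores, and prices below are stand-ins, not the paper's data.

```python
# Sweep the routing threshold to trace a quality-cost curve for a model
# pair the router never saw in training. All inputs here are synthetic.
def quality_cost_curve(probs, strong_scores, weak_scores,
                       strong_cost=10.0, weak_cost=0.5):
    """probs[i] = router's predicted P(strong preferred) for query i."""
    curve = []
    for t in [i / 10 for i in range(11)]:
        picks = [p >= t for p in probs]
        quality = sum(s if u else w for u, s, w
                      in zip(picks, strong_scores, weak_scores)) / len(probs)
        cost = sum(strong_cost if u else weak_cost for u in picks) / len(probs)
        curve.append((t, quality, cost))
    return curve

# t=0.0 recovers the always-strong baseline; the transfer claim holds if
# some intermediate threshold matches its quality at much lower cost.
curve = quality_cost_curve([0.9, 0.1, 0.8, 0.2],
                           strong_scores=[1, 1, 1, 1],
                           weak_scores=[1, 0, 1, 0])
```

If the curve for the unseen pair hugs the baseline's quality over a range of thresholds while cost falls, transfer holds; if it sags below, the router was exploiting pair-specific cues.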

read the original abstract

Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RouteLLM, a set of efficient router models trained on human preference data and data augmentation techniques to dynamically select between a stronger and weaker LLM at inference time. The central claims are that this yields >2x cost reductions on standard benchmarks without quality loss, and that the routers exhibit significant transfer learning by maintaining performance when the strong/weak model pair is changed at test time.

Significance. If the transfer-learning result holds under rigorous controls, the work would be a practical contribution to cost-efficient LLM serving, showing that preference-based routers can generalize beyond the training pair. The emphasis on human preference data and augmentation is a reasonable engineering choice, though the absence of parameter-free derivations or machine-checked proofs limits the theoretical weight.

major comments (2)
  1. [Abstract] Abstract: the headline transfer claim ('maintaining their performance even when the strong and weak models are changed at test time') is load-bearing for the generalization narrative, yet the manuscript provides no evidence that test pairs are completely disjoint from the training pair, nor any ablation that removes model-identity cues from the router features. Without these, the result could reflect pair-specific biases rather than intrinsic query difficulty.
  2. [Evaluation] Evaluation section (and associated tables/figures): the reported >2x cost reduction 'without compromising the quality' is presented without explicit baselines, statistical significance tests, data-split details, or controls for selection effects on the preference data. This weakens support for the central cost-quality tradeoff claim.
minor comments (2)
  1. [Abstract] Abstract: 'costs-by over 2 times' is missing a space and should read 'costs by over 2 times'.
  2. [Methods] Notation for router inputs/outputs and the precise definition of 'preference data' versus augmented labels should be clarified in the methods to avoid ambiguity when readers attempt to reproduce the transfer experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of our claims regarding transfer learning and evaluation results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline transfer claim ('maintaining their performance even when the strong and weak models are changed at test time') is load-bearing for the generalization narrative, yet the manuscript provides no evidence that test pairs are completely disjoint from the training pair, nor any ablation that removes model-identity cues from the router features. Without these, the result could reflect pair-specific biases rather than intrinsic query difficulty.

    Authors: We appreciate the referee's point on the need for rigorous controls in the transfer learning experiments. Our experiments do involve training routers on specific model pairs and evaluating on different pairs, which we describe in the Evaluation section. To address the concern about disjointness, we will explicitly list the training and test model pairs in a table to confirm they are disjoint. Additionally, we will add an ablation study where we remove or neutralize any potential model-identity information from the router's input features (e.g., by using only query text embeddings without model names), to demonstrate that the routing decisions are based on query difficulty rather than memorized pair-specific patterns. These changes will be included in the revised manuscript. revision: yes

  2. Referee: [Evaluation] Evaluation section (and associated tables/figures): the reported >2x cost reduction 'without compromising the quality' is presented without explicit baselines, statistical significance tests, data-split details, or controls for selection effects on the preference data. This weakens support for the central cost-quality tradeoff claim.

    Authors: Thank you for this feedback. We agree that providing more details on the experimental setup and controls will strengthen the paper. In the revision, we will: (1) add explicit baseline comparisons, including always routing to the strong model, always to the weak model, and a random router; (2) include statistical significance tests (e.g., using bootstrap resampling or t-tests) for the cost and quality metrics; (3) detail the data splits used for training the routers and evaluating on benchmarks, including how the human preference data was collected and augmented; and (4) discuss potential selection effects and how our data augmentation techniques help mitigate biases in the preference data. These additions will be made to the Evaluation section, tables, and figures. revision: yes
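The bootstrap the authors propose in point (2) is easy to specify concretely. A minimal sketch under assumed inputs (per-query quality scores for the router and the always-strong baseline; nothing here comes from the paper):

```python
import random

# Percentile-bootstrap confidence interval for the mean quality gap
# between the router's answers and the always-strong baseline.
def bootstrap_ci(router_scores, strong_scores, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(router_scores)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries
        diffs.append(sum(router_scores[i] - strong_scores[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi  # "no quality loss" needs this interval to reach 0 or above
```

If the 95% interval on the gap excludes values below a stated tolerance, the "without compromising quality" claim is supported at that tolerance; otherwise the benchmarks are too noisy to settle it.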

Circularity Check

0 steps flagged

No circularity: training uses external preference data; transfer claims are empirical observations, not derivations by construction.

full rationale

The paper trains router models on human preference data for specific model pairs and reports empirical results on benchmarks, including observed transfer to changed model pairs at test time. No equations or steps reduce the routing prediction to a fitted parameter or self-citation by construction; the central claims rest on external data collection and standard evaluation rather than internal redefinition or renaming of inputs. This is a standard supervised learning setup with no load-bearing self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard domain assumptions about LLM cost-quality trade-offs and the utility of preference data; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Stronger LLMs incur higher inference cost but produce higher-quality responses on average
    Stated explicitly as the core dilemma the routers are meant to resolve.
  • domain assumption Human preference judgments on model outputs can be used to supervise a router that generalizes to unseen queries
    Central premise of the training framework described in the abstract.

pith-pipeline@v0.9.0 · 5489 in / 1262 out tokens · 40519 ms · 2026-05-11T23:23:00.556350+00:00 · methodology

discussion (0)


Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

    cs.DC 2026-05 conditional novelty 7.0

    KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

  3. A Regime Theory of Controller Class Selection for LLM Action Decisions

    cs.AI 2026-05 unverdicted novelty 7.0

    A regime theory selects the optimal controller class for LLM action decisions from a nested lattice of four classes using three data-estimable bottlenecks, with a Bernstein-tight threshold and empirical matches on mul...

  4. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  5. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  6. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  7. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  8. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  9. Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems

    cs.AI 2026-05 conditional novelty 7.0

    The paper introduces route receipts as a portable runtime record of routing decisions to make adaptive AI systems more transparent and trustworthy.

  10. Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

    cs.AI 2026-04 unverdicted novelty 7.0

    Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

  11. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  12. GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.

  13. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  14. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

    cs.AI 2026-05 unverdicted novelty 6.0

    RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

  15. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  16. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  17. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.

  18. Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training

    cs.AI 2026-05 conditional novelty 6.0

    Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...

  19. AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  20. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  21. RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

    cs.CL 2026-04 unverdicted novelty 6.0

    RouteLMT learns to route MT requests to large or small LLMs by predicting marginal quality gain from small-model token representations, yielding a better quality-budget Pareto frontier than baselines.

  22. Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination

    cs.AI 2026-04 unverdicted novelty 6.0

    PSMAS reduces token use in LLM multi-agent systems by 27.3% on average via phase-based temporal scheduling and context compression, with task performance staying within 2.1 points of full activation.

  23. Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

  24. Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

  25. RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.

  26. Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals

    cs.SE 2026-04 unverdicted novelty 6.0

    Triage routes coding tasks to cost-effective LLM tiers based on code quality metrics to maintain verification quality at lower cost.

  27. Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

    cs.CL 2026-04 conditional novelty 6.0

    A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

  28. Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories

    cs.CY 2026-04 conditional novelty 6.0

    A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.

  29. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  30. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.

  31. Agentic AI Systems Should Be Designed as Marginal Token Allocators

    cs.AI 2026-05 unverdicted novelty 5.0

    Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.

  32. TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

    cs.CL 2026-04 unverdicted novelty 5.0

    TRACES tags reasoning steps to enable adaptive early stopping, cutting token use by 20-50% on MATH500, GSM8K, AIME, MMLU and GPQA with comparable accuracy.

  33. A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

    cs.DC 2026-04 unverdicted novelty 5.0

    A-IO adaptively orchestrates LLM inference on NPUs to address memory bottlenecks, model scaling paradoxes, and synchronization costs in speculative decoding.

  34. AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

    cs.LG 2026-04 unverdicted novelty 5.0

    AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...

  35. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  36. Qualixar OS: A Universal Operating System for AI Agent Orchestration

    cs.AI 2026-04 unverdicted novelty 4.0

    Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 35 Pith papers · 10 internal anchors

  1. [2]

    AutoMix: Automatically mixing language models

    Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, and Mausam. Automix: Automatically mixing language models, 2024. URL https://arxiv.org/abs/2310.12963

  2. [3]

    Llama 3.1 model card, 2024a

    AI@Meta. Llama 3.1 model card, 2024a. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md. Accessed: 2024-09-29

  3. [4]

    Introducing meta llama 3: The most capable openly available llm to date, 2024b

    AI@Meta. Introducing meta llama 3: The most capable openly available llm to date, 2024b. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-05-21

  4. [5]

    Introducing the next generation of Claude

    Anthropic. Introducing the next generation of Claude, 2024. URL https://www.anthropic.com/news/claude-3-family. Accessed: 2024-05-22

  5. [6]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  6. [7]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4): 324–345, 1952

  7. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020

  8. [9]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  9. [10]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  10. [11]

    Chatbot Arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

  11. [12]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [13]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  13. [14]

    Hybrid LLM: Cost-efficient and quality-aware query routing

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=02f3mUtqnM

  14. [15]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 2024

  15. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020

  16. [17]

    RouterBench: A benchmark for multi-LLM routing system

    Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system, 2024. URL https://arxiv.org/abs/2403.12031

  17. [18]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  18. [19]

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion, 2023

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023

  19. [20]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

  20. [21]

    Matrix factorization techniques for recommender systems

    Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8): 30–37, 2009

  21. [22]

    Routing to the expert: Efficient reward-guided ensemble of large language models, 2023

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models, 2023. URL https://arxiv.org/abs/2311.08692

  22. [23]

    Martian router, 2024

    Martian. Martian router, 2024. URL https://withmartian.com/. Accessed: 2024-06-30

  23. [24]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  24. [25]

    Openai pricing, 2024

    OpenAI. Openai pricing, 2024. URL https://openai.com/api/pricing/. Accessed: 2024-06-30

  25. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730–27744, 2022

  26. [27]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  27. [28]

    Together.ai pricing, 2024

    Together.AI. Together.ai pricing, 2024. URL https://www.together.ai/pricing. Accessed: 2024-06-30

  28. [29]

    The bigchaos solution to the netflix grand prize

    Andreas Töscher, Michael Jahrer, and Robert M. Bell. The bigchaos solution to the netflix grand prize. Netflix prize documentation, pp. 1–52, 2009

  29. [30]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  30. [31]

    Unifyai, 2024

    UnifyAI. Unifyai, 2024. URL https://unify.ai. Accessed: 2024-06-30

  31. [32]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  32. [33]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  33. [34]

    Bartscore: Evaluating generated text as text generation

    Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34: 27263–27277, 2021

  34. [35]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://ope...

  35. [36]

    Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023

    Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023