RouteLLM: Learning to Route LLMs with Preference Data
Pith reviewed 2026-05-11 23:23 UTC · model grok-4.3
The pith
Routers trained on human preference data can switch between strong and weak LLMs to cut costs by more than half while preserving answer quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Efficient router models trained on human preference data and augmentation techniques can dynamically select between a stronger and a weaker LLM for each query, achieving over twofold cost reductions on widely used benchmarks without compromising response quality and retaining performance even when the strong-weak model pair changes at test time.
What carries the argument
A lightweight router model that learns to predict which of two LLMs will produce the preferred response for a given query, trained directly on human preference labels.
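A minimal sketch of what such a router could look like, assuming query embeddings as input and a logistic win-probability model; the paper explores several router architectures, and the featurization, training loop, and threshold here are illustrative, with labels synthesized so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: queries are represented by fixed-size embeddings.
# Label y=1 means human raters preferred the STRONG model's answer.
# In the paper, labels come from human preference data; here they
# are synthesized from a hidden direction so the sketch runs as-is.
dim = 8
X = rng.normal(size=(500, dim))
true_w = rng.normal(size=dim)
y = (X @ true_w + 0.3 * rng.normal(size=500) > 0).astype(float)

# Train a logistic router by gradient ascent on the preference-label
# likelihood: w scores how likely the strong model is to win a query.
w = np.zeros(dim)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w += 0.1 * X.T @ (y - p) / len(y)

def route(query_embedding, threshold=0.5):
    """Send the query to the strong model only when the router
    predicts the strong model's answer is likely to be preferred."""
    p_strong_wins = 1.0 / (1.0 + np.exp(-(query_embedding @ w)))
    return "strong" if p_strong_wins >= threshold else "weak"

# Raising the threshold trades quality for cost: fewer strong-model calls.
calls = [route(x, threshold=0.7) for x in X]
strong_fraction = calls.count("strong") / len(calls)
```

The threshold is the single knob that traces out the cost-quality curve: at 0 every query goes to the strong model, at 1 none do.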
If this is right
- Inference cost can be reduced by more than 2x on standard benchmarks while answer quality matches that of the strong model.
- A single router trained on one model pair continues to deliver savings when the strong and weak models are swapped.
- Hybrid LLM deployments become practical: the expensive model is called only when the router predicts it is necessary.
Where Pith is reading between the lines
- Routing layers could be inserted as standard middleware in LLM serving stacks to optimize spend automatically.
- Preference data collected once might support routers across many future model pairs, reducing the need to retrain for every new model release.
- The same approach could be extended to routing among more than two models or to deciding when to use tool-augmented or multi-step reasoning paths.
Load-bearing premise
Human preference labels collected for one pair of models will let the router generalize to new queries and to different strong-weak model pairs without substantial loss of quality or cost savings.
What would settle it
Replace the test-time strong and weak models with two LLMs never seen during router training and measure whether the router's quality-cost curve drops below the curve obtained by always using the strong model.
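Scoring such a transfer test could look roughly like the sketch below, assuming per-query router scores, measured quality for both unseen test-time models, and relative per-call costs; all numbers are illustrative, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Illustrative inputs for an unseen strong/weak pair:
# router_score[i] -> router's predicted P(strong model wins) on query i
# q_strong/q_weak -> measured quality of each model's answer (0..1)
router_score = rng.uniform(size=n)
q_strong = np.clip(0.9 + 0.1 * rng.normal(size=n), 0, 1)
q_weak = np.clip(q_strong - 0.4 * router_score, 0, 1)  # weak degrades on "hard" queries
COST_STRONG, COST_WEAK = 10.0, 1.0  # relative per-call costs

def quality_cost(threshold):
    """Average quality and cost when queries scoring >= threshold
    go to the strong model and the rest go to the weak model."""
    use_strong = router_score >= threshold
    quality = np.where(use_strong, q_strong, q_weak).mean()
    cost = np.where(use_strong, COST_STRONG, COST_WEAK).mean()
    return quality, cost

# Sweep thresholds to trace the router's quality-cost curve and compare
# against the always-strong baseline; the transfer claim fails if no
# threshold retains near-strong quality at a materially lower cost.
curve = [quality_cost(t) for t in np.linspace(0, 1, 21)]
baseline_quality, baseline_cost = q_strong.mean(), COST_STRONG
```

If the swept curve never dominates the always-strong point for models unseen in training, the transfer claim would be falsified.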
read the original abstract
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RouteLLM, a set of efficient router models trained on human preference data and data augmentation techniques to dynamically select between a stronger and weaker LLM at inference time. The central claims are that this yields >2x cost reductions on standard benchmarks without quality loss, and that the routers exhibit significant transfer learning by maintaining performance when the strong/weak model pair is changed at test time.
Significance. If the transfer-learning result holds under rigorous controls, the work would be a practical contribution to cost-efficient LLM serving, showing that preference-based routers can generalize beyond the training pair. The emphasis on human preference data and augmentation is a reasonable engineering choice, though the absence of parameter-free derivations or machine-checked proofs limits the theoretical weight.
major comments (2)
- [Abstract] Abstract: the headline transfer claim ('maintaining their performance even when the strong and weak models are changed at test time') is load-bearing for the generalization narrative, yet the manuscript provides no evidence that test pairs are completely disjoint from the training pair, nor any ablation that removes model-identity cues from the router features. Without these, the result could reflect pair-specific biases rather than intrinsic query difficulty.
- [Evaluation] Evaluation section (and associated tables/figures): the reported >2x cost reduction 'without compromising the quality' is presented without explicit baselines, statistical significance tests, data-split details, or controls for selection effects on the preference data. This weakens support for the central cost-quality tradeoff claim.
minor comments (2)
- [Abstract] Abstract: in 'reduces costs-by over 2 times in certain cases-without', the dashes setting off the parenthetical are run into the surrounding words and should be properly spaced or typeset.
- [Methods] Notation for router inputs/outputs and the precise definition of 'preference data' versus augmented labels should be clarified in the methods to avoid ambiguity when readers attempt to reproduce the transfer experiments.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and rigor of our claims regarding transfer learning and evaluation results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline transfer claim ('maintaining their performance even when the strong and weak models are changed at test time') is load-bearing for the generalization narrative, yet the manuscript provides no evidence that test pairs are completely disjoint from the training pair, nor any ablation that removes model-identity cues from the router features. Without these, the result could reflect pair-specific biases rather than intrinsic query difficulty.
Authors: We appreciate the referee's point on the need for rigorous controls in the transfer learning experiments. Our experiments do involve training routers on specific model pairs and evaluating on different pairs, which we describe in the Evaluation section. To address the concern about disjointness, we will explicitly list the training and test model pairs in a table to confirm they are disjoint. Additionally, we will add an ablation study where we remove or neutralize any potential model-identity information from the router's input features (e.g., by using only query text embeddings without model names), to demonstrate that the routing decisions are based on query difficulty rather than memorized pair-specific patterns. These changes will be included in the revised manuscript. revision: yes
-
Referee: [Evaluation] Evaluation section (and associated tables/figures): the reported >2x cost reduction 'without compromising the quality' is presented without explicit baselines, statistical significance tests, data-split details, or controls for selection effects on the preference data. This weakens support for the central cost-quality tradeoff claim.
Authors: Thank you for this feedback. We agree that providing more details on the experimental setup and controls will strengthen the paper. In the revision, we will: (1) add explicit baseline comparisons, including always routing to the strong model, always to the weak model, and a random router; (2) include statistical significance tests (e.g., using bootstrap resampling or t-tests) for the cost and quality metrics; (3) detail the data splits used for training the routers and evaluating on benchmarks, including how the human preference data was collected and augmented; and (4) discuss potential selection effects and how our data augmentation techniques help mitigate biases in the preference data. These additions will be made to the Evaluation section, tables, and figures. revision: yes
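The baseline comparisons and significance test promised above could be scored roughly as follows; the per-query quality values, the 40% routing rate, and the percentile-bootstrap details are illustrative assumptions, not the paper's actual evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 800

# Illustrative per-query quality under each policy (1 = judged good).
quality_strong = rng.binomial(1, 0.85, n)          # always-strong baseline
quality_weak = rng.binomial(1, 0.65, n)            # always-weak baseline
router_picks_strong = rng.uniform(size=n) < 0.4    # router calls strong on 40%
quality_router = np.where(router_picks_strong, quality_strong, quality_weak)
quality_random = np.where(rng.uniform(size=n) < 0.4, quality_strong, quality_weak)

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean quality difference a - b."""
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# "No quality loss" vs always-strong is supported if the CI for
# (router - strong) contains zero or lies above it.
ci = bootstrap_ci(quality_router, quality_strong)
```

The same CI computed against the random router separates genuine query-difficulty prediction from the savings any policy gets just by sometimes calling the cheap model.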
Circularity Check
No circularity: training uses external preference data; transfer claims are empirical observations, not derivations by construction.
full rationale
The paper trains router models on human preference data for specific model pairs and reports empirical results on benchmarks, including observed transfer to changed model pairs at test time. No equations or steps reduce the routing prediction to a fitted parameter or self-citation by construction; the central claims rest on external data collection and standard evaluation rather than internal redefinition or renaming of inputs. This is a standard supervised learning setup with no load-bearing self-referential loops.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Stronger LLMs incur higher inference cost but produce higher-quality responses on average
- domain assumption Human preference judgments on model outputs can be used to supervise a router that generalizes to unseen queries
Forward citations
Cited by 36 Pith papers
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
-
A Regime Theory of Controller Class Selection for LLM Action Decisions
A regime theory selects the optimal controller class for LLM action decisions from a nested lattice of four classes using three data-estimable bottlenecks, with a Bernstein-tight threshold and empirical matches on mul...
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems
The paper introduces route receipts as a portable runtime record of routing decisions to make adaptive AI systems more transparent and trustworthy.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...
-
GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
GAR routes LLM inference requests via constrained multi-objective optimization to cut per-request CO2 emissions while respecting accuracy floors and p95 latency SLOs.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
-
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
-
ModelLens: Finding the Best for Your Task from Myriads of Models
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
-
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
-
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...
-
AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...
-
RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
RouteLMT learns to route MT requests to large or small LLMs by predicting marginal quality gain from small-model token representations, yielding a better quality-budget Pareto frontier than baselines.
-
Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination
PSMAS reduces token use in LLM multi-agent systems by 27.3% on average via phase-based temporal scheduling and context compression, with task performance staying within 2.1 points of full activation.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals
Triage routes coding tasks to cost-effective LLM tiers based on code quality metrics to maintain verification quality at lower cost.
-
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
-
Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories
A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
-
Agentic AI Systems Should Be Designed as Marginal Token Allocators
Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.
-
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
TRACES tags reasoning steps to enable adaptive early stopping, cutting token use by 20-50% on MATH500, GSM8K, AIME, MMLU and GPQA with comparable accuracy.
-
A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
A-IO adaptively orchestrates LLM inference on NPUs to address memory bottlenecks, model scaling paradoxes, and synchronization costs in speculative decoding.
-
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
Qualixar OS: A Universal Operating System for AI Agent Orchestration
Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...