Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

Dingqi Yang; Hailong Sun; Jingzheng Li; Kai Sun; Likang Xiao; Ming Li; Pengpeng Chen; Philip S. Yu; Qianren Mao; Xiaodong Lu

arxiv: 2502.18036 · v6 · submitted 2025-02-25 · 💻 cs.CL


Zhijun Chen, Xiaodong Lu, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, Xiao Huang, Yikun Ban, Hailong Sun, Philip S. Yu


Pith reviewed 2026-05-23 02:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM Ensemble, Large Language Models, Ensemble Methods, Taxonomy, Survey, Inference Stages, Benchmarks, Applications


The pith

LLM ensemble methods can be systematically reviewed and classified using a three-stage taxonomy based on when the combination occurs relative to inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first systematic review of LLM Ensemble techniques that combine multiple large language models to exploit their individual strengths on user queries. It proposes a taxonomy that divides these methods into ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference, then provides detailed classifications and reviews of methods within each category. The survey also examines related benchmarks and applications, summarizes the studies, and outlines future research directions. A sympathetic reader would care because the increasing number of available LLMs makes understanding how to best combine them relevant for practical improvements in downstream tasks.

Core claim

The paper claims to deliver the first comprehensive taxonomy and review of LLM Ensemble, showing that methods fall into three distinct stages relative to the inference process, with each stage containing specific techniques that can be reviewed and compared through existing benchmarks and applications.

What carries the argument

The three-stage taxonomy (ensemble-before-inference, ensemble-during-inference, ensemble-after-inference) that organizes all LLM ensemble methods for review and classification.
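To make the three stages concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a method taken from the survey: the toy `MODELS` table, the digit-based routing rule, the hand-written next-token distributions, and the majority vote are all hypothetical. It only shows where in the pipeline the combination happens for each stage.

```python
from collections import Counter

# Hypothetical toy "models": each maps a query string to an answer string.
MODELS = {
    "math_model": lambda q: "4" if "2+2" in q else "unknown",
    "chat_model": lambda q: "4" if "2+2" in q else "hello",
    "code_model": lambda q: "5" if "2+2" in q else "print('hi')",
}

def ensemble_before_inference(query):
    """Choose ONE model before any model runs (a toy rule-based router)."""
    name = "math_model" if any(ch.isdigit() for ch in query) else "chat_model"
    return name, MODELS[name](query)

def ensemble_during_inference(token_dists):
    """Combine per-model next-token distributions at one decoding step.

    Probabilities are averaged across models; ties go to the
    alphabetically first token so the result is deterministic.
    """
    vocab = set().union(*token_dists)
    avg = {t: sum(d.get(t, 0.0) for d in token_dists) / len(token_dists)
           for t in vocab}
    return max(sorted(avg), key=avg.get)

def ensemble_after_inference(query):
    """Run every model to completion, then aggregate by majority vote."""
    answers = [model(query) for model in MODELS.values()]
    return Counter(answers).most_common(1)[0][0]

routed = ensemble_before_inference("What is 2+2?")
fused_token = ensemble_during_inference(
    [{"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
)
voted = ensemble_after_inference("What is 2+2?")
```

In this sketch the before-inference path calls a single model, while the during- and after-inference paths involve every model. That cost difference is one reason a practitioner might care which stage a method belongs to.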

If this is right

Existing methods can be mapped onto the taxonomy without major omissions.
The review identifies gaps that future work can address in each stage.
Benchmarks provide a way to evaluate and compare ensemble approaches.
Applications demonstrate practical uses across various domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could guide the development of new hybrid methods that operate across multiple stages.
Practitioners might use the classification to choose ensemble strategies based on their computational constraints.
Future surveys could expand the taxonomy if new methods emerge that challenge the three-stage division.
Connections between LLM Ensemble and other multi-model techniques like mixture-of-experts may warrant further investigation.

Load-bearing premise

That the three-stage taxonomy comprehensively partitions the space of all relevant LLM ensemble methods without significant omissions or overlaps.

What would settle it

Discovery of an LLM ensemble technique that requires a fourth distinct category or cannot fit into the existing three without substantial overlap would challenge the taxonomy's completeness.
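The completeness test described above can be phrased as a simple audit. This is a sketch under stated assumptions: the method names are hypothetical placeholders, and the stage labels `before`, `during`, and `after` are shorthand for the three categories. A method filed under no stage signals an omission; a method filed under more than one signals an overlap.

```python
STAGES = ("before", "during", "after")

def audit_partition(assignments):
    """Return (uncovered, overlapping) method names.

    `assignments` maps a method name to the set of stages it was filed under.
    """
    valid = set(STAGES)
    uncovered = sorted(m for m, s in assignments.items() if not set(s) & valid)
    overlapping = sorted(
        m for m, s in assignments.items() if len(set(s) & valid) > 1
    )
    return uncovered, overlapping

# Hypothetical method names, used only to exercise the check.
example = {
    "router_x": {"before"},
    "token_fusion_y": {"during"},
    "select_then_verify_z": {"before", "after"},  # spans two stages
    "prompt_ensemble_w": set(),                   # fits no stage
}
uncovered, overlapping = audit_partition(example)
```

If both returned lists are empty for the full set of surveyed methods, the three stages form a partition of that set. Any non-empty list names the methods that would challenge it.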

Figures

Figures reproduced from arXiv: 2502.18036 by Dingqi Yang, Hailong Sun, Jingzheng Li, Kai Sun, Likang Xiao, Ming Li, Pengpeng Chen, Philip S. Yu, Qianren Mao, Xiaodong Lu, Xiao Huang, Yikun Ban, Yuankai Luo, Zhijun Chen, Zhuoran Li.

**Figure 2.** Taxonomy of LLM Ensemble methods.

Original abstract

LLM Ensemble -- which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths -- has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at https://github.com/junchenzhi/Awesome-LLM-Ensemble.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward survey that organizes LLM ensemble methods into a before/during/after taxonomy and supplies a GitHub list, but the exhaustiveness of that partition is not demonstrated.


The paper's main contribution is a literature review that groups existing LLM ensemble work into three stages and walks through methods, benchmarks, applications, and open questions. The GitHub repo of papers is a practical addition that makes the compilation immediately usable for someone trying to get up to speed. That is the useful part: it saves readers time by collecting and labeling the scattered papers in one place rather than leaving them to hunt through arXiv themselves. The structure is logical at a high level and the future-directions section stays concrete without overclaiming. No new algorithms or experiments appear, which is expected for a survey, but the classification itself is the organizing move. The taxonomy is presented as the backbone, yet the text does not include a systematic check that every cited method lands cleanly in one bucket with no hybrids or orthogonal dimensions left out. If methods that combine selection before inference with verification after, or that ensemble prompting strategies separately from model choice, require extra categories, the partition would need adjustment. The abstract's claim of being the first systematic review therefore rests on that untested completeness. Readers who want a quick map of the area will find it serviceable; those looking for a definitive partition or resolution of technical open problems will not. It is the kind of paper that belongs in a survey track or as a reference piece rather than a methods contribution. A serious editor should send it to peer review so the coverage and taxonomy can be stress-tested by specialists who know the full literature.

Referee Report

1 major / 1 minor

Summary. The paper claims to present the first systematic review of recent developments in LLM Ensemble, which uses multiple LLMs to leverage their individual strengths for downstream inference. It introduces a taxonomy partitioning methods into ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference; provides in-depth classification and review of methods under these categories; discusses related research problems, benchmarks, and applications; summarizes existing studies; and suggests future directions, supported by a curated GitHub list of papers.

Significance. If the taxonomy is shown to be comprehensive, the survey would offer a useful organizing framework for the rapidly growing LLM ensemble literature in NLP, helping researchers identify patterns, gaps, and connections across methods. The public GitHub repository of curated papers is a clear strength, enhancing accessibility and reproducibility of the review.

major comments (1)

[Taxonomy introduction and classification sections] The claim of presenting the 'first systematic review' (abstract) rests on the three-stage taxonomy being exhaustive and non-overlapping. The manuscript does not explicitly analyze or rule out hybrid methods (e.g., dynamic model selection that combines before- and during-inference stages) or orthogonal dimensions (e.g., ensembles over prompting strategies), which could produce overlaps or omissions and undermine the partition's completeness as a systematic structure.

minor comments (1)

[Introduction] The abstract states that 'several related research problems' are discussed, but the manuscript could clarify in the introduction or taxonomy section how these problems map onto the three-stage structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our taxonomy. We address the major comment below and will revise the manuscript to incorporate the suggested analysis.


Referee: [Taxonomy introduction and classification sections] The claim of presenting the 'first systematic review' (abstract) rests on the three-stage taxonomy being exhaustive and non-overlapping. The manuscript does not explicitly analyze or rule out hybrid methods (e.g., dynamic model selection that combines before- and during-inference stages) or orthogonal dimensions (e.g., ensembles over prompting strategies), which could produce overlaps or omissions and undermine the partition's completeness as a systematic structure.

Authors: We appreciate the referee's observation. Our taxonomy partitions methods according to the primary stage (before, during, or after inference) at which the ensemble decision or aggregation occurs, which provides a clear and actionable organizing principle for the literature. We agree that the manuscript does not explicitly analyze hybrid methods or orthogonal dimensions such as prompting-strategy ensembles. To strengthen the taxonomy section, we will add a dedicated paragraph (or short subsection) that (1) acknowledges the possibility of hybrid approaches, (2) illustrates how a method that spans stages can still be classified by its dominant stage while noting the hybrid aspect, and (3) clarifies that orthogonal dimensions (e.g., prompting) are largely independent of the stage-based partition and can be applied across categories. This revision will make the boundaries of the taxonomy more transparent without altering its core structure or the claim of providing the first systematic review organized around these stages.

Revision: yes
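One way to picture the classification rule proposed in the rebuttal is a record that stores a single dominant stage plus optional annotations. This is only an illustrative sketch: the `MethodEntry` type, its field names, and the example methods are assumptions made here, not structures from the manuscript.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MethodEntry:
    name: str
    dominant_stage: str                        # "before", "during", or "after"
    other_stages: frozenset = frozenset()      # stages the method also touches
    orthogonal_tags: frozenset = frozenset()   # e.g. {"prompting"}; stage-independent

    @property
    def is_hybrid(self):
        """True when the method touches a stage other than its dominant one."""
        return bool(self.other_stages - {self.dominant_stage})

def group_by_dominant_stage(entries):
    """File each method under exactly one stage, keeping the partition intact."""
    groups = {"before": [], "during": [], "after": []}
    for entry in entries:
        groups[entry.dominant_stage].append(entry.name)
    return groups

# Hypothetical entries, used only to show the bookkeeping.
entries = [
    MethodEntry("router_x", "before"),
    MethodEntry("select_then_verify_z", "before", frozenset({"after"})),
    MethodEntry("vote_v", "after", orthogonal_tags=frozenset({"prompting"})),
]
groups = group_by_dominant_stage(entries)
hybrids = [e.name for e in entries if e.is_hybrid]
```

Grouping by `dominant_stage` keeps every method in exactly one bucket, while `other_stages` and `orthogonal_tags` record the hybrid and prompting aspects without adding a fourth category.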

Circularity Check

0 steps flagged

No circularity: descriptive survey with proposed taxonomy


This paper is a literature review that introduces a three-stage taxonomy solely to organize existing LLM ensemble methods; no derivations, equations, fitted parameters, or predictions are present. The taxonomy is explicitly presented as an organizing framework rather than a result derived from prior claims, and the 'first systematic review' statement rests on coverage of the literature rather than any self-referential reduction. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The work is self-contained as a descriptive survey.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, it does not introduce new free parameters, axioms, or invented entities; it synthesizes and organizes existing literature on LLM ensembles.

pith-pipeline@v0.9.0 · 5737 in / 1024 out tokens · 37711 ms · 2026-05-23T02:20:10.436771+00:00 · methodology


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sampling from Your Language Model One Byte at a Time
cs.CL 2025-06 unverdicted novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers
cs.LG 2025-05 conditional novelty 7.0

A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of mode...

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
cs.LG 2026-05 unverdicted novelty 6.0

LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

Rethinking LLM Ensembling from the Perspective of Mixture Models
cs.LG 2026-05 unverdicted novelty 6.0

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.

Token-Level LLM Collaboration via FusionRoute
cs.AI 2026-01 unverdicted novelty 6.0

FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and mergi...

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
eess.SP 2026-04 unverdicted novelty 5.0

SpecFed accelerates federated LLM inference via speculative decoding for parallel processing and top-K compression with server-side reconstruction, achieving high fidelity with reduced communication overhead.

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
cs.CL 2025-12 unverdicted novelty 5.0

LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

LLM-Powered AI Agent Systems and Their Applications in Industry
cs.AI 2025-05 unverdicted novelty 2.0

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
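The mixture-model entry in the list above rests on a standard probability fact: drawing a model index according to the ensemble weights and then drawing a token from that one model yields the same marginal distribution as the weighted average of all models. The sketch below checks this for a single decoding step. It is not the cited paper's implementation; the weights, the two toy distributions, the seed, and the function names are all assumptions chosen for illustration.

```python
import random

def mixture(weights, dists):
    """Ensemble distribution: weighted average of per-model token distributions."""
    vocab = sorted(set().union(*dists))
    return {t: sum(w * d.get(t, 0.0) for w, d in zip(weights, dists))
            for t in vocab}

def sample_one_model_per_step(rng, weights, dists):
    """Draw a model index by weight, then draw a token from that model only."""
    k = rng.choices(range(len(dists)), weights=weights, k=1)[0]
    tokens = list(dists[k])
    return rng.choices(tokens, weights=[dists[k][t] for t in tokens], k=1)[0]

# Hypothetical two-model, two-token setup.
weights = [0.7, 0.3]
dists = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
target = mixture(weights, dists)  # P(a) = 0.7*0.9 + 0.3*0.2 = 0.69

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
counts = {"a": 0, "b": 0}
for _ in range(n):
    counts[sample_one_model_per_step(rng, weights, dists)] += 1
empirical = {t: c / n for t, c in counts.items()}
```

With enough samples, `empirical` should approach `target` even though each draw queries only one of the two toy models, which is the source of the speed gain the entry describes.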

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 8 Pith papers · 11 internal anchors

[1]

GPT-4 Technical Report

[Achiamet al., 2023 ] Josh Achiam, Steven Adler, Sandhini Agar- wal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Automix: Automatically mixing language models

[Aggarwalet al., 2023 ] Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. Automix: Automatically mixing language models. arXiv preprint arXiv:2310.12963,

work page arXiv 2023
[3]

Structured probabilistic end-to-end learning from crowds

[Chenet al., 2021 ] Zhijun Chen, Huimin Wang, Hailong Sun, Pengpeng Chen, Tao Han, Xudong Liu, and Jie Yang. Structured probabilistic end-to-end learning from crowds. InIJCAI,

work page 2021
[4]

Adversarial learning from crowds

[Chenet al., 2022 ] Pengpeng Chen, Hailong Sun, Yongqiang Yang, and Zhijun Chen. Adversarial learning from crowds. InAAAI,

work page 2022
[5]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

[Chenet al., 2023a ] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv:2305.05176,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

A survey on collaborative mechanisms between large and small lan- guage models.arXiv preprint arXiv:2505.07460,

[Chenet al., 2025 ] Yi Chen, JiaHao Zhao, and HaoHao Han. A survey on collaborative mechanisms between large and small lan- guage models.arXiv preprint arXiv:2505.07460,

work page arXiv 2025
[7]

A unified approach to routing and cascad- ing for llms.arXiv preprint arXiv:2410.10347,

[Dekonincket al., 2024 ] Jasper Dekoninck, Maximilian Baader, and Martin Vechev. A unified approach to routing and cascad- ing for llms.arXiv preprint arXiv:2410.10347,

work page arXiv 2024
[8]

Hybrid llm: Cost- efficient and quality-aware query routing

[Dinget al., 2024 ] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor R ¨uhle, Laks VS Laksh- manan, and Ahmed Hassan Awadallah. Hybrid llm: Cost- efficient and quality-aware query routing. InICLR,

work page 2024
[9]

A survey on ensemble learning.Frontiers of Computer Science, 14(2):241–258,

[Donget al., 2020 ] Xibin Dong, Zhiwen Yu, Wenming Cao, Yifan Shi, and Qianli Ma. A survey on ensemble learning.Frontiers of Computer Science, 14(2):241–258,

work page 2020
[10]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

[Duet al., 2023 ] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and rea- soning in language models through multiagent debate.arXiv preprint arXiv:2305.14325,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Bayesian calibration of win rate estimation with llm evaluators.arXiv preprint arXiv:2411.04424,

[Gaoet al., 2024 ] Yicheng Gao, Gonghan Xu, Zhe Wang, and Ar- man Cohan. Bayesian calibration of win rate estimation with llm evaluators.arXiv preprint arXiv:2411.04424,

work page arXiv 2024
[12]

Smoothie: Label free language model routing

[Guhaet al., 2024 ] Neel Guha, Mayee F Chen, Trevor Chow, Is- han S Khare, and Christopher Re. Smoothie: Label free language model routing. InNeuIPS,

work page 2024
[13]

Promptmind team at mediqa-corr 2024: Im- proving clinical text correction with error categorization and llm ensembles.arXiv preprint arXiv:2405.08373,

[Gundabathula and Kolar, 2024] Satya Kesav Gundabathula and Sriram R Kolar. Promptmind team at mediqa-corr 2024: Im- proving clinical text correction with error categorization and llm ensembles.arXiv preprint arXiv:2405.08373,

work page arXiv 2024
[14]

Language model cascades: Token-level uncer- tainty and beyond.arXiv preprint arXiv:2404.10136,

[Guptaet al., 2024 ] Neha Gupta, Harikrishna Narasimhan, Wit- tawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. Language model cascades: Token-level uncer- tainty and beyond.arXiv preprint arXiv:2404.10136,

work page arXiv 2024
[15]

Dynamic ensemble reasoning for llm experts.arXiv preprint arXiv:2412.07448,

[Huet al., 2024a ] Jinwu Hu, Yufeng Wang, Shuhai Zhang, Kai Zhou, Guohao Chen, Yu Hu, Bin Xiao, and Mingkui Tan. Dynamic ensemble reasoning for llm experts.arXiv preprint arXiv:2412.07448,

work page arXiv
[16]

RouterBench: A Benchmark for Multi-LLM Routing System

[Huet al., 2024b ] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system.arXiv preprint arXiv:2403.12031,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Ensem- ble learning for heterogeneous large language models with deep parallel collaboration

[Huanget al., 2024 ] Yichong Huang, Xiaocheng Feng, Baohang Li, Yang Xiang, Hui Wang, Ting Liu, and Bing Qin. Ensem- ble learning for heterogeneous large language models with deep parallel collaboration. InNeurIPS,

work page 2024
[18]

Llm-blender: Ensembling large language models with pairwise ranking and generative fusion

[Jianget al., 2023 ] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. InACL,

work page 2023
[19]

Collaborative decoding of critical tokens for boosting factuality of large language models.arXiv preprint arXiv:2402.17982,

[Jinet al., 2024 ] Lifeng Jin, Baolin Peng, Linfeng Song, Haitao Mi, Ye Tian, and Dong Yu. Collaborative decoding of critical tokens for boosting factuality of large language models.arXiv preprint arXiv:2402.17982,

work page arXiv 2024
[20]

When does confidence-based cascade deferral suffice? NeurIPS, 36,

[Jitkrittumet al., 2024 ] Wittawat Jitkrittum, Neha Gupta, Aditya K Menon, Harikrishna Narasimhan, Ankit Rawat, and Sanjiv Ku- mar. When does confidence-based cascade deferral suffice? NeurIPS, 36,

work page 2024
[21]

Ensemble-instruct: Generating instruction-tuning data with a heterogeneous mixture of lms.arXiv preprint arXiv:2310.13961,

[Leeet al., 2023 ] Young-Suk Lee, Md Arafat Sultan, Yousef El- Kurdi, Tahira Naseem Asim Munawar, Radu Florian, Salim Roukos, and Ram ´on Fernandez Astudillo. Ensemble-instruct: Generating instruction-tuning data with a heterogeneous mixture of lms.arXiv preprint arXiv:2310.13961,

work page arXiv 2023
[22]

More agents is all you need

[Liet al., 2024a ] Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need.arXiv preprint arXiv:2402.05120,

work page arXiv
[23]

Purifying large language models by ensembling a small language model,

[Liet al., 2024b ] Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, and Min Lin. Purifying large language models by ensembling a small language model.arXiv preprint arXiv:2402.14845,

work page arXiv
[24]

Llm bandit: Cost-efficient llm generation via preference-conditioned dynamic routing.arXiv preprint arXiv:2502.02743,

[Li, 2025] Yang Li. Llm bandit: Cost-efficient llm generation via preference-conditioned dynamic routing.arXiv preprint arXiv:2502.02743,

work page arXiv 2025
[25]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

[Liuet al., 2024a ] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Cool-fusion: Fuse large language models without training.arXiv preprint arXiv:2407.19807,

[Liuet al., 2024b ] Cong Liu, Xiaojun Quan, Yan Pan, Liang Lin, Weigang Wu, and Xu Chen. Cool-fusion: Fuse large language models without training.arXiv preprint arXiv:2407.19807,

work page arXiv
[27]

Merge, ensemble, and cooperate! a survey on collabora- tive strategies in the era of large language models,

[Luet al., 2024a ] Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, and Jiajun Zhang. Merge, ensemble, and coop- erate! a survey on collaborative strategies in the era of large lan- guage models.arXiv preprint arXiv:2407.06089,

work page arXiv
[28]

Routing to the expert: Efficient reward-guided ensemble of large language models

[Luet al., 2024b ] Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. InNAACL, pages 1964–1974,

work page 1964
[29]

Blending is all you need: Cheaper, better alterna- tive to trillion-parameters llm.arXiv preprint arXiv:2401.02994,

[Luet al., 2024c ] Xiaoding Lu, Zongyi Liu, Adian Liusie, Vyas Raina, Vineet Mudupalli, Yuwen Zhang, and William Beauchamp. Blending is all you need: Cheaper, better alterna- tive to trillion-parameters llm.arXiv preprint arXiv:2401.02994,

work page arXiv
[30]

Specfuse: Ensembling large language models via next-segment prediction.arXiv preprint arXiv:2412.07380,

[Lvet al., 2024b ] Bo Lv, Chen Tang, Yanan Zhang, Xin Liu, Yue Yu, and Ping Luo. Specfuse: Ensembling large language models via next-segment prediction.arXiv preprint arXiv:2412.07380,

work page arXiv
[31]

Selectllm: Query-aware efficient selec- tion algorithm for large language models.arXiv preprint arXiv:2408.08545,

[Mauryaet al., 2024 ] Kaushal Kumar Maurya, KV Srivatsa, and Ekaterina Kochmar. Selectllm: Query-aware efficient selec- tion algorithm for large language models.arXiv preprint arXiv:2408.08545,

work page arXiv 2024
[32]

Pack of llms: Model fusion at test-time via perplexity optimization.arXiv preprint arXiv:2404.11531,

[Mavromatiset al., 2024 ] Costas Mavromatis, Petros Karypis, and George Karypis. Pack of llms: Model fusion at test-time via perplexity optimization.arXiv preprint arXiv:2404.11531,

work page arXiv 2024
[33]

Routoo: Learning to route to large language models effectively.arXiv preprint arXiv:2401.13979,

[Mohammadshahiet al., 2024 ] Alireza Mohammadshahi, Ar- shad Rafiq Shaikh, and Majid Yazdani. Routoo: Learning to route to large language models effectively.arXiv preprint arXiv:2401.13979,

work page arXiv 2024
[34]

A comprehensive review on ensemble deep learn- ing: Opportunities and challenges.Journal of King Saud University-Computer and Information Sciences, 35(2):757–774,

[Mohammed and Kora, 2023] Ammar Mohammed and Rania Kora. A comprehensive review on ensemble deep learn- ing: Opportunities and challenges.Journal of King Saud University-Computer and Information Sciences, 35(2):757–774,

work page 2023
[35]

Relative representations enable zero-shot latent space communication.arXiv:2209.15430,

[Moschellaet al., 2022 ] Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodol `a. Relative representations enable zero-shot latent space communication.arXiv:2209.15430,

work page arXiv 2022
[36]

Adaptive selection for homogeneous tools: An instantiation in the rag scenario.arXiv preprint arXiv:2406.12429,

[Muet al., 2024 ] Feiteng Mu, Yong Jiang, Liwen Zhang, Chu Liu, Wenjie Li, Pengjun Xie, and Fei Huang. Adaptive selection for homogeneous tools: An instantiation in the rag scenario.arXiv preprint arXiv:2406.12429,

work page arXiv 2024
[37]

Metallm: A high-performant and cost-efficient dynamic frame- work for wrapping llms.arXiv preprint arXiv:2407.10834,

[Nguyenet al., 2024 ] Quang H Nguyen, Duy C Hoang, Juliette De- cugis, Saurav Manchanda, Nitesh V Chawla, and Khoa D Doan. Metallm: A high-performant and cost-efficient dynamic frame- work for wrapping llms.arXiv preprint arXiv:2407.10834,

work page arXiv 2024
[38]

RouteLLM: Learning to Route LLMs with Preference Data

[Onget al., 2024 ] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei- Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Ensembling large language models with process reward-guided tree search for better complex reasoning.arXiv preprint arXiv:2412.15797,

[Parket al., 2024 ] Sungjin Park, Xiao Liu, Yeyun Gong, and Ed- ward Choi. Ensembling large language models with process reward-guided tree search for better complex reasoning.arXiv preprint arXiv:2412.15797,

work page arXiv 2024
[40]

Cache & distil: Optimising api calls to large language models.arXiv preprint arXiv:2310.13561,

[Ram´ırezet al., 2023] Guillem Ram ´ırez, Matthias Lindemann, Alexandra Birch, and Ivan Titov. Cache & distil: Optimising api calls to large language models.arXiv preprint arXiv:2310.13561,

work page arXiv 2023
[41]

Snorkel: Rapid training data creation with weak supervision

[Ratneret al., 2017 ] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher R ´e. Snorkel: Rapid training data creation with weak supervision. InProceed- ings of the VLDB endowment. International conference on very large data bases, volume 11, page 269,

work page 2017
[42]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

[Reimers, 2019] N Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084,

work page internal anchor Pith review Pith/arXiv arXiv 2019
[43]

From task-specific models to unified sys- tems: A review of model merging approaches.arXiv preprint arXiv:2503.08998,

[Ruanet al., 2025 ] Wei Ruan, Tianze Yang, Yifan Zhou, Tianming Liu, and Jin Lu. From task-specific models to unified sys- tems: A review of model merging approaches.arXiv preprint arXiv:2503.08998,

work page arXiv 2025
[44]

Fly-swat or cannon? cost-effective language model choice via meta-modeling

[ˇSakotaet al., 2024 ] Marija ˇSakota, Maxime Peyrard, and Robert West. Fly-swat or cannon? cost-effective language model choice via meta-modeling. InWSDM, pages 606–615,

work page 2024
[45]

Large language model routing with bench- mark datasets

[Shnitzeret al., 2023 ] Tal Shnitzer, Anthony Ou, Mirian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with bench- mark datasets. InNeurIPS,

work page 2023
[46]

Getting more out of mixture of language model reasoning experts

[Siet al., 2023 ] Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettle- moyer, and Jordan Boyd-Graber. Getting more out of mixture of language model reasoning experts. InFindings of EMNLP,

work page 2023
[47]

Pickllm: Context-aware rl-assisted large lan- guage model routing.arXiv preprint arXiv:2412.12170,

[Sikeridiset al., 2024 ] Dimitrios Sikeridis, Dennis Ramdass, and Pranay Pareek. Pickllm: Context-aware rl-assisted large lan- guage model routing.arXiv preprint arXiv:2412.12170,

work page arXiv 2024
[48]

Harnessing the power of multiple minds: Lessons learned from llm routing.arXiv preprint arXiv:2405.00467,

[Srivatsaet al., 2024 ] KV Srivatsa, Kaushal Kumar Maurya, and Ekaterina Kochmar. Harnessing the power of multiple minds: Lessons learned from llm routing.arXiv preprint arXiv:2405.00467,

work page arXiv 2024
[49]

Tensoropera router: A multi-model router for efficient llm inference.arXiv preprint arXiv:2408.12320,

[Stripeliset al., 2024 ] Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, and Chaoyang He. Tensoropera router: A multi-model router for efficient llm inference.arXiv preprint arXiv:2408.12320,

work page arXiv 2024
[50]

Gemini: A Family of Highly Capable Multimodal Models

[Teamet al., 2023 ] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Llm-topla: Efficient llm ensemble by maximising diversity

[Tekinet al., 2024 ] Selim Tekin, Fatih Ilhan, Tiansheng Huang, Si- hao Hu, and Ling Liu. Llm-topla: Efficient llm ensemble by maximising diversity. InFindings of EMNLP,

work page 2024
[52]

LLaMA: Open and Efficient Foundation Language Models

[Touvronet al., 2023 ] Hugo Touvron, Thibaut Lavril, Gautier Izac- ard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

[Tranet al., 2025 ] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms.arXiv preprint arXiv:2501.06322,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

[Varshney and Baral, 2022] Neeraj Varshney and Chitta Baral. Model cascading: Towards jointly improving efficiency and accuracy of nlp systems. arXiv preprint arXiv:2210.05528, 2022.
[55]

[Wang et al., 2024] Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, and Wenjun Wu. Bench-coe: a framework for collaboration of experts from benchmark. arXiv preprint arXiv:2412.04167, 2024.
[56]

[Xu et al., 2024] Yangyifan Xu, Jinliang Lu, and Jiajun Zhang. Bridging the gap between different vocabularies for llm ensemble. In NAACL, pages 7133–7145, 2024.
[57]

[Xu et al., 2025] Yangyifan Xu, Jianghao Chen, Junhong Wu, and Jiajun Zhang. Hit the sweet spot! span-level ensemble for large language models. In COLING, pages 8314–8325, 2025.
[58]

[Yang et al., 2024] Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666, 2024.
[59]

[Yang et al., 2025] Zongzhen Yang, Binhang Qi, Hailong Sun, Wenrui Long, Ruobing Zhao, and Xiang Gao. Cabs: Conflict-aware and balanced sparsification for enhancing model merging. arXiv preprint arXiv:2503.01874, 2025.
[60]

[Yao et al., 2024] Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, and Linqi Song. Determine-then-ensemble: Necessity of top-k union for large language model ensembling. arXiv preprint arXiv:2410.03777, 2024.
[61]

[Yu et al., 2024] Yao-Ching Yu, Chun-Chih Kuo, Ziqi Ye, Yu-Cheng Chang, and Yueh-Se Li. Breaking the ceiling of the llm community by treating token generation as a classification for ensembling. arXiv preprint arXiv:2406.12585, 2024.
[62]

[Yue et al., 2024] Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. Large language model cascades with mixture of thought representations for cost-efficient reasoning. In ICLR, 2024.
[63]

[Zhang et al., 2021] Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. Wrench: A comprehensive benchmark for weak supervision. arXiv preprint arXiv:2109.11377, 2021.
[64]

[Zhang et al., 2022] Jieyu Zhang, Cheng-Yu Hsieh, Yue Yu, Chao Zhang, and Alexander Ratner. A survey on programmatic weak supervision. arXiv preprint arXiv:2202.05433, 2022.
[65]

[Zhang et al., 2023] Jieyu Zhang, Ranjay Krishna, Ahmed H Awadallah, and Chi Wang. Ecoassistant: Using llm assistant more affordably and accurately. arXiv preprint arXiv:2310.03046, 2023.
[66]

[Zhang et al., 2025] Hangfan Zhang, Zhiyao Cui, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, and Shuyue Hu. If multi-agent debate is the answer, what is the question. arXiv preprint arXiv:2502.08788, 2025.
[67]

[Zhang, 2022] Jing Zhang. Knowledge learning with crowdsourcing: A brief review and systematic perspective. IEEE/CAA Journal of Automatica Sinica, 9(5):749–762, 2022.
[68]

[Zhao et al., 2024] Zesen Zhao, Shuowei Jin, and Z Morley Mao. Eagle: Efficient training-free router for multi-llm inference. arXiv preprint arXiv:2409.15518, 2024.
[69]

[Zheng et al., 2017] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5):541–552, 2017.
[70]

[Zheng et al., 2025] Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P Xing, Hongyi Wang, and Huaxiu Yao. Citer: Collaborative inference for efficient large language model decoding with token-level routing. arXiv preprint arXiv:2502.01976, 2025.
[71]

[Zhou, 2021] Zhi-Hua Zhou. Ensemble learning. In Machine learning, pages 181–210. Springer, 2021.
