
arxiv: 2502.18036 · v6 · submitted 2025-02-25 · 💻 cs.CL

Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

Pith reviewed 2026-05-23 02:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM Ensemble, Large Language Models, Ensemble Methods, Taxonomy, Survey, Inference Stages, Benchmarks, Applications

The pith

LLM ensemble methods can be systematically reviewed and classified using a three-stage taxonomy based on when the combination occurs relative to inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first systematic review of LLM Ensemble techniques, which combine multiple large language models to exploit their individual strengths on user queries. It proposes a taxonomy that divides these methods into ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference, then classifies and reviews the methods within each category. The survey also examines related benchmarks and applications, summarizes the studies, and outlines future research directions. A sympathetic reader would care because, as the number of available LLMs grows, knowing how best to combine them becomes directly relevant to improving performance on downstream tasks.

Core claim

The paper claims to deliver the first comprehensive taxonomy and review of LLM Ensemble, showing that methods fall into three distinct stages relative to the inference process, with each stage containing specific techniques that can be reviewed and compared through existing benchmarks and applications.

What carries the argument

The three-stage taxonomy (ensemble-before-inference, ensemble-during-inference, ensemble-after-inference) that organizes all LLM ensemble methods for review and classification.
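As a rough illustration of where the combination happens in each of the three stages, the following Python sketch uses toy stand-ins for models. Everything in it is an assumption made for illustration and is not taken from the paper: the `Model` type, the function names, the hand-written router, the plain averaging of next-token distributions, and the majority vote are all hypothetical choices, and real methods in each category are considerably more involved.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

# Hypothetical stand-in for an LLM: maps a query string to a complete response.
Model = Callable[[str], str]


def ensemble_before_inference(query: str, models: Dict[str, Model],
                              router: Callable[[str], str]) -> str:
    """Choose ONE model before any generation happens, then run only that model."""
    chosen = router(query)  # the router sees the query, not any model output
    return models[chosen](query)


def ensemble_during_inference(step_dists: Sequence[Dict[str, float]]) -> str:
    """Combine per-model next-token distributions at a single decoding step.

    Plain averaging followed by argmax is an illustrative choice only.
    """
    vocab = set().union(*step_dists)
    averaged = {tok: sum(d.get(tok, 0.0) for d in step_dists) / len(step_dists)
                for tok in vocab}
    # Iterating in sorted order makes tie-breaking deterministic.
    return max(sorted(averaged), key=averaged.get)


def ensemble_after_inference(query: str, models: Dict[str, Model]) -> str:
    """Run every model to completion, then aggregate the finished responses."""
    responses = [model(query) for model in models.values()]
    return Counter(responses).most_common(1)[0][0]  # simple majority vote


if __name__ == "__main__":
    toy_models = {"small": lambda q: "4", "medium": lambda q: "4", "large": lambda q: "5"}
    print(ensemble_before_inference("2+2?", toy_models, lambda q: "small"))
    print(ensemble_during_inference([{"cat": 0.6, "dog": 0.4}, {"cat": 0.2, "dog": 0.8}]))
    print(ensemble_after_inference("2+2?", toy_models))
```

The point of the sketch is the difference in what each function receives: the first sees only the query, the second sees intermediate per-step distributions, and the third sees only finished responses.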

If this is right

  • Existing methods can be mapped onto the taxonomy without major omissions.
  • The review identifies gaps that future work can address in each stage.
  • Benchmarks provide a way to evaluate and compare ensemble approaches.
  • Applications demonstrate practical uses across various domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could guide the development of new hybrid methods that operate across multiple stages.
  • Practitioners might use the classification to choose ensemble strategies based on their computational constraints.
  • Future surveys could expand the taxonomy if new methods emerge that challenge the three-stage division.
  • Connections between LLM Ensemble and other multi-model techniques like mixture-of-experts may warrant further investigation.

Load-bearing premise

That the three-stage taxonomy comprehensively partitions the space of all relevant LLM ensemble methods without significant omissions or overlaps.

What would settle it

Discovery of an LLM ensemble technique that requires a fourth distinct category or cannot fit into the existing three without substantial overlap would challenge the taxonomy's completeness.

Figures

Figures reproduced from arXiv: 2502.18036 by Dingqi Yang, Hailong Sun, Jingzheng Li, Kai Sun, Likang Xiao, Ming Li, Pengpeng Chen, Philip S. Yu, Qianren Mao, Xiaodong Lu, Xiao Huang, Yikun Ban, Yuankai Luo, Zhijun Chen, Zhuoran Li.

Figure 1: Illustration of the LLM Ensemble taxonomy. (Note that for …
Figure 2: Taxonomy of LLM Ensemble methods.
3 Methodology. In this section, following the taxonomy in Section 2.1, we systematically review the three types of methods (ensemble before inference, ensemble during inference, and ensemble after inference) in Sections 3.1, 3.2, and 3.3, respectively. 3.1 Ensemble Before Inference. As mentioned in Section 2.1, two categories of ensemble-before-inference approaches exist: pret…
Original abstract

LLM Ensemble -- which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths -- has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at https://github.com/junchenzhi/Awesome-LLM-Ensemble.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to present the first systematic review of recent developments in LLM Ensemble, which uses multiple LLMs to leverage their individual strengths for downstream inference. It introduces a taxonomy partitioning methods into ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference; provides in-depth classification and review of methods under these categories; discusses related research problems, benchmarks, and applications; summarizes existing studies; and suggests future directions, supported by a curated GitHub list of papers.

Significance. If the taxonomy is shown to be comprehensive, the survey would offer a useful organizing framework for the rapidly growing LLM ensemble literature in NLP, helping researchers identify patterns, gaps, and connections across methods. The public GitHub repository of curated papers is a clear strength, enhancing accessibility and reproducibility of the review.

major comments (1)
  1. [Taxonomy introduction and classification sections] The claim of presenting the 'first systematic review' (abstract) rests on the three-stage taxonomy being exhaustive and non-overlapping. The manuscript does not explicitly analyze or rule out hybrid methods (e.g., dynamic model selection that combines before- and during-inference stages) or orthogonal dimensions (e.g., ensembles over prompting strategies), which could produce overlaps or omissions and undermine the partition's completeness as a systematic structure.
minor comments (1)
  1. [Introduction] The abstract states that 'several related research problems' are discussed, but the manuscript could clarify in the introduction or taxonomy section how these problems map onto the three-stage structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our taxonomy. We address the major comment below and will revise the manuscript to incorporate the suggested analysis.

Point-by-point responses
  1. Referee: [Taxonomy introduction and classification sections] The claim of presenting the 'first systematic review' (abstract) rests on the three-stage taxonomy being exhaustive and non-overlapping. The manuscript does not explicitly analyze or rule out hybrid methods (e.g., dynamic model selection that combines before- and during-inference stages) or orthogonal dimensions (e.g., ensembles over prompting strategies), which could produce overlaps or omissions and undermine the partition's completeness as a systematic structure.

    Authors: We appreciate the referee's observation. Our taxonomy partitions methods according to the primary stage (before, during, or after inference) at which the ensemble decision or aggregation occurs, which provides a clear and actionable organizing principle for the literature. We agree that the manuscript does not explicitly analyze hybrid methods or orthogonal dimensions such as prompting-strategy ensembles. To strengthen the taxonomy section, we will add a dedicated paragraph (or short subsection) that (1) acknowledges the possibility of hybrid approaches, (2) illustrates how a method that spans stages can still be classified by its dominant stage while noting the hybrid aspect, and (3) clarifies that orthogonal dimensions (e.g., prompting) are largely independent of the stage-based partition and can be applied across categories. This revision will make the boundaries of the taxonomy more transparent without altering its core structure or the claim of providing the first systematic review organized around these stages. revision: yes
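One way to picture the classification rule the simulated rebuttal proposes (file each method under its dominant stage, record any other stages it touches, and treat dimensions such as prompting as independent tags) is a small record type. The sketch below is hypothetical throughout: `Stage`, `MethodEntry`, `group_by_primary_stage`, and the placeholder method names are invented for illustration and do not come from the paper or the rebuttal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterable, List


class Stage(Enum):
    BEFORE = "ensemble-before-inference"
    DURING = "ensemble-during-inference"
    AFTER = "ensemble-after-inference"


@dataclass(frozen=True)
class MethodEntry:
    """Hypothetical catalogue record for one ensemble method."""
    name: str
    primary_stage: Stage                             # dominant stage; decides where it is filed
    secondary_stages: FrozenSet[Stage] = frozenset()  # other stages it also touches
    orthogonal_tags: FrozenSet[str] = frozenset()     # e.g. prompting; independent of stage

    @property
    def is_hybrid(self) -> bool:
        # Hybrid means the method touches at least one stage besides its primary one.
        return bool(self.secondary_stages - {self.primary_stage})


def group_by_primary_stage(entries: Iterable[MethodEntry]) -> Dict[Stage, List[str]]:
    """File every method under exactly one stage, so the partition has no overlaps."""
    groups: Dict[Stage, List[str]] = {stage: [] for stage in Stage}
    for entry in entries:
        groups[entry.primary_stage].append(entry.name)
    return groups


if __name__ == "__main__":
    catalogue = [
        MethodEntry("router-X", Stage.BEFORE),  # placeholder names, not real methods
        MethodEntry("hybrid-Y", Stage.DURING,
                    secondary_stages=frozenset({Stage.BEFORE}),
                    orthogonal_tags=frozenset({"prompt-ensemble"})),
    ]
    for stage, names in group_by_primary_stage(catalogue).items():
        print(stage.value, names)
```

Under this design the partition stays non-overlapping by construction, while the `is_hybrid` flag keeps the cross-stage information that the referee's objection is concerned with.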

Circularity Check

0 steps flagged

No circularity: descriptive survey with proposed taxonomy

full rationale

This paper is a literature review that introduces a three-stage taxonomy solely to organize existing LLM ensemble methods; no derivations, equations, fitted parameters, or predictions are present. The taxonomy is explicitly presented as an organizing framework rather than a result derived from prior claims, and the 'first systematic review' statement rests on coverage of the literature rather than any self-referential reduction. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The work is self-contained as a descriptive survey.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, it does not introduce new free parameters, axioms, or invented entities; it synthesizes and organizes existing literature on LLM ensembles.

pith-pipeline@v0.9.0 · 5737 in / 1024 out tokens · 37711 ms · 2026-05-23T02:20:10.436771+00:00 · methodology


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sampling from Your Language Model One Byte at a Time

    cs.CL 2025-06 unverdicted novelty 7.0

    An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

  2. Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

    cs.LG 2025-05 conditional novelty 7.0

    A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of mode...

  3. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

    cs.LG 2026-05 unverdicted novelty 6.0

    LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

  4. Rethinking LLM Ensembling from the Perspective of Mixture Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.

  5. Token-Level LLM Collaboration via FusionRoute

    cs.AI 2026-01 unverdicted novelty 6.0

    FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and mergi...

  6. SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

    eess.SP 2026-04 unverdicted novelty 5.0

    SpecFed accelerates federated LLM inference via speculative decoding for parallel processing and top-K compression with server-side reconstruction, achieving high fidelity with reduced communication overhead.

  7. Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

    cs.CL 2025-12 unverdicted novelty 5.0

    LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

  8. LLM-Powered AI Agent Systems and Their Applications in Industry

    cs.AI 2025-05 unverdicted novelty 2.0

    A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
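Item 4 in the list above summarizes a claim that is easy to check numerically: sampling a single model at each token step can match the ensemble distribution. The sketch below assumes the ensemble distribution is a weighted average of per-model next-token distributions, P(x) = Σ_k w_k p_k(x); that assumption, the toy distributions, and all function names are illustrative and are not taken from the cited paper. Drawing a model index k with probability w_k and then a token from p_k has exactly this marginal, so the empirical frequencies should approach the averaged distribution while only one model is queried per step.

```python
import random
from collections import Counter
from typing import Dict, Sequence

Dist = Dict[str, float]  # token -> probability


def ensemble_distribution(weights: Sequence[float], dists: Sequence[Dist]) -> Dist:
    """Weighted average of per-model next-token distributions (assumed ensemble rule)."""
    vocab = sorted(set().union(*dists))
    return {tok: sum(w * d.get(tok, 0.0) for w, d in zip(weights, dists))
            for tok in vocab}


def sample_one_model_per_step(weights: Sequence[float], dists: Sequence[Dist],
                              rng: random.Random) -> str:
    """Pick one model with probability w_k, then sample a token from that model only."""
    k = rng.choices(range(len(dists)), weights=weights, k=1)[0]
    tokens = sorted(dists[k])
    probs = [dists[k][tok] for tok in tokens]
    return rng.choices(tokens, weights=probs, k=1)[0]


def empirical_distribution(weights: Sequence[float], dists: Sequence[Dist],
                           n_samples: int, seed: int = 0) -> Dist:
    """Token frequencies observed when every step queries a single sampled model."""
    rng = random.Random(seed)
    counts = Counter(sample_one_model_per_step(weights, dists, rng)
                     for _ in range(n_samples))
    vocab = sorted(set().union(*dists))
    return {tok: counts[tok] / n_samples for tok in vocab}


if __name__ == "__main__":
    w = [0.7, 0.3]
    p = [{"a": 0.5, "b": 0.5, "c": 0.0}, {"a": 0.1, "b": 0.2, "c": 0.7}]
    print("averaged :", ensemble_distribution(w, p))
    print("empirical:", empirical_distribution(w, p, n_samples=50_000))
```

The check covers a single decoding step with fixed distributions; it says nothing about the speed gains or sequence-level behaviour reported in the cited summary.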
