pith. machine review for the scientific record

arxiv: 2406.04692 · v1 · submitted 2024-06-07 · 💻 cs.CL

Recognition: 3 theorem links

Mixture-of-Agents Enhances Large Language Model Capabilities

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords Mixture-of-Agents · Large language models · Multi-agent systems · LLM ensembles · AlpacaEval benchmark · Model collaboration

The pith

A layered mixture of multiple LLM agents outperforms GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK by using prior-layer outputs as auxiliary context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Mixture-of-Agents architecture that stacks layers of large language models, where each agent receives all outputs from the previous layer to inform its own response. This setup produces state-of-the-art results on standard benchmarks, with an all-open-source version reaching 65.1 percent on AlpacaEval 2.0 compared with 57.5 percent for GPT-4 Omni. A sympathetic reader cares because the method shows existing models can be combined to exceed the strongest single model without additional training or new data. The approach demonstrates that collective processing across layers can extract better answers than any individual agent provides alone.
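The layered flow described here can be sketched in a few lines of Python. This is a minimal sketch, assuming a generic chat-completion interface: `call_llm` is a hypothetical stand-in for any inference client, and the prompt wording is illustrative, not the paper's exact template.

```python
def build_prompt(question, prior):
    """Fold prior-layer outputs into the next prompt as auxiliary context."""
    if not prior:
        return question
    refs = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(prior))
    return (f"{question}\n\nResponses from previous models:\n{refs}\n\n"
            "Synthesize these into a single, higher-quality answer.")

def mixture_of_agents(question, layers, aggregator, call_llm):
    """Run a layered MoA pass.

    layers: list of lists of model names, one inner list per layer.
    call_llm(model, prompt) -> str is a hypothetical inference hook.
    """
    responses = []  # outputs from the previous layer (empty for layer 1)
    for layer in layers:
        prompt = build_prompt(question, responses)
        responses = [call_llm(model, prompt) for model in layer]
    # A final aggregator model synthesizes the last layer's outputs.
    return call_llm(aggregator, build_prompt(question, responses))
```

No weighting or filtering is applied: every prior-layer output is passed through verbatim, which is exactly the design choice the referee report below questions.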

Core claim

The paper claims that constructing a layered Mixture-of-Agents architecture, in which every agent in a layer takes the full set of outputs from agents in the prior layer as auxiliary information, yields responses that surpass GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK, including a 65.1 percent score for an open-source-only MoA versus 57.5 percent for GPT-4 Omni.

What carries the argument

The Mixture-of-Agents layered architecture, in which each agent conditions its generation on all prior-layer outputs as auxiliary context to refine the final answer.

If this is right

  • An ensemble of only open-source LLMs can lead the AlpacaEval 2.0 leaderboard by a substantial margin.
  • The same layered structure raises scores on MT-Bench and FLASK above GPT-4 Omni.
  • No additional training is required; performance gains come from routing information across agents in successive layers.
  • Diversity among the chosen base models supplies the collective expertise that drives the improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method implies that incremental layering can continue to extract value from additional agents as long as base-model quality remains high.
  • It connects to broader questions of how information should be aggregated across heterogeneous models without explicit weighting.
  • A direct test would vary layer count while holding the agent pool fixed and measure whether returns diminish after a small number of layers.
  • The approach may generalize to other sequential generation tasks where each step can usefully condition on multiple prior drafts.
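The ablation suggested above (fixed agent pool, varying depth) reduces to a small sweep. A minimal sketch, where `run_moa` and `evaluate` are hypothetical hooks for an MoA pipeline and a benchmark scorer:

```python
def layer_sweep(agent_pool, questions, run_moa, evaluate, max_layers=4):
    """Hold the agent pool fixed, vary stack depth, record benchmark scores.

    run_moa(question, layers) -> answer and evaluate(answers) -> float
    are hypothetical hooks, not part of the paper's released code.
    """
    scores = {}
    for depth in range(1, max_layers + 1):
        # Each layer reuses the same fixed pool, so only depth varies.
        layers = [list(agent_pool) for _ in range(depth)]
        answers = [run_moa(q, layers) for q in questions]
        scores[depth] = evaluate(answers)
    return scores  # diminishing returns would show up as a flattening curve
```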

Load-bearing premise

That feeding outputs from earlier agents as auxiliary input will improve response quality without adding noise or compounding mistakes from weaker models.

What would settle it

A controlled run in which replacing strong early-layer agents with weaker ones causes MoA accuracy to fall below the single best base model on AlpacaEval 2.0 would falsify the central claim.
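As a harness, that controlled run is two benchmark evaluations and a comparison. A sketch under stated assumptions: `score` is a hypothetical hook returning AlpacaEval-style accuracy for a given system, and the pools are illustrative.

```python
def settling_experiment(strong_pool, weak_pool, score):
    """Compare a weak-agent MoA against the best single base model.

    score(kind, models) -> float is a hypothetical benchmark hook:
    kind "moa" evaluates a Mixture-of-Agents built from `models`,
    kind "single" evaluates one base model alone.
    """
    moa_weak = score("moa", weak_pool)
    best_single = max(score("single", [m]) for m in strong_pool)
    # The central claim is falsified if the weak-agent MoA drops below
    # the strongest individual base model.
    falsified = moa_weak < best_single
    return falsified, {"moa_weak": moa_weak, "best_single": best_single}
```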

read the original abstract

Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Mixture-of-Agents (MoA), a layered architecture in which each layer contains multiple LLM agents and each agent receives the full set of outputs from the preceding layer as auxiliary context when generating its response. It reports that MoA ensembles, including those using only open-source models, achieve state-of-the-art results on AlpacaEval 2.0 (65.1 % vs. 57.5 % for GPT-4 Omni), MT-Bench, and FLASK.

Significance. If the reported gains are shown to be robust to controls for prompt construction and model selection, the work would demonstrate a practical, training-free route to combining existing LLMs that can exceed the performance of the strongest individual models, particularly valuable for open-source ensembles on instruction-following benchmarks.

major comments (3)
  1. [Section 3] Section 3 (MoA Architecture): the description states that each agent 'takes all the outputs from agents in the previous layer as auxiliary information' with no filtering or weighting; this leaves the central claim vulnerable to error propagation from weaker agents, yet no layer-wise ablation, single-layer baseline, or error-analysis experiment is provided to test the assumption.
  2. [Section 4.1] Section 4.1 (Experimental Setup): the manuscript gives no details on the precise models chosen for each layer, the number of layers, the prompt templates used to incorporate prior outputs, or any controls for prompt-engineering effects, making it impossible to determine whether the 65.1 % AlpacaEval score arises from the layered architecture or from other factors.
  3. [Section 4.2] Section 4.2 and Table 1 (Benchmark Results): the reported scores lack statistical significance tests, standard-error estimates, or multiple-run variance; without these, the claimed 'substantial gap' over GPT-4 Omni cannot be assessed as reliable.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'GPT-4 Omni' and 'GPT-4o' interchangeably; standardize the terminology throughout.
  2. [Figure 1] Figure 1 (architecture diagram) would benefit from explicit labels for layer indices and input/output arrows to clarify the flow of auxiliary context.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate additional experiments, details, and analyses as suggested.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (MoA Architecture): the description states that each agent 'takes all the outputs from agents in the previous layer as auxiliary information' with no filtering or weighting; this leaves the central claim vulnerable to error propagation from weaker agents, yet no layer-wise ablation, single-layer baseline, or error-analysis experiment is provided to test the assumption.

    Authors: We agree that the manuscript would benefit from explicit validation of the architecture's robustness. In the revision, we will add layer-wise ablation studies (removing individual layers), single-layer baselines, and an error-propagation analysis showing how later agents synthesize and correct outputs from prior layers. revision: yes

  2. Referee: [Section 4.1] Section 4.1 (Experimental Setup): the manuscript gives no details on the precise models chosen for each layer, the number of layers, the prompt templates used to incorporate prior outputs, or any controls for prompt-engineering effects, making it impossible to determine whether the 65.1 % AlpacaEval score arises from the layered architecture or from other factors.

    Authors: We will expand Section 4.1 and add a detailed appendix specifying the exact models per layer (e.g., Llama-3-70B, Mixtral-8x22B), number of layers (3 in main results), full prompt templates, and prompt-engineering controls such as equivalent single-model prompts to isolate the MoA contribution. revision: yes

  3. Referee: [Section 4.2] Section 4.2 and Table 1 (Benchmark Results): the reported scores lack statistical significance tests, standard-error estimates, or multiple-run variance; without these, the claimed 'substantial gap' over GPT-4 Omni cannot be assessed as reliable.

    Authors: We acknowledge the need for statistical rigor. The revision will report standard errors from multiple runs (where evaluation variance can be measured) and include significance tests (e.g., bootstrap or paired comparisons) to substantiate the gaps on AlpacaEval 2.0, MT-Bench, and FLASK. revision: yes
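The paired comparison the simulated authors propose can be sketched as a paired bootstrap over per-example judge scores. The inputs below are hypothetical win indicators from an AlpacaEval-style judge, not the paper's data:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often system A fails to beat system B under resampling.

    scores_a / scores_b: per-example scores for two systems on the SAME
    prompts (pairing preserves prompt-level difficulty). Returns the
    fraction of resamples with mean(A) <= mean(B); a small value suggests
    A's advantage is unlikely to be resampling noise.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores_a)
    worse = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample prompt indices
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff <= 0:
            worse += 1
    return worse / n_resamples
```

Such a test would directly quantify whether the reported gap over GPT-4 Omni survives evaluation variance, which is the substance of the referee's third major comment.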

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on fixed public datasets

full rationale

The paper proposes a layered MoA architecture in which each agent receives prior-layer outputs as auxiliary context and then reports direct empirical scores on AlpacaEval 2.0, MT-Bench, and FLASK. These scores are measured outcomes on held-out benchmarks rather than quantities derived from fitted parameters, self-referential definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed performance gains to the inputs by construction; the central claim therefore remains an independent empirical observation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard LLM prompting assumptions and empirical choices for architecture depth and agent count; no new physical or mathematical entities are introduced.

free parameters (2)
  • Number of layers
    Depth of the MoA stack is a design choice likely tuned on validation data.
  • Number of agents per layer
    How many LLMs participate in each layer is selected to balance performance and cost.
axioms (1)
  • domain assumption: LLM agents can usefully incorporate outputs from other models as auxiliary context in their prompts.
    The layered improvement mechanism depends on this property holding across the chosen models.
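For concreteness, the ledger's two free parameters map onto a small configuration object. The model names below are placeholders, not the paper's actual agent pool:

```python
from dataclasses import dataclass, field

@dataclass
class MoAConfig:
    """The ledger's two free parameters as explicit configuration."""
    num_layers: int = 3        # depth of the MoA stack (a tuned design choice)
    agents_per_layer: int = 4  # LLMs participating in each layer
    agent_pool: list = field(default_factory=lambda: [
        "open-model-a", "open-model-b", "open-model-c", "open-model-d"])

    def layers(self):
        """Materialize the stack; each layer reuses the same agent pool."""
        pool = self.agent_pool[: self.agents_per_layer]
        return [list(pool) for _ in range(self.num_layers)]
```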

pith-pipeline@v0.9.0 · 5471 in / 1165 out tokens · 28121 ms · 2026-05-16T19:25:42.086172+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation Jcost_symm (tagged unclear)

    Paper passage linked to the cited Recognition theorem:

    "Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni."

  • PhiForcing phi_equation (tagged unclear)

    Paper passage linked to the cited Recognition theorem:

    "we construct a layered MoA architecture wherein each layer comprises multiple LLM agents"

  • DimensionForcing dimension_forced (tagged unclear)

    Paper passage linked to the cited Recognition theorem:

    "our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...

  2. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  3. Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

    cs.CL 2026-02 unverdicted novelty 7.0

    Pyramid MoA is a hierarchical Mixture-of-Agents system with a decision-theoretic router that achieves up to 42.9% compute savings while nearly matching oracle accuracy on MBPP, GSM8K, MMLU, HumanEval, and MATH.

  4. SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

    cs.AI 2025-12 unverdicted novelty 7.0

    SANet uses semantic-aware AI agents for cross-layer 6G optimization, achieving up to 14.61% performance gains with 44.37% of the FLOPs of prior methods via model partitioning and decentralized multi-objective algorithms.

  5. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  6. Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

  7. Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

    cs.MA 2026-05 unverdicted novelty 6.0

    LLM agent pairs in a resource allocation negotiation game fail to reach Pareto-optimal outcomes due to dynamic grounding failures such as loss of interaction history, anchoring, and referential errors.

  8. Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

    cs.MA 2026-05 unverdicted novelty 6.0

    LLM agent dyads fail to reach Pareto-optimal resource allocations in an iterated negotiation game due to dynamic grounding failures including anchoring, perfunctory fairness, and lost commitments, despite individual c...

  9. CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness

    q-bio.NC 2026-04 unverdicted novelty 6.0

    CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

  10. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  11. SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.

  12. CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 6.0

    CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

  13. ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 6.0

    An LLM-orchestrated framework automates the full XANES workflow from natural language to normalized spectra and curated data.

  14. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

  15. Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

    cs.LG 2026-04 conditional novelty 6.0

    LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.

  16. TRINITY: An Evolved LLM Coordinator

    cs.LG 2025-12 unverdicted novelty 6.0

    A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.

  17. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  18. Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

    cs.AI 2026-05 unverdicted novelty 5.0

    Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...

  19. Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

    cs.AI 2026-04 unverdicted novelty 4.0

    A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.

  20. Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

    cs.DC 2026-04 unverdicted novelty 3.0

    A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
