Mixture-of-Agents Enhances Large Language Model Capabilities
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-16 19:25 UTC · model grok-4.3
The pith
A layered mixture of multiple LLM agents outperforms GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK by using prior-layer outputs as auxiliary context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that constructing a layered Mixture-of-Agents architecture, in which every agent in a layer takes the full set of outputs from agents in the prior layer as auxiliary information, yields responses that surpass GPT-4 Omni on AlpacaEval 2.0, MT-Bench, and FLASK, including a 65.1% score for an open-source-only MoA versus 57.5% for GPT-4 Omni.
What carries the argument
The Mixture-of-Agents layered architecture, in which each agent conditions its generation on all prior-layer outputs as auxiliary context to refine the final answer.
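To make the mechanism concrete, here is a minimal Python sketch of that layered flow. The call_model(model, prompt) helper and the aggregation wording are illustrative assumptions, not the paper's actual interface or published template.

# Minimal sketch of layered Mixture-of-Agents inference, assuming a
# hypothetical call_model(model, prompt) -> str helper. The aggregation
# wording is illustrative, not the paper's published template.

def moa_generate(question, layers, call_model):
    """layers: list of lists of model names, e.g.
    [["model-a", "model-b"], ["model-a", "model-b"], ["aggregator"]];
    the last layer is typically a single aggregator."""
    prior_outputs = []
    for layer in layers:
        prompt = question
        if prior_outputs:
            # Each agent sees ALL outputs from the previous layer.
            context = "\n\n".join(
                f"Response {i + 1}: {out}" for i, out in enumerate(prior_outputs)
            )
            prompt = (
                f"{question}\n\nResponses from the previous layer:\n{context}\n\n"
                "Synthesize these into a single improved answer."
            )
        prior_outputs = [call_model(model, prompt) for model in layer]
    return prior_outputs[0]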
If this is right
- An ensemble of only open-source LLMs can lead the AlpacaEval 2.0 leaderboard by a substantial margin.
- The same layered structure raises scores on MT-Bench and FLASK above GPT-4 Omni.
- No additional training is required; performance gains come from routing information across agents in successive layers.
- Diversity among the chosen base models supplies the collective expertise that drives the improvement.
Where Pith is reading between the lines
- The method implies that incremental layering can continue to extract value from additional agents as long as base-model quality remains high.
- It connects to broader questions of how information should be aggregated across heterogeneous models without explicit weighting.
- A direct test would vary layer count while holding the agent pool fixed and measure whether returns diminish after a small number of layers; a sketch of such a sweep follows this list.
- The approach may generalize to other sequential generation tasks where each step can usefully condition on multiple prior drafts.
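A minimal harness for the layer-count test suggested above might look like the following; evaluate_win_rate is a hypothetical stand-in for an AlpacaEval 2.0-style judge, and moa_generate is reused from the earlier sketch.

# Layer-count ablation sketch: fixed agent pool, varying depth.
# evaluate_win_rate is an assumed judging function, not a real API.

def layer_sweep(questions, agent_pool, max_layers, call_model, evaluate_win_rate):
    results = {}
    for depth in range(1, max_layers + 1):
        layers = [list(agent_pool)] * depth  # same pool at every layer
        answers = [moa_generate(q, layers, call_model) for q in questions]
        results[depth] = evaluate_win_rate(answers)
    # Shrinking deltas between successive depths would indicate
    # diminishing returns after a small number of layers.
    return results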
Load-bearing premise
That feeding outputs from earlier agents as auxiliary input will improve response quality without adding noise or compounding mistakes from weaker models.
What would settle it
A controlled run in which replacing strong early-layer agents with weaker ones causes MoA accuracy to fall below the single best base model on AlpacaEval 2.0 would falsify the central claim.
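A hedged sketch of that controlled run, reusing moa_generate and the hypothetical evaluate_win_rate from above; the model names are placeholders, not the paper's actual agent pool.

# Falsification sketch: weaken the early layers and compare the ensemble
# against the best single base model. Model names are placeholders.

def falsification_run(questions, call_model, evaluate_win_rate):
    strong = ["strong-model-a", "strong-model-b"]
    weak = ["weak-model-a", "weak-model-b"]
    aggregator = ["aggregator-model"]

    weakened_moa = [
        moa_generate(q, [weak, strong, aggregator], call_model) for q in questions
    ]
    best_single = [call_model("strong-model-a", q) for q in questions]

    # If the weakened MoA scores below the best single base model,
    # the load-bearing premise above fails.
    return evaluate_win_rate(weakened_moa), evaluate_win_rate(best_single)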
read the original abstract
Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Mixture-of-Agents (MoA), a layered architecture in which each layer contains multiple LLM agents and each agent receives the full set of outputs from the preceding layer as auxiliary context when generating its response. It reports that MoA ensembles, including those using only open-source models, achieve state-of-the-art results on AlpacaEval 2.0 (65.1% vs. 57.5% for GPT-4 Omni), MT-Bench, and FLASK.
Significance. If the reported gains are shown to be robust to controls for prompt construction and model selection, the work would demonstrate a practical, training-free route to combining existing LLMs that can exceed the performance of the strongest individual models, particularly valuable for open-source ensembles on instruction-following benchmarks.
major comments (3)
- [Section 3] Section 3 (MoA Architecture): the description states that each agent 'takes all the outputs from agents in the previous layer as auxiliary information' with no filtering or weighting; this leaves the central claim vulnerable to error propagation from weaker agents, yet no layer-wise ablation, single-layer baseline, or error-analysis experiment is provided to test the assumption.
- [Section 4.1] Section 4.1 (Experimental Setup): the manuscript gives no details on the precise models chosen for each layer, the number of layers, the prompt templates used to incorporate prior outputs, or any controls for prompt-engineering effects, making it impossible to determine whether the 65.1 % AlpacaEval score arises from the layered architecture or from other factors.
- [Section 4.2] Section 4.2 and Table 1 (Benchmark Results): the reported scores lack statistical significance tests, standard-error estimates, or multiple-run variance; without these, the claimed 'substantial gap' over GPT-4 Omni cannot be assessed as reliable.
minor comments (2)
- [Abstract] The abstract and introduction use 'GPT-4 Omni' and 'GPT-4o' interchangeably; standardize the terminology throughout.
- [Figure 1] Figure 1 (architecture diagram) would benefit from explicit labels for layer indices and input/output arrows to clarify the flow of auxiliary context.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each point below and will revise the manuscript to incorporate additional experiments, details, and analyses as suggested.
read point-by-point responses
-
Referee: [Section 3] Section 3 (MoA Architecture): the description states that each agent 'takes all the outputs from agents in the previous layer as auxiliary information' with no filtering or weighting; this leaves the central claim vulnerable to error propagation from weaker agents, yet no layer-wise ablation, single-layer baseline, or error-analysis experiment is provided to test the assumption.
Authors: We agree that the manuscript would benefit from explicit validation of the architecture's robustness. In the revision, we will add layer-wise ablation studies (removing individual layers), single-layer baselines, and an error-propagation analysis showing how later agents synthesize and correct outputs from prior layers. revision: yes
-
Referee: [Section 4.1] Section 4.1 (Experimental Setup): the manuscript gives no details on the precise models chosen for each layer, the number of layers, the prompt templates used to incorporate prior outputs, or any controls for prompt-engineering effects, making it impossible to determine whether the 65.1 % AlpacaEval score arises from the layered architecture or from other factors.
Authors: We will expand Section 4.1 and add a detailed appendix specifying the exact models per layer (e.g., Llama-3-70B, Mixtral-8x22B), number of layers (3 in main results), full prompt templates, and prompt-engineering controls such as equivalent single-model prompts to isolate the MoA contribution. revision: yes
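For illustration only, a prompt template of the kind the revised Section 4.1 would need to document might look like the following sketch; the wording is an assumption, not the paper's published template.

# Hypothetical aggregation prompt template; an assumption for illustration,
# not the template used in the paper.

AGGREGATION_TEMPLATE = """You have been provided with responses from several
models to the query below. Critically evaluate them and synthesize a single
response that is accurate, comprehensive, and well written.

Query:
{query}

Model responses:
{numbered_responses}
"""

def build_prompt(query, prior_responses):
    numbered = "\n".join(
        f"{i + 1}. {resp}" for i, resp in enumerate(prior_responses)
    )
    return AGGREGATION_TEMPLATE.format(query=query, numbered_responses=numbered)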
-
Referee: [Section 4.2] Section 4.2 and Table 1 (Benchmark Results): the reported scores lack statistical significance tests, standard-error estimates, or multiple-run variance; without these, the claimed 'substantial gap' over GPT-4 Omni cannot be assessed as reliable.
Authors: We acknowledge the need for statistical rigor. The revision will report standard errors from multiple runs (where evaluation variance can be measured) and include significance tests (e.g., bootstrap or paired comparisons) to substantiate the gaps on AlpacaEval 2.0, MT-Bench, and FLASK. revision: yes
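One concrete form such a test could take is a paired bootstrap over per-prompt judge scores. The sketch below assumes aligned per-prompt score lists for the MoA system and the baseline; it is one reasonable instantiation, not the authors' stated procedure.

import random

# Paired bootstrap over per-prompt judge scores: resample prompts with
# replacement and estimate a one-sided p-value for the score gap.

def paired_bootstrap(scores_moa, scores_baseline, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    n = len(scores_moa)
    observed = sum(scores_moa) / n - sum(scores_baseline) / n
    positive = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_moa[i] - scores_baseline[i] for i in idx) / n
        if delta > 0:
            positive += 1
    p_value = 1.0 - positive / n_resamples  # fraction where the gap vanishes
    return observed, p_value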
Circularity Check
No circularity: empirical benchmark results on fixed public datasets
full rationale
The paper proposes a layered MoA architecture in which each agent receives prior-layer outputs as auxiliary context and then reports direct empirical scores on AlpacaEval 2.0, MT-Bench, and FLASK. These scores are measured outcomes on held-out benchmarks rather than quantities derived from fitted parameters, self-referential definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed performance gains to the inputs by construction; the central claim therefore remains an independent empirical observation.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of layers
- Number of agents per layer
axioms (1)
- domain assumption: LLM agents can usefully incorporate outputs from other models as auxiliary context in their prompts.
Lean theorems connected to this paper
-
Cost.FunctionalEquation.Jcost_symm (tagged unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni.
-
PhiForcing.phi_equation (tagged unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
we construct a layered MoA architecture wherein each layer comprises multiple LLM agents
-
DimensionForcing.dimension_forced (tagged unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Pyramid MoA is a hierarchical Mixture-of-Agents system with a decision-theoretic router that achieves up to 42.9% compute savings while nearly matching oracle accuracy on MBPP, GSM8K, MMLU, HumanEval, and MATH.
-
SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G
SANet uses semantic-aware AI agents for cross-layer 6G optimization, achieving up to 14.61% performance gains with 44.37% of the FLOPs of prior methods via model partitioning and decentralized multi-objective algorithms.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
-
Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation
LLM agent pairs in a resource allocation negotiation game fail to reach Pareto-optimal outcomes due to dynamic grounding failures such as loss of interaction history, anchoring, and referential errors.
-
Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation
LLM agent dyads fail to reach Pareto-optimal resource allocations in an iterated negotiation game due to dynamic grounding failures including anchoring, perfunctory fairness, and lost commitments, despite individual c...
-
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.
-
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
-
ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis
An LLM-orchestrated framework automates the full XANES workflow from natural language to normalized spectra and curated data.
-
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
-
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
-
TRINITY: An Evolved LLM Coordinator
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...
-
Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures
A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.
-
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
Reference graph
Works this paper leans on
-
[1]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
-
[2]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
-
[3]
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
-
[4]
Chen, J. C.-Y., Saha, S., and Bansal, M. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. arXiv preprint arXiv:2309.13007, 2023a.
Chen, L., Zaharia, M., and Zou, J. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023b.
Chowdhery, A., Narang, ...
-
[5]
Active prompting with chain-of-thought for large language models
Diao, S., Wang, P., Lin, Y., and Zhang, T. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246, 2023.
-
[6]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
-
[7]
Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
-
[8]
Complexity-based prompting for multi-step reasoning
Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720, 2022.
-
[9]
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
-
[10]
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
-
[11]
Huang, Y., Feng, X., Li, B., Xiang, Y., Wang, H., Qin, B., and Liu, T. Enabling ensemble learning for heterogeneous large language models with deep parallel collaboration. arXiv preprint arXiv:2404.12715, 2024.
-
[12]
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de Las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. doi: 10.48550/arXiv.2401.04088. URL https://doi.org/10.48550/arXiv.2401.04088.
-
[13]
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Jiang, D., Ren, X., and Lin, B. Y. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.792. URL https://aclanthology.org/2023.acl-long.792.
-
[15]
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
-
[16]
Deductive verification of chain-of-thought reasoning
Ling, Z., Fang, Y., Li, X., Huang, Z., Lee, M., Memisevic, R., and Su, H. Deductive verification of chain-of-thought reasoning. arXiv preprint arXiv:2306.03872, 2023.
-
[17]
Bleu: a method for automatic evaluation of machine translation
Papineni, K., Roukos, S., Ward, T., and Zhu, W. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318. ACL, 2002. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040/.
-
[18]
RapidFuzz. python-Levenshtein by RapidFuzz. URL https://github.com/rapidfuzz/python-Levenshtein.
-
[19]
Code Llama: Open Foundation Models for Code
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
-
[20]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
-
[21]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
-
[22]
LLaMA: Open and Efficient Foundation Language Models
URL https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron, H., Martin, L., Stone, K., Albert, P., Al...
-
[23]
Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key? arXiv:2402.18272 [cs]
Wang, Q., Wang, Z., Su, Y., Tong, H., and Song, Y. Rethinking the bounds of LLM reasoning: Are multi-agent discussions the key? arXiv preprint arXiv:2402.18272, 2024b.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2...
-
[24]
WizardLM: Empowering large pre-trained language models to follow complex instructions
-
[25]
Exploring collaboration mechanisms for llm agents: A social psychology view
Zhang, J., Xu, X., and Deng, S. Exploring collaboration mechanisms for LLM agents: A social psychology view. arXiv preprint arXiv:2310.02124, 2023.
-
[26]
Automatic Chain of Thought Prompting in Large Language Models
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022a.
Zhang, Z., Zhang, A., Li, M., and Smola, A. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022b.
Zhen...
discussion (0)