pith. machine review for the scientific record

arxiv: 2604.02863 · v1 · submitted 2026-04-03 · 💻 cs.AI

Recognition: 1 theorem link · Lean Theorem

EMS: Multi-Agent Voting via Efficient Majority-then-Stopping


Pith reviewed 2026-05-13 19:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent systems · majority voting · early stopping · reliability estimation · agent scheduling · efficient reasoning · incremental voting

The pith

EMS reduces the average number of invoked agents by 32% in multi-agent voting by stopping once a majority consensus forms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that running every agent before aggregating wastes computation once a clear majority appears. It reframes the task as reliability-aware scheduling so that agents are ordered by estimated trustworthiness and the process halts the moment enough votes agree. Reliability comes from each agent's past accuracy plus how closely its current answer matches others, with those scores updated after each round. A sympathetic reader cares because multi-agent reasoning pipelines are costly to run at scale, and trimming redundant agents directly lowers that cost while the final majority decision stays the same.

Core claim

Multi-agent voting is cast as a reliability-aware agent scheduling problem. EMS prioritizes agents according to task-aware reliability and stops the reasoning pipeline the instant a majority is reached. The approach rests on three parts: Agent Confidence Modeling that scores reliability from historical performance and semantic similarity, Adaptive Incremental Voting that adds agents sequentially until the stopping condition, and Individual Confidence Updating that refreshes each agent's score after it contributes. Across six benchmarks this yields a consistent 32% drop in the average number of invoked agents.
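Read as one loop, the three components amount to: order agents by estimated reliability, add votes one at a time, and stop once the leading answer holds a strict majority of the full panel, at which point no remaining vote can overturn it. A minimal sketch of that reading, with `reliability` and `answer_of` as hypothetical stand-ins for the paper's ACM scores and agent calls (the paper's actual formulas are not reproduced here):

```python
from collections import Counter

def ems_vote(agents, query, reliability, answer_of):
    """Adaptive incremental voting with early stopping (a sketch, not
    the paper's implementation). Agents are queried in descending
    estimated reliability; we stop as soon as one answer holds a strict
    majority of the WHOLE panel, since later votes cannot overturn it."""
    n = len(agents)
    order = sorted(agents, key=lambda a: reliability[a], reverse=True)
    votes = Counter()
    for invoked, agent in enumerate(order, start=1):
        votes[answer_of(agent, query)] += 1
        leader, count = votes.most_common(1)[0]
        if count > n // 2:          # strict majority: outcome is fixed
            return leader, invoked  # early stop saves n - invoked calls
    # no strict majority emerged: fall back to plurality over all votes
    return votes.most_common(1)[0][0], n
```

With five unanimous agents this stops after three calls, the smallest strict majority of the panel.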

What carries the argument

The EMS scheduler that orders agents by reliability estimates and terminates once a majority is reached, using Agent Confidence Modeling, Adaptive Incremental Voting, and Individual Confidence Updating.

If this is right

  • The average number of agents required falls by 32% while the quality of the majority decision remains equivalent to full voting.
  • Computational cost drops because redundant reasoning steps are avoided once consensus appears.
  • The same early-stopping rule works across six different benchmarks without task-specific tuning.
  • Reliability scores improve over successive uses because each agent's contribution updates its estimate.
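One plausible shape for the reliability machinery behind these bullets, assuming a simple convex blend for ACM and an exponential moving average for ICU (the weights `alpha` and `beta` are illustrative assumptions, not taken from the paper):

```python
def acm_score(hist_accuracy, answer_similarity, alpha=0.5):
    """Agent Confidence Modeling, sketched as a convex blend of the
    agent's historical accuracy and how closely its current answer
    matches the other agents' answers (both assumed to lie in [0, 1]).
    `alpha` is an assumed mixing weight."""
    return alpha * hist_accuracy + (1 - alpha) * answer_similarity

def icu_update(old_reliability, agreed_with_majority, beta=0.1):
    """Individual Confidence Updating, sketched as an exponential
    moving average: drift toward 1.0 when the agent's vote matched the
    final majority, toward 0.0 otherwise. `beta` is an assumed rate."""
    target = 1.0 if agreed_with_majority else 0.0
    return (1 - beta) * old_reliability + beta * target
```

Under this shape, repeated agreement with the majority pushes an agent's score up, which is the self-improving behavior the last bullet describes.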

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reliability-driven ordering could be tested with weighted or ranked voting rules instead of simple majority.
  • In settings with tight latency limits the method might allow more agents to be considered within a fixed time budget.
  • Domains where historical data are scarce would need to check whether semantic similarity alone still produces safe stopping decisions.

Load-bearing premise

Estimates of agent reliability drawn from historical performance and semantic similarity are accurate enough that early stopping leaves the final majority decision unchanged.

What would settle it

Run EMS and full voting side-by-side on the same queries and count how often the early-stopped majority differs from the full-voting result in accuracy or outcome.
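A minimal harness for that experiment, with `ems_answer` and `panel_answers` as hypothetical callables standing in for the early-stopped pipeline and the full panel:

```python
from collections import Counter

def full_vote(answers):
    """Plurality over the full panel's answers (full-voting baseline)."""
    return Counter(answers).most_common(1)[0][0]

def flip_rate(queries, ems_answer, panel_answers):
    """Fraction of queries on which the early-stopped majority differs
    from the full-voting majority; 0.0 would mean early stopping never
    changes the outcome."""
    flips = sum(1 for q in queries
                if ems_answer(q) != full_vote(panel_answers(q)))
    return flips / len(queries)
```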

Figures

Figures reproduced from arXiv: 2604.02863 by Hantao Yao, Wu Liu, Yiqing Liu, Yongdong Zhang.

Figure 1. Intuitive comparison of different multi-agent voting …
Figure 2. Overview of the proposed Efficient Majority-then-Stopping (EMS) framework. For each query, EMS first uses the Agent …
Figure 3. Analysis of the Adaptive Incremental Voting. Each bar …
Original abstract

Majority voting is the standard for aggregating multi-agent responses into a final decision. However, traditional methods typically require all agents to complete their reasoning before aggregation begins, leading to significant computational overhead, as many responses become redundant once a majority consensus is achieved. In this work, we formulate the multi-agent voting as a reliability-aware agent scheduling problem, and propose an Efficient Majority-then-Stopping (EMS) to improve reasoning efficiency. EMS prioritizes agents based on task-aware reliability and terminates the reasoning pipeline the moment a majority is achieved from the following three critical components. Specifically, we introduce Agent Confidence Modeling (ACM) to estimate agent reliability using historical performance and semantic similarity, Adaptive Incremental Voting (AIV) to sequentially select agents with early stopping, and Individual Confidence Updating (ICU) to dynamically update the reliability of each contributing agent. Extensive evaluations across six benchmarks demonstrate that EMS consistently reduces the average number of invoked agents by 32%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formulates multi-agent voting as a reliability-aware scheduling problem and proposes Efficient Majority-then-Stopping (EMS). EMS uses Agent Confidence Modeling (ACM) to estimate reliability from historical performance and semantic similarity, Adaptive Incremental Voting (AIV) for sequential agent selection, and Individual Confidence Updating (ICU) to terminate once a majority is reached, claiming a consistent 32% reduction in the average number of invoked agents across six benchmarks.

Significance. If the reliability estimates in ACM and ICU preserve the correctness of the final majority decision, EMS could meaningfully reduce compute in multi-agent systems without sacrificing output quality. The paper provides no machine-checked proofs or parameter-free derivations, and the evaluation reports only agent-count savings.

major comments (3)
  1. [Abstract] Abstract: the headline claim of a 32% reduction in invoked agents is presented without any accuracy comparison to full voting, statistical significance tests, error bars, or baseline details, so it is impossible to determine whether early stopping trades correctness for efficiency.
  2. [Evaluation] Evaluation section (and § on ACM): the reliability scores derived from historical performance and semantic similarity are treated as sufficiently accurate to decide stopping, yet no ablation or sensitivity analysis shows how noisy or biased estimates affect the final majority outcome on the six benchmarks.
  3. [Method] AIV and ICU description: the stopping rule assumes that once a majority is reached under the current reliability estimates, additional agents cannot change the outcome, but no formal argument or empirical check confirms this invariance holds when estimates are imperfect.
minor comments (2)
  1. [Method] Clarify the precise definition of 'majority' (e.g., strict >50% or tie-breaking rule) and how ICU updates are computed after each new response.
  2. [Abstract] The abstract mentions 'six benchmarks' but does not name them or report per-benchmark breakdowns; add this information for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the emphasis on ensuring that efficiency gains are not achieved at the expense of decision correctness. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of a 32% reduction in invoked agents is presented without any accuracy comparison to full voting, statistical significance tests, error bars, or baseline details, so it is impossible to determine whether early stopping trades correctness for efficiency.

    Authors: We agree that the abstract should better contextualize the efficiency claim. In the revised version, we will expand the abstract to explicitly note that accuracy remains comparable to full voting (with details and statistical tests provided in the evaluation section), include error bars on the reported savings, and reference the baselines used. This will clarify that early stopping preserves correctness. revision: yes

  2. Referee: [Evaluation] Evaluation section (and § on ACM): the reliability scores derived from historical performance and semantic similarity are treated as sufficiently accurate to decide stopping, yet no ablation or sensitivity analysis shows how noisy or biased estimates affect the final majority outcome on the six benchmarks.

    Authors: We acknowledge the absence of such analysis in the current draft. We will add a dedicated sensitivity subsection to the evaluation, introducing controlled perturbations to the reliability estimates (both historical and similarity-based) and reporting their effects on stopping decisions and final majority accuracy across all six benchmarks. revision: yes

  3. Referee: [Method] AIV and ICU description: the stopping rule assumes that once a majority is reached under the current reliability estimates, additional agents cannot change the outcome, but no formal argument or empirical check confirms this invariance holds when estimates are imperfect.

    Authors: The stopping criterion is heuristic and relies on the quality of estimates. A parameter-free formal proof is not feasible given the data-driven and stochastic nature of agent outputs. However, we will add an empirical check in the revised evaluation: after EMS stops, we continue with remaining agents and quantify the fraction of cases where the majority outcome is unchanged, even under imperfect estimates. revision: partial

Circularity Check

0 steps flagged

No circularity: reliability estimates and efficiency gains are empirically grounded

full rationale

The paper defines Agent Confidence Modeling (ACM) using historical performance and semantic similarity as inputs that are independent of the final majority outcome. Adaptive Incremental Voting (AIV) and Individual Confidence Updating (ICU) are scheduling mechanisms whose stopping rule is evaluated on six benchmarks for agent-count reduction. No equations reduce a prediction to a fitted parameter by construction, no self-citation is load-bearing for the central claim, and no uniqueness theorem or ansatz is imported from prior author work. The 32% reduction is presented as an empirical result, not a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that agent reliability is stable and predictable from history and task similarity; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Agent reliability can be estimated from historical performance and semantic similarity to the current task
    This underpins Agent Confidence Modeling and the prioritization in Adaptive Incremental Voting.

pith-pipeline@v0.9.0 · 5458 in / 1212 out tokens · 74087 ms · 2026-05-13T19:53:08.955978+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    EMS prioritizes agents based on task-aware reliability and terminates the reasoning pipeline the moment a majority is achieved... Agent Confidence Modeling (ACM) to estimate agent reliability using historical performance and semantic similarity, Adaptive Incremental Voting (AIV)...

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 6 internal anchors

  1. [1]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022

  2. [2]

    Better zero-shot reasoning with self-adaptive prompting

    Xingchen Wan, Ruoxi Sun, Hanjun Dai, Sercan Arik, and Tomas Pfister. Better zero-shot reasoning with self-adaptive prompting. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3493–3514, 2023

  3. [3]

    Fundamental capabilities and applications of large language models: A survey

    Jiawei Li, Yang Gao, Yizhe Yang, Yu Bai, Xiaofeng Zhou, Yinghao Li, Huashan Sun, Yuhang Liu, Xingpeng Si, Yuhao Ye, Yixiao Wu, Yiguan Lin, Bin Xu, Ren Bowen, Chong Feng, and Heyan Huang. Fundamental capabilities and applications of large language models: A survey. ACM Comput. Surv., 58(2):38:1–38:42, 2026

  4. [4]

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

  5. [5]

    A peek into token bias: Large language models are not yet genuine reasoners

    Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, and Dan Roth. A peek into token bias: Large language models are not yet genuine reasoners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4722–4756, 2024

  6. [6]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024

  7. [7]

    Theory of mind for multi-agent collaboration via large language models

    Huao Li, Yu Chong, Simon Stepputtis, Joseph P Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 180–192, 2023

  8. [8]

    Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, ICLR 202...

  9. [9]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. CoRR, abs/2308.08155, 2023

  10. [10]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

  11. [11]

    LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023

  12. [12]

    Encouraging divergent thinking in large language models through multi-agent debate

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, 2024

  13. [13]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023

  14. [14]

    An electoral approach to diversify llm-based multi-agent collective decision-making

    Xiutian Zhao, Ke Wang, and Wei Peng. An electoral approach to diversify llm-based multi-agent collective decision-making. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 2712–2727. Association for Co...

  15. [15]

    Voting or consensus? Decision-making in multi-agent debate

    Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. Voting or consensus? Decision-making in multi-agent debate. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, 2025

  16. [16]

    The Byzantine generals problem

    Leslie Lamport, Robert E. Shostak, and Marshall C. Pease. The Byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, 1982

  17. [17]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023

  18. [18]

    MasRouter: Learning to route LLMs for multi-agent systems

    Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. MasRouter: Learning to route LLMs for multi-agent systems. arXiv preprint arXiv:2502.11133, 2025

  19. [19]

    GroupDebate: Enhancing the efficiency of multi-agent debate using group discussion

    Tongxuan Liu, Xingyu Wang, Weizhe Huang, Wenjiang Xu, Yuting Zeng, Lei Jiang, Hailong Yang, and Jing Li. GroupDebate: Enhancing the efficiency of multi-agent debate using group discussion. CoRR, abs/2409.14051, 2024

  20. [20]

    Debate only when necessary: Adaptive multiagent collaboration for efficient llm reasoning

    Sugyeong Eo, Hyeonseok Moon, Evelyn Hayoon Zi, Chanjun Park, and Heuiseok Lim. Debate only when necessary: Adaptive multiagent collaboration for efficient llm reasoning. arXiv preprint arXiv:2504.05047, 2025

  21. [21]

    Benchmark self-evolving: A multi-agent framework for dynamic LLM evaluation

    Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuanjing Huang, and Zhongyu Wei. Benchmark self-evolving: A multi-agent framework for dynamic LLM evaluation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Computationa...

  22. [22]

    Association for Computational Linguistics, 2025

  23. [23]

    Debate or vote: Which yields better decisions in multi-agent large language models?

    Hyeong Kyu Choi, Xiaojin Zhu, and Sharon Li. Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536, 2025

  24. [24]

    Key decision-makers in multi-agent debates: Who holds the power?

    Qian Zhang, Yan Zheng, Jinyi Liu, Hebin Liang, and Lanjun Wang. Key decision-makers in multi-agent debates: Who holds the power? CoRR, abs/2511.11040, 2025

  25. [25]

    Towards a Science of Scaling Agent Systems

    Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak N. Patel, Tim Althoff, Daniel McDuff, and Xin Liu. Towards a science of scaling agent systems. CoRR, abs/2512.08296, 2025

  26. [26]

    Program induction by rationale generation: Learning to solve and explain algebraic word problems

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, 2017

  27. [27]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  28. [28]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021

  29. [29]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

  30. [30]

    GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024

  31. [31]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

  32. [32]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neura...

  33. [33]

    More agents is all you need

    Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. Trans. Mach. Learn. Res., 2024

  34. [34]

    Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024