Agentic Systems as Boosting Weak Reasoning Models
Pith reviewed 2026-05-15 04:55 UTC · model grok-4.3
The pith
Weak reasoning models reach strong-model performance by recovering correct solutions already latent in their own proposal pools using local signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proves that repeated sampling amplifies coverage but cannot by itself produce useful critics or comparators; reliable amplification needs an additional local soundness signal such as execution, proof checking, or tests. It gives rank-based bounds on when local selection errors compose into reliable trajectories and shows that the proposer-side ceiling is the total mass of task slices on which the model assigns nonzero probability to a useful solution. Empirically, the critic-comparator system lifts GPT-5.4 nano from 67.0 percent to 76.4 percent on SWE-bench Verified with k=8 proposals, and the remaining errors are mostly proposal-coverage failures.
What carries the argument
The critic-comparator orchestration that ranks proposals by local soundness signals such as execution and test outcomes without access to any hidden verifier.
If this is right
- Repeated sampling increases proposal coverage but does not by itself create reliable critics or comparators.
- Rank-based bounds determine when local selection errors still allow reliable overall trajectories.
- Oracle best-of-k performance converges, as k grows, only to the mass of task slices on which the proposal model assigns nonzero probability to a useful solution.
- After boosting, the remaining errors are mostly proposal-coverage failures that stronger selection alone cannot close.
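The coverage ceiling described in these bullets can be illustrated with a minimal sketch. It assumes independent samples and a hypothetical per-task probability that a single proposal is useful; the function names `oracle_best_of_k` and `coverage_ceiling` and the example probabilities are illustrative and do not come from the paper.

```python
def oracle_best_of_k(task_probs, k):
    """Expected fraction of tasks with at least one useful proposal among k.

    Assumes k independent samples per task, where task_probs[i] is the
    (hypothetical) probability that one proposal for task i is useful.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(1.0 - (1.0 - p) ** k for p in task_probs) / len(task_probs)


def coverage_ceiling(task_probs):
    """Fraction of tasks with nonzero useful probability (the limit as k grows)."""
    return sum(1 for p in task_probs if p > 0.0) / len(task_probs)


# Illustrative numbers only: one task the proposer never solves,
# two it sometimes solves, and one it always solves.
probs = [0.0, 0.1, 0.5, 1.0]
for k in (1, 8, 64):
    print(k, round(oracle_best_of_k(probs, k), 4))
print("ceiling", coverage_ceiling(probs))
```

In this toy setting the oracle score rises with k but never passes the ceiling, because no amount of sampling helps on the task with zero useful probability.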
Where Pith is reading between the lines
- Inference-time compute on weak models may substitute for some training-time increases in model size.
- Domains lacking cheap, reliable local soundness signals are unlikely to benefit from this form of boosting.
- Adding explicit diversity mechanisms to the proposal step could further narrow the gap to the oracle bound.
Load-bearing premise
Local soundness signals such as execution or test results reliably distinguish correct proposals from incorrect ones.
What would settle it
Construct a benchmark where local signals like tests or execution cannot separate correct from incorrect proposals; if the orchestration then fails to exceed the single-proposal baseline, the central claim is false.
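The proposed test can be previewed on a toy model. The sketch below assumes each of k proposals is independently correct with probability p, that a binary local signal passes correct proposals with rate `tpr` and incorrect ones with rate `fpr`, and that the selector picks uniformly among passing proposals (or among all proposals if none pass). These modelling choices and all names are hypothetical and are not the paper's orchestration.

```python
from math import comb


def binom_pmf(n, j, q):
    """Probability of exactly j successes in n independent trials with rate q."""
    return comb(n, j) * q**j * (1 - q) ** (n - j)


def selection_accuracy(p, k, tpr, fpr):
    """Exact probability that the selected proposal is correct in the toy model."""
    total = 0.0
    for c in range(k + 1):                  # c = number of correct proposals
        w_c = binom_pmf(k, c, p)
        for a in range(c + 1):              # a = correct proposals that pass
            w_a = binom_pmf(c, a, tpr)
            for b in range(k - c + 1):      # b = incorrect proposals that pass
                w_b = binom_pmf(k - c, b, fpr)
                # Pick uniformly among passing proposals, else among all k.
                hit = a / (a + b) if a + b > 0 else c / k
                total += w_c * w_a * w_b * hit
    return total


p, k = 0.3, 8  # illustrative values, not taken from the paper
print("uninformative signal:", selection_accuracy(p, k, 0.5, 0.5))
print("noisy signal:        ", selection_accuracy(p, k, 0.9, 0.2))
print("perfect signal:      ", selection_accuracy(p, k, 1.0, 0.0))
```

When `tpr` equals `fpr` the signal carries no information and the selector does no better than a single proposal; with a perfect signal it reaches the oracle best-of-k value for this toy model.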
Original abstract
Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that "more agents help": samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks. Using the same nano model, our critic-comparator orchestration reaches 76.4% with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that committees of weak reasoning models can boost performance to match stronger models via verifier-backed critic-comparator orchestration at inference time. It separates proposal coverage, local identifiability, progress, and diversity; proves that repeated sampling amplifies coverage but requires an additional local soundness signal (e.g., execution or tests) to create reliable critics/comparators; gives rank-based bounds on when local selection errors compose into reliable trajectories; and reports that a single GPT-5.4 nano proposal solves 67.0% of SWE-bench Verified tasks while the same nano model with k=8 orchestrated proposals reaches 76.4%, matching Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 bound. The remaining failures are attributed to proposal-coverage gaps.
Significance. If the empirical protocol and theoretical separation hold, the work demonstrates that many correct solutions are already latent in weak-model proposal pools and that the primary bottleneck is selection rather than generation. The formalization of coverage amplification versus local identifiability, together with the rank-based bounds, supplies a useful analytical lens for agentic systems. The reproducible results on a public benchmark and the explicit oracle upper bound are concrete strengths that allow direct comparison with future work.
major comments (2)
- [Experimental protocol and results] Experimental protocol (results section and SWE-bench Verified description): the manuscript does not state whether the critic and comparator are allowed to execute the task test suite on the k=8 proposals. Because SWE-bench Verified defines patch correctness precisely by running those tests, permitting execution would supply the hidden verifier signal directly to the selection step. This would collapse the claimed separation between proposal coverage and local identifiability that the rank-based bounds and amplification proofs rely on, and would re-interpret the jump from 67.0% to 76.4% as verifier-assisted selection rather than pure weak-model committee boosting.
- [Theoretical results] Theoretical claims (abstract and § on formalization): proofs of coverage amplification by repeated sampling and the rank-based bounds on local selection errors are referenced but supplied without explicit derivations, assumption statements, or error analysis. Because these results are load-bearing for the central claim that “reliable amplification requires an additional local soundness signal,” the absence of the derivations prevents verification that the bounds are non-vacuous and that the separation between coverage and identifiability is rigorously maintained.
minor comments (2)
- [Formalization section] Notation for “progress” and “diversity” is introduced in the formalization but not consistently referenced in the subsequent rank-bound statements; adding a short table of symbols would improve readability.
- [Results] The oracle best-of-8 bound is reported as 79.0% without an accompanying breakdown by task slice or failure mode; a supplementary table showing the mass of tasks on which the nano proposer assigns zero useful probability would make the “proposal-coverage ceiling” claim more concrete.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to improve clarity on the experimental protocol and to include the requested theoretical derivations.
Point-by-point responses
-
Referee: [Experimental protocol and results] Experimental protocol (results section and SWE-bench Verified description): the manuscript does not state whether the critic and comparator are allowed to execute the task test suite on the k=8 proposals. Because SWE-bench Verified defines patch correctness precisely by running those tests, permitting execution would supply the hidden verifier signal directly to the selection step. This would collapse the claimed separation between proposal coverage and local identifiability that the rank-based bounds and amplification proofs rely on, and would re-interpret the jump from 67.0% to 76.4% as verifier-assisted selection rather than pure weak-model committee boosting.
Authors: We thank the referee for highlighting the need for explicit protocol details. In our verifier-backed committee search, the critic and comparator are explicitly permitted to execute proposals against the task test suite; this execution constitutes the local soundness signal formalized in the paper (e.g., tests or constraint solving). The 'hidden verifier' referenced in the manuscript denotes an oracle that would directly reveal ground-truth correctness without any computation, whereas test execution supplies a noisy but computable local signal that enables identifiability. The separation between coverage (supplied by the weak model's proposal distribution) and local identifiability (supplied by the critic/comparator using the execution signal) is therefore preserved by design. The empirical jump from 67.0% to 76.4% is precisely the result of using this local signal to select among proposals. We will revise the results section and SWE-bench description to state the exact signals used by the critic and comparator, confirming that test-suite execution is the local verifier while the final reported accuracy uses the hidden ground-truth evaluation only for scoring.
revision: yes
-
Referee: [Theoretical results] Theoretical claims (abstract and § on formalization): proofs of coverage amplification by repeated sampling and the rank-based bounds on local selection errors are referenced but supplied without explicit derivations, assumption statements, or error analysis. Because these results are load-bearing for the central claim that “reliable amplification requires an additional local soundness signal,” the absence of the derivations prevents verification that the bounds are non-vacuous and that the separation between coverage and identifiability is rigorously maintained.
Authors: We agree that the derivations must be supplied for the theoretical claims to be verifiable. The coverage-amplification result (showing that the probability of including at least one correct proposal grows with k under independent sampling) and the rank-based bounds (showing when local selection errors compose into reliable trajectories) were derived under standard assumptions but omitted from the main text for brevity. We will add a new appendix containing the full proofs, explicit assumption statements (i.i.d. sampling from the proposal model, bounded local error rate of the critic/comparator, and rank-order preservation), and an error analysis confirming that the bounds are non-vacuous for the observed proposal-quality regime on SWE-bench Verified. This addition will rigorously substantiate the necessity of the local soundness signal for reliable amplification.
revision: yes
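The coverage-amplification statement in this response can be written out as a short sketch. The notation is an assumption introduced here for illustration and is not taken from the paper: $x \sim \mathcal{D}$ denotes a task, $p_x$ is the probability that one proposal for $x$ is useful, and the $k$ proposals are drawn i.i.d.

```latex
\begin{align*}
\Pr[\text{at least one useful proposal among } k \mid x] &= 1 - (1 - p_x)^k, \\
\mathrm{Cov}(k) &= \mathbb{E}_{x \sim \mathcal{D}}\bigl[\, 1 - (1 - p_x)^k \,\bigr], \\
\lim_{k \to \infty} \mathrm{Cov}(k) &= \Pr_{x \sim \mathcal{D}}\bigl[\, p_x > 0 \,\bigr].
\end{align*}
```

The limit holds because $(1 - p_x)^k \to 0$ whenever $p_x > 0$ and stays at 1 when $p_x = 0$. This is consistent with the abstract's statement that oracle best-of-k converges only to the mass of task slices with nonzero useful probability. The sketch covers the coverage side only and says nothing about whether a critic or comparator can identify the useful proposal.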
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's central claims are supported by empirical measurements on the public SWE-bench Verified benchmark and theoretical arguments based on standard sampling, ranking, and amplification bounds. No load-bearing step reduces to a self-definition, fitted input renamed as prediction, or self-citation chain. The distinction between local soundness signals and the hidden verifier is maintained without circular reduction, and the performance numbers (67.0% to 76.4%) are reported as direct observations rather than derived by construction from the model inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard probabilistic bounds on repeated sampling and rank-based selection errors
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "rank-based bounds showing when local selection errors compose into reliable trajectories"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Second European Conference, EuroCOLT 1995, volume 904 of Lecture Notes in Computer Science, pages 23–37. Springer, 1995.
- [3]
- [4]
- [5] Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. Transactions on Machine Learning Research, 2024.
- [6] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024.
- [7] Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A. Zaharia, and James Y. Zou. Are more LM calls all you need? towards the scaling properties of compound AI systems. In Advances in Neural Information Processing Systems, volume 37, pages 45767–45790, 2024.
- [8] Audrey Huang, Adam Block, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, and Akshay Krishnamurthy. Is best-of-N the best of them? coverage, scaling, and optimality in inference-time alignment, 2025.
- [9] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [11] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023.
- [12] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.
- [13] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, volume 37, 2024.
- [14] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement, 2024. Published version appears in the ACM SIGSOFT/ISSTA proceedings.
- [15] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Demystifying LLM-based software engineering agents. Proceedings of the ACM on Software Engineering, 2(FSE):801–824, 2025.
- [16] Hariharan Manikandan, Yiding Jiang, and J. Zico Kolter. Language models are weak learners. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [17] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of...
- [18] Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, and Yikun Ban. LLMBoost: Make large language models stronger with boosting, 2025. Preprint; also submitted to ICLR 2026 on OpenReview.
- [19]
- [20] Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation, 2023.
- [21] Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, and Chen Ma. A survey on test-time scaling in large language models: What, how, where, and how well?, 2025.
- [22] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024.
- [23] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, Bangkok, Thailand, August 2024. Association...
- [24] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [25] Shalev Lifshitz, Sheila A. McIlraith, and Yilun Du. Multi-agent verification: Scaling test-time compute with multiple verifiers, 2025.
- [26] Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, and Christopher Ré. Shrinking the generation-verification gap with weak verifiers, 2025.
- [27] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [28] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James V. Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hanna Hajishirzi. Rewardbench: Evaluating reward models for language modeling, 2024.
- [29] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In International Conference on Learning Representations, 2024.
- [30] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3563–3578, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
- [31] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In International Conference on Learning Representations, 2023.
- [32] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahong Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, Singapore, 2023. Association for Computational Linguistics.
- [33] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models, 2023.
- [34] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
- [35] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [36] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing...
- [37] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large scale language model society, 2023.
- [38] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023.
- [39] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2024.
- [40] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities, 2024.
- [41] Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, Estefany Kelly Buchanan, Mayee F. Chen, Neel Guha, Christopher Ré, and Azalia Mirhoseini. Archon: An architecture search framework for inference-time techniques, 2024.
- [42] Neel Guha, Mayee F. Chen, Trevor Chow, Ishan S. Khare, and Christopher Ré. Smoothie: Label free language model routing, 2024.
- [43] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Sutherland Robson, Pushme... Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- [44] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavar...
- [45] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate, 2018.
- [46] Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts, 2018.
- [47] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 11733–11763. PMLR, 2024.
- [48] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, Miami, Florida, USA, November 2024. Association f...
- [49] Jonah Brown-Cohen, Geoffrey Irving, and Georgios Piliouras. Scalable AI safety via doubly-efficient debate. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 4585–4602. PMLR, 2024.
- [50] Cem Anil, Guodong Zhang, Yuhuai Wu, and Roger Grosse. Learning to give checkable answers with prover-verifier games, 2021.
- [51] Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of LLM outputs, 2024.
- [52] Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah. On scalable oversight with weak LLMs judging strong LLMs. In Advances in Neural Information Processing Systems, volume 37, 2024.
- [53] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- [54] David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
- [55] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer, 2000.
- [56] Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism, 2024. ICLR 2025.
- [57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022.
- [58] Zhijun Chen, Xiaodong Lu, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, Xiao Huang, Yikun Ban, Hailong Sun, and Philip S. Yu. Harnessing multiple large language models: A survey on LLM ensemble, 2025.
- [59] V. Venktesh, Mandeep Rathee, and Avishek Anand. Trust but verify! a survey on verification design for test-time scaling, 2025.
- [60] Joonhyuk Lee, Virginia Ma, Sarah Zhao, Yash Nair, Asher Spector, Regev Cohen, and Emmanuel J. Candès. FUSE: Ensembling verifiers with zero labeled data, 2026.
- [61] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel M. Ni, and Jian Guo. A survey on LLM-as-a-judge, 2024.
- [62] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [63] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [64] Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [65] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, B...
- [66] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors, 2023.
  A Broader Impacts. This work studies inference-time amplification fo...
- [67] Failing test hypothesis: state the smallest hypothesis about what is wrong, derived from the issue text alone. One sentence. This is the ground truth you compare both patches against.
- [68] A_changes: list the specific lines/functions Patch A modifies.
- [69] B_changes: list the specific lines/functions Patch B modifies.
- [70] A_consistent_with_hypothesis: do A’s changes plausibly cause the failing test in the issue to start passing? (true/false + one-line justification)
- [71] B_consistent_with_hypothesis: same question for B.
- [72] A_collateral: does A change behavior on inputs unrelated to the failure mode? (true/false + one-line justification)
- [73] B_collateral: same question for B. The decision falls out of this comparison; do not pull a winner from prior. If exactly one patch is consistent with the hypothesis and the other is not, that one wins. If both are consistent, prefer the one with less collateral. If both fail the hypothesis, output TIE. If they are functionally equivalent (same lines chan...
- [74] study scaling laws for compound inference systems, focusing on Vote and Filter-Vote architectures. They show that increasing the number of calls can yield non-monotone performance because easy and hard instances respond differently to majority voting. This complements our analysis: their theory focuses on flat voting systems and majority aggregation, ...
- [75] asks models to draft answers, generate verification questions, answer them independently, and then revise. Self-Refine [36] iteratively generates feedback and refines outputs without additional training data. Reflexion [35] stores verbal feedback in memory to improve subsequent trials. Tool-using systems also change the effective verification and proposal...
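The comparator decision rule quoted in items [67] to [73] above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the class and field names are hypothetical, and the handling of two consistent patches with equal collateral is an assumption, because the quoted text is truncated before it finishes describing the functionally equivalent case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchAssessment:
    """Answers to the per-patch questions (hypothetical field names)."""
    consistent_with_hypothesis: bool
    collateral: bool


def decide(a: PatchAssessment, b: PatchAssessment) -> str:
    """Return "A", "B", or "TIE" following the quoted comparison rules."""
    if a.consistent_with_hypothesis != b.consistent_with_hypothesis:
        # Exactly one patch is consistent with the hypothesis: it wins.
        return "A" if a.consistent_with_hypothesis else "B"
    if not a.consistent_with_hypothesis:
        # Both patches fail the hypothesis.
        return "TIE"
    if a.collateral != b.collateral:
        # Both are consistent: prefer the one with less collateral.
        return "B" if a.collateral else "A"
    # ASSUMPTION: equal collateral is treated as a tie, since the source
    # text is cut off before the equivalent-patch rule is complete.
    return "TIE"


print(decide(PatchAssessment(True, True), PatchAssessment(True, False)))
```

Keeping the decision as a pure function of the structured answers mirrors the prompt's instruction that the decision should fall out of the comparison rather than be chosen up front.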