arxiv: 2605.14163 · v1 · pith:XTKHQYMBnew · submitted 2026-05-13 · 💻 cs.AI

Agentic Systems as Boosting Weak Reasoning Models

Varun Sunkaraneni, Pierfrancesco Beneventano, Riccardo Neumarker, Tomaso Poggio, Tomer Galanti

Pith reviewed 2026-05-15 04:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic systems, reasoning models, committee search, inference-time boosting, weak model amplification, SWE-bench, local soundness signals, critic comparator

The pith

Weak reasoning models reach strong-model performance by recovering correct solutions already latent in their own proposal pools using local signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether committees built from many calls to a weak reasoning model can match the accuracy of much stronger standalone models. It separates the problem into proposal coverage, which repeated sampling can increase, and selection, which requires critics and comparators that use local soundness signals such as execution or test results. On SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0 percent of tasks; the same model with critic-comparator orchestration reaches 76.4 percent at eight proposals, matching Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0 percent oracle best-of-eight ceiling. A reader cares because the result reframes performance gains as an inference-time selection task rather than a requirement for larger trained models.

Core claim

The paper proves that coverage is amplified by repeated sampling but cannot alone produce useful critics or comparators; reliable amplification needs an additional local soundness signal such as execution, proof checking, or tests. It gives rank-based bounds on when local selection errors compose into reliable trajectories and shows that the proposer-side ceiling is the total mass of task slices on which the model assigns nonzero probability to a useful solution. Empirically, the critic-comparator system lifts the nano model from 67.0 percent to 76.4 percent on SWE-bench Verified, and the remaining errors are mostly proposal-coverage failures.
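The proposer-side ceiling can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not the paper's analysis: it assumes each task has a fixed per-sample success probability and that the k samples are independent. The function name `oracle_best_of_k` and the task mix in `tasks` are hypothetical and do not come from the paper.

```python
def oracle_best_of_k(success_probs, k):
    """Expected fraction of tasks with at least one correct proposal
    among k independent samples (assumes i.i.d. sampling per task)."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical task mix, NOT from the paper: per-sample success probabilities.
# 60 easy tasks, 20 hard tasks, and 20 tasks the proposer never solves.
tasks = [0.9] * 60 + [0.2] * 20 + [0.0] * 20

# Ceiling: share of tasks with nonzero probability of a useful proposal.
ceiling = sum(p > 0 for p in tasks) / len(tasks)

for k in (1, 8, 64, 512):
    print(f"k={k:>3}  oracle best-of-k = {oracle_best_of_k(tasks, k):.4f}")
print(f"ceiling = {ceiling:.4f}")
```

In this toy mix the curve rises with k but flattens at the share of tasks with nonzero success probability. The tasks with zero probability stay unsolved no matter how many samples are drawn, which is the sense in which sampling amplifies coverage only up to a ceiling.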

What carries the argument

The critic-comparator orchestration that ranks proposals by local soundness signals such as execution and test outcomes without access to any hidden verifier.
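One way to picture this orchestration is as a three-stage step: propose, filter, compare. The sketch below is a minimal illustration, not the authors' implementation. The function `committee_step`, its parameters, and the toy critic and comparator are all hypothetical stand-ins; in a real system the critic and comparator would consult local signals such as execution or test outcomes.

```python
from typing import Callable, List, Optional, TypeVar

C = TypeVar("C")


def committee_step(
    propose: Callable[[], C],
    critic: Callable[[C], bool],
    prefer: Callable[[C, C], bool],
    k: int,
) -> Optional[C]:
    """One proposer -> critic -> comparator step.

    propose(): draws one candidate (breadth).
    critic(c): True if c survives a local soundness check.
    prefer(a, b): True if the comparator prefers a over b.
    Returns None when no candidate survives the critic.
    """
    candidates: List[C] = [propose() for _ in range(k)]
    survivors = [c for c in candidates if critic(c)]
    if not survivors:
        return None
    best = survivors[0]
    for challenger in survivors[1:]:
        if prefer(challenger, best):
            best = challenger
    return best


# Toy stand-ins (hypothetical): candidates are integers, the critic is a
# public parity check, and the comparator prefers multiples of three.
def toy_critic(c: int) -> bool:
    return c % 2 == 0


def toy_prefer(a: int, b: int) -> bool:
    return a % 3 == 0 and b % 3 != 0


pool = iter([41, 40, 45, 42, 44])
print(committee_step(lambda: next(pool), toy_critic, toy_prefer, k=5))  # prints 42
```

The toy run selects 42 because it is the only candidate that passes the critic and wins the comparison. If 42 were absent from the pool, no choice of critic or comparator could return it, which mirrors the distinction between selection failures and coverage failures.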

If this is right

Repeated sampling increases proposal coverage but does not by itself create reliable critics or comparators.
Rank-based bounds determine when local selection errors still allow reliable overall trajectories.
The oracle best-of-k performance is bounded by the mass of task slices on which the proposal model assigns nonzero probability to a useful solution.
After boosting, remaining errors are mostly proposal-coverage failures that stronger selection alone cannot close.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Inference-time compute on weak models may substitute for some training-time increases in model size.
Domains lacking cheap, reliable local soundness signals are unlikely to benefit from this form of boosting.
Adding explicit diversity mechanisms to the proposal step could further narrow the gap to the oracle bound.

Load-bearing premise

Local soundness signals such as execution or test results reliably distinguish correct proposals from incorrect ones.

What would settle it

Construct a benchmark where local signals like tests or execution cannot separate correct from incorrect proposals; if the orchestration then fails to exceed the single-proposal baseline, the central claim is false.
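The logic of this test can be previewed in simulation. The sketch below is a toy model under loud assumptions, not an experiment from the paper: each proposal is correct independently with probability `p`, and the selector picks the proposal with the highest local score, where the score is `signal_strength` times correctness plus uniform noise. The function `selected_accuracy` and all parameter values are hypothetical.

```python
import random


def selected_accuracy(p, k, signal_strength, trials=20000, seed=0):
    """Fraction of trials in which the top-scored of k proposals is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        correct = [rng.random() < p for _ in range(k)]
        # Local score = signal_strength * correctness + uniform noise in [0, 1).
        # signal_strength = 0 makes the score independent of correctness;
        # signal_strength = 1 separates correct from incorrect proposals.
        scores = [signal_strength * ok + rng.random() for ok in correct]
        chosen = max(range(k), key=scores.__getitem__)
        hits += correct[chosen]
    return hits / trials


uninformative = selected_accuracy(p=0.3, k=8, signal_strength=0.0)
separating = selected_accuracy(p=0.3, k=8, signal_strength=1.0)
print(f"uninformative signal: {uninformative:.3f}")
print(f"separating signal:    {separating:.3f}")
```

With an uninformative signal the selector is effectively picking at random, so accuracy stays near the single-proposal rate `p`. With a fully separating signal it approaches the oracle best-of-k rate `1 - (1 - p)**k`. A real benchmark of the kind described above would correspond to the first regime.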

Figures

Figures reproduced from arXiv: 2605.14163 by Pierfrancesco Beneventano, Riccardo Neumarker, Tomaso Poggio, Tomer Galanti, Varun Sunkaraneni.

**Figure 1.** A committee of GPT-5.4 nano calls reaches much stronger models. Increasing proposer diversity lifts nano orchestration far above the nano baseline and up to Gemini 3 Pro and Claude Opus 4.5 Thinking. The oracle best-of-n curve shows that correct solutions are often already in the proposal pool; the remaining gap is selection. Dashed lines denote single-model resolve rates.

**Figure 2.** One step of the committee protocol. The protocol separates generation from identification. Proposers create breadth, critics remove locally refutable errors, and comparators select among surviving candidates. The theory below shows that this architecture amplifies weak local competence only when two distinct resources are present: proposal coverage and local identifiability.

**Figure 4.** Failure decomposition by benchmark category. For each category, we decompose tasks into those solved by orchestration, those that were oracle reachable but missed by selection, and those that were oracle unreachable under the proposal budget. The small oracle-reachable-but-missed segments indicate that most remaining failures are coverage failures rather than selection failures.
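The three-way decomposition in the last caption can be sketched as a small bookkeeping routine. This is an illustrative sketch, not the paper's analysis code. The per-category data behind the figure is not available here, so the records below are back-computed from the aggregate percentages reported in the abstract (76.4 percent solved, 79.0 percent oracle best-of-8), assuming the 500-task SWE-bench Verified set and assuming that orchestration selects from the same proposal pool the oracle sees. The function `decompose` and the record layout are hypothetical.

```python
from collections import Counter


def decompose(records):
    """Split tasks into solved / reachable-but-missed / unreachable fractions.

    Each record is (selected_ok, pool_has_correct):
      selected_ok      - the orchestration's chosen proposal is correct
      pool_has_correct - at least one proposal in the pool is correct
    """
    counts = Counter()
    for selected_ok, pool_has_correct in records:
        if selected_ok and not pool_has_correct:
            raise ValueError("a selected proposal cannot be correct if none in the pool is")
        if selected_ok:
            counts["solved"] += 1
        elif pool_has_correct:
            counts["reachable_but_missed"] += 1
        else:
            counts["unreachable"] += 1
    n = len(records)
    return {key: counts[key] / n
            for key in ("solved", "reachable_but_missed", "unreachable")}


# Back-computed aggregate counts (assumption: 500 tasks; not per-task paper data).
records = (
    [(True, True)] * 382      # 76.4% solved by orchestration
    + [(False, True)] * 13    # 79.0% - 76.4% = 2.6% reachable but missed
    + [(False, False)] * 105  # 100% - 79.0% = 21.0% unreachable at k=8
)
print(decompose(records))
```

Under these assumptions, selection misses account for 2.6 points and coverage gaps for 21.0 points, which is consistent with the caption's reading that most remaining failures are coverage failures.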

Original abstract

Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that "more agents help": samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks. Using the same nano model, our critic-comparator orchestration reaches 76.4% with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Weak models already propose many correct patches on SWE-bench; the real lift comes from critic-comparator selection, though the local signals may include test execution.

The main takeaway is that a weak nano model already generates enough correct solutions on SWE-bench that a simple critic-comparator loop on eight proposals can match the accuracy of much stronger standalone models. The 76.4% result is the number that matters here, and it sits close to the 79% oracle best-of-8 ceiling while the single-proposal baseline sits at 67%. That gap is the paper's central empirical point.

They formalize the problem by splitting proposal coverage from local identifiability, progress, and diversity, then prove that sampling alone amplifies coverage but cannot create reliable selectors without an additional local soundness signal such as execution or tests. The rank-based bounds on how selection errors accumulate are the cleanest part of the write-up and give a concrete way to think about when these committees stay reliable. The observation that remaining errors are mostly coverage failures is also useful because it points to where stronger models still add value.

The soft spot is exactly the one the stress test flags. On SWE-bench Verified, patch correctness is defined by running the task test suite. If the critics and comparators are allowed to execute those same tests on the proposals, then the selection step receives the verifier signal directly. That would collapse the claimed separation between coverage and identifiability and make the performance jump look more like standard test-based filtering than a pure weak-model committee effect. The paper needs to state the exact protocol for what signals the critics receive and whether they avoid the ground-truth tests.

This work is aimed at people building inference-time search and agentic systems for verifiable reasoning tasks. It has enough structure and a clear empirical result to deserve a serious referee, mainly to verify the derivations and the experimental details around the local signals.

Referee Report

2 major / 2 minor

Summary. The paper claims that committees of weak reasoning models can boost performance to match stronger models via verifier-backed critic-comparator orchestration at inference time. It separates proposal coverage, local identifiability, progress, and diversity; proves that repeated sampling amplifies coverage but requires an additional local soundness signal (e.g., execution or tests) to create reliable critics/comparators; gives rank-based bounds on when local selection errors compose into reliable trajectories; and reports that a single GPT-5.4 nano proposal solves 67.0% of SWE-bench Verified tasks while the same nano model with k=8 orchestrated proposals reaches 76.4%, matching Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 bound. The remaining failures are attributed to proposal-coverage gaps.

Significance. If the empirical protocol and theoretical separation hold, the work demonstrates that many correct solutions are already latent in weak-model proposal pools and that the primary bottleneck is selection rather than generation. The formalization of coverage amplification versus local identifiability, together with the rank-based bounds, supplies a useful analytical lens for agentic systems. The reproducible results on a public benchmark and the explicit oracle upper bound are concrete strengths that allow direct comparison with future work.

major comments (2)

[Experimental protocol and results] Experimental protocol (results section and SWE-bench Verified description): the manuscript does not state whether the critic and comparator are allowed to execute the task test suite on the k=8 proposals. Because SWE-bench Verified defines patch correctness precisely by running those tests, permitting execution would supply the hidden verifier signal directly to the selection step. This would collapse the claimed separation between proposal coverage and local identifiability that the rank-based bounds and amplification proofs rely on, and would re-interpret the jump from 67.0% to 76.4% as verifier-assisted selection rather than pure weak-model committee boosting.
[Theoretical results] Theoretical claims (abstract and § on formalization): proofs of coverage amplification by repeated sampling and the rank-based bounds on local selection errors are referenced but supplied without explicit derivations, assumption statements, or error analysis. Because these results are load-bearing for the central claim that “reliable amplification requires an additional local soundness signal,” the absence of the derivations prevents verification that the bounds are non-vacuous and that the separation between coverage and identifiability is rigorously maintained.

minor comments (2)

[Formalization section] Notation for “progress” and “diversity” is introduced in the formalization but not consistently referenced in the subsequent rank-bound statements; adding a short table of symbols would improve readability.
[Results] The oracle best-of-8 bound is reported as 79.0% without an accompanying breakdown by task slice or failure mode; a supplementary table showing the mass of tasks on which the nano proposer assigns zero useful probability would make the “proposal-coverage ceiling” claim more concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to improve clarity on the experimental protocol and to include the requested theoretical derivations.

Referee: [Experimental protocol and results] Experimental protocol (results section and SWE-bench Verified description): the manuscript does not state whether the critic and comparator are allowed to execute the task test suite on the k=8 proposals. Because SWE-bench Verified defines patch correctness precisely by running those tests, permitting execution would supply the hidden verifier signal directly to the selection step. This would collapse the claimed separation between proposal coverage and local identifiability that the rank-based bounds and amplification proofs rely on, and would re-interpret the jump from 67.0% to 76.4% as verifier-assisted selection rather than pure weak-model committee boosting.

Authors: We thank the referee for highlighting the need for explicit protocol details. In our verifier-backed committee search, the critic and comparator are explicitly permitted to execute proposals against the task test suite; this execution constitutes the local soundness signal formalized in the paper (e.g., tests or constraint solving). The 'hidden verifier' referenced in the manuscript denotes an oracle that would directly reveal ground-truth correctness without any computation, whereas test execution supplies a noisy but computable local signal that enables identifiability. The separation between coverage (supplied by the weak model's proposal distribution) and local identifiability (supplied by the critic/comparator using the execution signal) is therefore preserved by design. The empirical jump from 67.0% to 76.4% is precisely the result of using this local signal to select among proposals. We will revise the results section and SWE-bench description to state the exact signals used by the critic and comparator, confirming that test-suite execution is the local verifier while the final reported accuracy uses the hidden ground-truth evaluation only for scoring. revision: yes
Referee: [Theoretical results] Theoretical claims (abstract and § on formalization): proofs of coverage amplification by repeated sampling and the rank-based bounds on local selection errors are referenced but supplied without explicit derivations, assumption statements, or error analysis. Because these results are load-bearing for the central claim that “reliable amplification requires an additional local soundness signal,” the absence of the derivations prevents verification that the bounds are non-vacuous and that the separation between coverage and identifiability is rigorously maintained.

Authors: We agree that the derivations must be supplied for the theoretical claims to be verifiable. The coverage-amplification result (showing that the probability of including at least one correct proposal grows with k under independent sampling) and the rank-based bounds (showing when local selection errors compose into reliable trajectories) were derived under standard assumptions but omitted from the main text for brevity. We will add a new appendix containing the full proofs, explicit assumption statements (i.i.d. sampling from the proposal model, bounded local error rate of the critic/comparator, and rank-order preservation), and an error analysis confirming that the bounds are non-vacuous for the observed proposal-quality regime on SWE-bench Verified. This addition will rigorously substantiate the necessity of the local soundness signal for reliable amplification. revision: yes
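For orientation, the two ingredients named in this response have familiar textbook forms. The block below is a generic sketch under the i.i.d. and bounded-error assumptions listed above; it is not the paper's derivation and not its rank-based bound. The symbols $p_x$ (per-sample success probability on task $x$), $\varepsilon$ (per-step selection error), and $T$ (trajectory length) are introduced here for illustration only.

```latex
% Coverage amplification under i.i.d. sampling: task x has per-sample
% success probability p_x, and k proposals are drawn independently.
\[
  \Pr[\text{at least one correct proposal among } k]
  = 1 - (1 - p_x)^k
  \;\xrightarrow[k \to \infty]{}\;
  \mathbf{1}[p_x > 0].
\]
% Averaging over tasks gives the proposer-side ceiling.
\[
  \lim_{k \to \infty} \mathbb{E}_x\!\left[ 1 - (1 - p_x)^k \right]
  = \Pr_x[\, p_x > 0 \,].
\]
% Composition by a union bound: if each of T local selection steps errs
% with probability at most \varepsilon, then
\[
  \Pr[\text{all } T \text{ steps select correctly}] \;\ge\; 1 - T\varepsilon .
\]
```

The last inequality is non-vacuous only when $T\varepsilon < 1$, which is one simple way to see why a bounded local error rate matters for the composition of selection steps.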

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's central claims are supported by empirical measurements on the public SWE-bench Verified benchmark and theoretical arguments based on standard sampling, ranking, and amplification bounds. No load-bearing step reduces to a self-definition, fitted input renamed as prediction, or self-citation chain. The distinction between local soundness signals and the hidden verifier is maintained without circular reduction, and the performance numbers (67.0% to 76.4%) are reported as direct observations rather than derived by construction from the model inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of usable local soundness signals and on the assumption that the weak model's proposal distribution places positive mass on correct solutions for a non-trivial fraction of tasks; no free parameters or invented entities are introduced in the abstract.

axioms (1)

standard math Standard probabilistic bounds on repeated sampling and rank-based selection errors
Invoked to prove that coverage can be amplified and to bound when local selection errors compose into reliable trajectories.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "rank-based bounds showing when local selection errors compose into reliable trajectories"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

[1]

Schapire

Robert E. Schapire. The strength of weak learnability.Machine Learning, 5(2):197–227, 1990

work page 1990
[2]

Schapire

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. InComputational Learning Theory: Second European Con- ference, EuroCOLT 1995, volume 904 ofLecture Notes in Computer Science, pages 23–37. Springer, 1995

work page 1995
[3]

Schapire

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences, 55(1):119–139, 1997

work page 1997
[4]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought rea- soning in language models. InInternational Conference on Learning Representations, 2023

work page 2023
[5]

More agents is all you need

Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. Transactions on Machine Learning Research, 2024

work page 2024
[6]

Le, Christopher Ré, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024

work page 2024
[7]

Zaharia, and James Y

Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A. Zaharia, and James Y . Zou. Are more LM calls all you need? towards the scaling properties of com- pound AI systems. InAdvances in Neural Information Processing Systems, volume 37, pages 45767–45790, 2024

work page 2024
[8]

Is best-of-N the best of them? coverage, scaling, and optimality in inference- time alignment, 2025

Audrey Huang, Adam Block, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, and Akshay Krishnamurthy. Is best-of-N the best of them? coverage, scaling, and optimality in inference- time alignment, 2025

work page 2025
[9]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language mod- els. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[10]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

work page 2021
[11]

Let’s verify step by step, 2023

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023

work page 2023
[12]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations, 2024

work page 2024
[13]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated soft- ware engineering. InAdvances in Neural Information Processing Systems, volume 37, 2024

work page 2024
[14]

AutoCodeRover: Au- tonomous program improvement, 2024

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Au- tonomous program improvement, 2024. Published version appears in the ACM SIGSOFT/IS- STA proceedings

work page 2024
[15]

Demystifying LLM- based software engineering agents.Proceedings of the ACM on Software Engineering, 2(FSE):801–824, 2025

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Demystifying LLM- based software engineering agents.Proceedings of the ACM on Software Engineering, 2(FSE):801–824, 2025

work page 2025
[16]

Zico Kolter

Hariharan Manikandan, Yiding Jiang, and J. Zico Kolter. Language models are weak learners. InAdvances in Neural Information Processing Systems, volume 36, 2023. 11

work page 2023
[17]

Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschen- brenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of...

work page 2024
[18]

LLMBoost: Make large language models stronger with boosting, 2025

Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, and Yikun Ban. LLMBoost: Make large language models stronger with boosting, 2025. Preprint; also submitted to ICLR 2026 on OpenReview

work page 2025
[19]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, and Denny Zhou. Rationale-augmented ensembles in language models, 2022

work page 2022
[20]

Universal self-consistency for large language model generation, 2023

Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation, 2023

work page 2023
[21]

A survey on test-time scaling in large language models: What, how, where, and how well?, 2025

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, and Chen Ma. A survey on test-time scaling in large language models: What, how, where, and how well?, 2025

work page 2025
[22]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024

work page 2024
[23]

Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, Bangkok, Thailand, August 2024. Association...

work page 2024
[24]

LLM-blender: Ensembling large language models with pairwise ranking and generative fusion

Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165– 14178, Toronto, Canada, July 2023. Association for Computational Linguistics

work page 2023
[25]

McIlraith, and Yilun Du

Shalev Lifshitz, Sheila A. McIlraith, and Yilun Du. Multi-agent verification: Scaling test-time compute with multiple verifiers, 2025

work page 2025
[26]

Kelly Buchanan, Mayee F

Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaugh- lin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, and Christopher Ré. Shrinking the generation-verification gap with weak verifiers, 2025

work page 2025
[27]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[28]

Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James V . Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hanna Hajishirzi. Rewardbench: Evaluating reward models for language modeling, 2024

work page 2024
[29]

CRITIC: Large language models can self-correct with tool-interactive critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In International Conference on Learning Representations, 2024

work page 2024
[30]

Chain-of-verification reduces hallucination in large language models

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyil- maz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 3563–3578, Bangkok, Thailand, August 2024. Association for Computational Linguistics. 12

work page 2024
[31]

Le, and Ed H

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schu- urmans, Claire Cui, Olivier Bousquet, Quoc V . Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. InInternational Conference on Learning Representations, 2023

work page 2023
[32]

Reasoning with language model is planning with world model

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahong Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, Singapore, 2023. Association for Computational Linguistics

work page 2023
[33]

Language agent tree search unifies reasoning, acting, and planning in language models, 2023

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models, 2023

work page 2023
[34]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Confer- ence on Learning Representations, 2023

work page 2023
[35]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[36]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegr- effe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bod- hisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing...

work page 2023
[37]

CAMEL: Communicative agents for “mind” exploration of large scale language model society, 2023

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large scale language model society, 2023

work page 2023
[38]

White, Doug Burger, and Chi Wang

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023

work page 2023
[39]

MetaGPT: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and J"urgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representa- tions, 2024

work page 2024
[40]

Mixture-of-agents enhances large language model capabilities, 2024

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities, 2024

work page 2024
[41]

Chen, Neel Guha, Christopher Ré, and Azalia Mirhoseini

Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, Estefany Kelly Buchanan, Mayee F. Chen, Neel Guha, Christopher Ré, and Azalia Mirhoseini. Archon: An architecture search framework for inference-time techniques, 2024

[42]

Neel Guha, Mayee F. Chen, Trevor Chow, Ishan S. Khare, and Christopher Ré. Smoothie: Label free language model routing, 2024

[43]

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Sutherland Robson, Pushme...

[44]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavar...

[45]

Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate, 2018

[46]

Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts, 2018

[47]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 11733–11763. PMLR, 2024

[48]

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, Miami, Florida, USA, November 2024. Association f...

[49]

Jonah Brown-Cohen, Geoffrey Irving, and Georgios Piliouras. Scalable AI safety via doubly-efficient debate. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 4585–4602. PMLR, 2024

[50]

Cem Anil, Guodong Zhang, Yuhuai Wu, and Roger Grosse. Learning to give checkable answers with prover-verifier games, 2021

[51]

Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of LLM outputs, 2024

[52]

Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah. On scalable oversight with weak LLMs judging strong LLMs. In Advances in Neural Information Processing Systems, volume 37, 2024

[53]

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996

[54]

David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992

[55]

Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer, 2000

[56]

Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism, 2024. ICLR 2025

[57]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

[58]

Zhijun Chen, Xiaodong Lu, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, Xiao Huang, Yikun Ban, Hailong Sun, and Philip S. Yu. Harnessing multiple large language models: A survey on LLM ensemble, 2025

[59]

V. Venktesh, Mandeep Rathee, and Avishek Anand. Trust but verify! a survey on verification design for test-time scaling, 2025.

[60]

Joonhyuk Lee, Virginia Ma, Sarah Zhao, Yash Nair, Asher Spector, Regev Cohen, and Emmanuel J. Candès. FUSE: Ensembling verifiers with zero labeled data, 2026

[61]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel M. Ni, and Jian Guo. A survey on LLM-as-a-judge, 2024

[62]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36, 2023

[63]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems, volume 36, 2023

[64]

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. In Advances in Neural Information Processing Systems, volume 36, 2023

[65]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, B...

[66]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors, 2023.

A Broader Impacts

This work studies inference-time amplification fo...


Failing test hypothesis: state the smallest hypothesis about what is wrong, derived from the issue text alone. One sentence. This is the ground truth you compare both patches against


A_changes: list the specific lines/functions Patch A modifies


B_changes: list the specific lines/functions Patch B modifies


A_consistent_with_hypothesis: do A’s changes plausibly cause the failing test in the issue to start passing? (true/false + one-line justification)


B_consistent_with_hypothesis: same question for B


A_collateral: does A change behavior on inputs unrelated to the failure mode? (true/false + one-line justification)


B_collateral: same question for B. The decision falls out of this comparison; do not pull a winner from prior. If exactly one patch is consistent with the hypothesis and the other is not, that one wins. If both are consistent, prefer the one with less collateral. If both fail the hypothesis, output TIE. If they are functionally equivalent (same lines chan...
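The decision rules that are stated in full above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `PatchReport` and `decide` are hypothetical names, the functional-equivalence case is left out, and returning TIE when both patches are consistent and have equal collateral is an assumption, since the rules quoted here do not resolve that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchReport:
    """Hypothetical container for the comparator's answers about one patch."""
    consistent_with_hypothesis: bool  # the *_consistent_with_hypothesis field
    collateral: bool                  # the *_collateral field


def decide(a: PatchReport, b: PatchReport) -> str:
    """Return "A", "B", or "TIE" using only the fully stated rules."""
    # Exactly one patch is consistent with the hypothesis: that one wins.
    if a.consistent_with_hypothesis != b.consistent_with_hypothesis:
        return "A" if a.consistent_with_hypothesis else "B"
    # Both patches fail the hypothesis: output TIE.
    if not a.consistent_with_hypothesis:
        return "TIE"
    # Both are consistent: prefer the one with less collateral.
    if a.collateral != b.collateral:
        return "B" if a.collateral else "A"
    # Assumption: equal collateral is treated as a tie.
    return "TIE"
```

Because the verdict is a pure function of the four boolean fields, the winner cannot depend on anything the comparator did not write down, which matches the instruction not to pull a winner from prior.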


study scaling laws for compound inference systems, focusing on Vote and Filter-Vote architectures. They show that increasing the number of calls can yield non-monotone performance because easy and hard instances respond differently to majority voting. This complements our analysis: their theory focuses on flat voting systems and majority aggregation, ...
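The non-monotone effect of majority voting can be illustrated with a toy calculation. The sketch below is not the cited authors' model: it assumes binary answers, independent calls, an odd number of calls, and a hypothetical task mix in which half the instances are easy (per-call accuracy 0.9) and half are hard (per-call accuracy 0.45). The function names and all parameter values are illustrative.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k independent calls is correct.

    Assumes binary answers and odd k, so ties cannot occur.
    """
    return sum(
        comb(k, j) * p**j * (1 - p) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )


def mixture_accuracy(k: int, easy_frac: float = 0.5,
                     p_easy: float = 0.9, p_hard: float = 0.45) -> float:
    """Expected accuracy over a mix of easy and hard instances (toy values)."""
    return (easy_frac * majority_vote_accuracy(p_easy, k)
            + (1 - easy_frac) * majority_vote_accuracy(p_hard, k))


if __name__ == "__main__":
    for k in (1, 3, 5, 7, 15, 51):
        print(k, round(mixture_accuracy(k), 4))
```

Under these assumptions, voting pushes easy instances toward certain success and hard instances (per-call accuracy below one half) toward certain failure. The easy instances saturate quickly while the hard ones degrade slowly, so the mixture improves for a few calls and then declines.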


asks models to draft answers, generate verification questions, answer them independently, and then revise. Self-Refine [36] iteratively generates feedback and refines outputs without additional training data. Reflexion [35] stores verbal feedback in memory to improve subsequent trials. Tool-using systems also change the effective verification and proposal...
