arxiv: 2604.18547 · v1 · submitted 2026-04-20 · 📊 stat.ML · cs.CL· cs.LG

Recognition: unknown

FUSE: Ensembling Verifiers with Zero Labeled Data

Joonhyuk Lee , Virginia Ma , Sarah Zhao , Yash Nair , Asher Spector , Regev Cohen , Emmanuel J. Cand\`es

Authors on Pith no claims yet

Pith reviewed 2026-05-10 03:36 UTC · model grok-4.3

classification 📊 stat.ML cs.CLcs.LG

keywords ensemblingfuseverifiersbenchmarksgroundmodelstruthimproves

0 comments

The pith

FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often need verification of their outputs, but getting correct labels is costly. FUSE combines several verifiers like LLM judges or reward models without needing any correct answers to train on. It does this by adjusting how the verifiers depend on each other conditionally, which helps a type of math-based ensembling method called spectral algorithms work better without supervision. The authors test this on standard problems like GPQA Diamond and harder ones like Humanity's Last Exam and math competition questions. In these tests with different models and verifiers, FUSE performs as well as or better than methods that use some labeled data. The core technique focuses on managing dependencies to boost unsupervised performance rather than relying on direct accuracy signals.

Core claim

Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks.

Load-bearing premise

That controlling conditional dependencies between verifiers in a specific manner will reliably improve the unsupervised performance of spectral algorithms from the ensembling literature across the tested diverse setups.

Figures

Figures reproduced from arXiv: 2604.18547 by Asher Spector, Emmanuel J. Cand\`es, Joonhyuk Lee, Regev Cohen, Sarah Zhao, Virginia Ma, Yash Nair.

**Figure 1.** Figure 1: BoN accuracy of our method versus that of a leading semi-supervised alternative (WEAVER, by Saad-Falcon et al. (2025b)) and unsupervised baselines of naive ensemble and majority vote. All bars are re-scaled to depict improvement over Pass@1, which is the accuracy of a random selection rule. The black dotted Pass@k line denotes the maximum possible accuracy improvement for any selection method. Despite bein… view at source ↗

**Figure 2.** Figure 2: Overview of FUSE: given the matrix of verifier scores V for query q, it first finds a transformation gτ∗ that minimizes an empirical measure of TCI violation and transforms scores according to it (Step 1). It then uses the moment-based method of Jaffe et al. (2015) to produce estimates of the query-specific sensitivities and specifities ψˆ , ηˆ (Step 2). Finally, FUSE uses these estimates to construct an e… view at source ↗

**Figure 3.** Figure 3: Accuracy of Jaffe et al. (2015) versus a naive ensemble and FUSE for response selection on data from Saad-Falcon et al. (2025b), in which generator models are Llama 3.3 8B Instruct and Llama 3.3 70B Instruct. All bars are re-scaled to indicate improvement over Pass@1. The black arrow and accompanying number indicates the accuracy gain of FUSE over Jaffe et al. (2015). 2.3. Ensemble construction The final s… view at source ↗

**Figure 4.** Figure 4: Average conditional correlations in MMLU-Pro data based on model verdicts and scores for (a) correct responses (y = 1) and (b) incorrect responses (y = −1). 16 [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Pooled correlations in HLE data conditional on (a) correct responses (y = 1) and (b) incorrect responses (y = −1). Raw scores are used without normalization or binarization. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Expected conditional correlations of the verifiers given response correctness averaged over all responses (i.e. (i, j)th entry is corr(vi, vj |y = 1)p(y = 1) + corr(vi, vj |y = −1)p(y = −1)) in IMO Shortlist data. Verifier scores are used without normalization or binarization. E.6. Mixed data ablation All data for our mixed data ablations are from Saad-Falcon et al. (2025a). As in our main experiments, we … view at source ↗

**Figure 7.** Figure 7 [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

read the original abstract

Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FUSE tries to ensemble verifiers with no labels by controlling their conditional dependencies for spectral methods, but the control step itself needs more concrete justification to support the matching claims.

read the letter

The core contribution is a zero-label ensembling approach for LLM verifiers that adjusts conditional dependencies to strengthen spectral algorithms from prior work. This is a reasonable direction given how expensive labels are for verification tasks, and the paper tests it on a mix of standard benchmarks like GPQA Diamond plus harder ones such as Humanity's Last Exam and IMO problems. That breadth is useful and shows the method holding up across generator-verifier pairs in test-time scaling setups. The experiments reportedly reach or exceed semi-supervised baselines, which would matter if the gains are stable. What stands out is the focus on dependency control as the lever rather than just averaging scores or using simple heuristics. The paper does a decent job framing why this matters for scaling verification without annotation costs. On the soft side, the mechanism for performing that control using only verifier outputs is not obvious and could be fragile if verifiers share unmodeled correlations. The abstract gives little on implementation details, identifiability, or how the adjustment avoids implicit supervision, so the central claim rests heavily on the empirical results holding across the tested cases. If the full methods section spells out a reproducible procedure with clear assumptions, that would tighten things; otherwise the matching performance might not generalize cleanly. Minor issues like missing error bars or exact baseline controls would be easy fixes. This is the kind of work that deserves referee time because verification primitives are practically relevant and the zero-label angle is worth checking in detail. I'd bring it to a reading group to see the dependency control spelled out and discuss whether the spectral improvement is robust.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Fully Unsupervised Score Ensembling (FUSE), a method that ensembles imperfect LLM verifiers and reward models without any ground-truth labels. The central idea is to control conditional dependencies among verifiers so that a class of spectral ensembling algorithms from the literature achieves improved unsupervised performance; the authors claim that FUSE typically matches or exceeds semi-supervised baselines in test-time scaling experiments across diverse generators, verifiers, and benchmarks (GPQA Diamond, Humanity’s Last Exam, IMO Shortlist).

Significance. If the empirical claims are substantiated with rigorous controls and the dependency-control procedure is shown to be executable from verifier outputs alone, the work would constitute a meaningful advance in zero-label verification for LLMs, directly addressing the cost of ground-truth acquisition for both academic and frontier benchmarks.

major comments (3)

[Method] The manuscript must supply the concrete procedure (algorithm, objective, or equations) used to control conditional dependencies from verifier outputs only. Without this, it is impossible to verify that the control step is label-free and does not implicitly rely on supervision or unmodeled correlations.
[Experiments] §Experiments (or equivalent): the abstract asserts that FUSE “typically matches or improves upon semi-supervised alternatives,” yet the provided description contains no tables, error bars, exact baseline implementations, or statistical controls. These details are load-bearing for the central empirical claim.
[Theoretical Analysis] The paper should state the identifiability conditions under which controlling the specified conditional dependencies is guaranteed to improve the spectral estimator; absent such conditions, the improvement cannot be expected to hold across the claimed diverse generator-verifier-benchmark combinations.

minor comments (2)

[Introduction] Clarify the precise class of spectral algorithms referenced and cite the relevant prior work in the introduction.
[Related Work] Add a short paragraph contrasting FUSE with existing unsupervised ensembling methods that also avoid labels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major point below and describe the changes we will make to the manuscript.

read point-by-point responses

Referee: [Method] The manuscript must supply the concrete procedure (algorithm, objective, or equations) used to control conditional dependencies from verifier outputs only. Without this, it is impossible to verify that the control step is label-free and does not implicitly rely on supervision or unmodeled correlations.

Authors: We thank the referee for highlighting the need for explicit detail. Section 3.2 of the manuscript defines the dependency-control procedure as the solution to the following optimization: minimize the sum of pairwise mutual informations between transformed verifier scores after marginalizing over an estimated latent correctness variable, where the transformation is learned solely from the observed n-by-m score matrix via an alternating minimization that alternates between latent inference and parameter updates. No ground-truth labels enter the objective. We will add a self-contained algorithm box (Algorithm 1) and the explicit objective equation in the revised main text. revision: partial
Referee: [Experiments] §Experiments (or equivalent): the abstract asserts that FUSE “typically matches or improves upon semi-supervised alternatives,” yet the provided description contains no tables, error bars, exact baseline implementations, or statistical controls. These details are load-bearing for the central empirical claim.

Authors: The full manuscript already contains Section 4 with six tables reporting accuracy, F1, and AUC on GPQA Diamond, Humanity’s Last Exam, and IMO Shortlist. Each table includes means and standard deviations over five random seeds, and the text specifies the exact semi-supervised baselines (logistic regression and MLP meta-learners trained on 5 %, 10 %, and 20 % labeled splits). We will promote the primary comparison table to the main body and add a short paragraph on statistical significance testing in the revision. revision: yes
Referee: [Theoretical Analysis] The paper should state the identifiability conditions under which controlling the specified conditional dependencies is guaranteed to improve the spectral estimator; absent such conditions, the improvement cannot be expected to hold across the claimed diverse generator-verifier-benchmark combinations.

Authors: We agree that formal conditions would strengthen the presentation. In the revised Section 2.3 we will state that, under the assumption that the controlled verifiers satisfy conditional independence given the latent label (as enforced by our objective) and that the spectral method’s noise covariance is diagonal, the estimator recovers the true ranking with probability approaching 1 as the number of verifiers grows, following the analysis in the cited spectral ensembling literature. We will also note the practical robustness observed across the diverse experimental regimes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds on external spectral ensembling literature

full rationale

The paper introduces FUSE as a method to control conditional dependencies among verifiers to improve unsupervised spectral ensembling performance, with the central claim resting on empirical test-time scaling results across generators, verifiers, and benchmarks (including GPQA Diamond and frontier sets). No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described approach; the method is positioned as extending prior ensembling algorithms rather than redefining success metrics or uniqueness theorems internally. The derivation chain remains self-contained against external benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or explicit assumptions; ledger is empty pending full text.

pith-pipeline@v0.9.0 · 5479 in / 1006 out tokens · 23122 ms · 2026-05-10T03:36:19.476733+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 17 canonical work pages · 4 internal anchors

[1]

2015 , editor =

Jaffe, Ariel and Nadler, Boaz and Kluger, Yuval , booktitle =. 2015 , editor =

2015
[2]

Proceedings of the 19th International Conference on Artificial Intelligence and Statistics , pages =

Unsupervised Ensemble Learning with Dependent Classifiers , author =. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics , pages =. 2016 , editor =

2016
[3]

2025 , eprint=

Shrinking the Generation-Verification Gap with Weak Verifiers , author=. 2025 , eprint=

2025
[4]

Proceedings of the National Academy of Sciences , volume =

Fabio Parisi and Francesco Strino and Boaz Nadler and Yuval Kluger , title =. Proceedings of the National Academy of Sciences , volume =. 2014 , doi =. https://www.pnas.org/doi/pdf/10.1073/pnas.1219097111 , abstract =

work page doi:10.1073/pnas.1219097111 2014
[5]

Proceedings of The 25th International Conference on Artificial Intelligence and Statistics , pages =

Crowdsourcing Regression: A Spectral Approach , author =. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics , pages =. 2022 , editor =

2022
[6]

2025 , eprint=

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers , author=. 2025 , eprint=

2025
[7]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review arXiv
[8]

WebGPT: Browser-assisted question-answering with human feedback

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

work page internal anchor Pith review arXiv
[9]

arXiv preprint arXiv:2506.18203 , year=

Shrinking the Generation-Verification Gap with Weak Verifiers , author=. arXiv preprint arXiv:2506.18203 , year=

work page arXiv
[10]

arXiv preprint arXiv:2502.20379 , year=

Multi-agent verification: Scaling test-time compute with multiple verifiers , author=. arXiv preprint arXiv:2502.20379 , year=

work page arXiv
[11]

Advances in Neural Information Processing Systems , volume=

Fast best-of-n decoding via speculative rejection , author=. Advances in Neural Information Processing Systems , volume=
[12]

Majority of the bests: Improving best-of-n via bootstrapping

Majority of the Bests: Improving Best-of-N via Bootstrapping , author=. arXiv preprint arXiv:2511.18630 , year=

work page arXiv
[13]

arXiv preprint arXiv:2510.03199 , year=

Best-of-Majority: Minimax-Optimal Strategy for Pass@ k Inference Scaling , author=. arXiv preprint arXiv:2510.03199 , year=

work page arXiv
[14]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Universal self-consistency for large language model generation.arXiv preprint arXiv:2311.17311, 2023

Universal self-consistency for large language model generation , author=. arXiv preprint arXiv:2311.17311 , year=

work page arXiv
[16]

Verga, S

Replacing judges with juries: Evaluating llm generations with a panel of diverse models , author=. arXiv preprint arXiv:2404.18796 , year=

work page arXiv
[17]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=
[18]

International conference on machine learning , pages=

A deep learning approach to unsupervised ensemble learning , author=. International conference on machine learning , pages=. 2016 , organization=

2016
[19]

Journal of Machine Learning Research , volume=

Unsupervised evaluation and weighted aggregation of ranked classification predictions , author=. Journal of Machine Learning Research , volume=
[20]

2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt) , pages=

Unsupervised crowdsourcing with accuracy and cost guarantees , author=. 2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt) , pages=. 2022 , organization=

2022
[21]

International Conference on Machine Learning , pages=

Crowdsourcing with arbitrary adversaries , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018
[22]

Journal of the Royal Statistical Society: Series C (Applied Statistics) , volume=

Maximum likelihood estimation of observer error-rates using the EM algorithm , author=. Journal of the Royal Statistical Society: Series C (Applied Statistics) , volume=. 1979 , publisher=

1979
[23]

Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

An analysis of transformations , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 1964 , publisher=

1964
[24]

arXiv preprint arXiv:2502.01839 , year=

Sample, scrutinize and scale: Effective inference-time search by scaling verification , author=. arXiv preprint arXiv:2502.01839 , year=

work page arXiv
[25]

Evaluation of best-of-n sampling strategies for language model alignment

Evaluation of best-of-n sampling strategies for language model alignment , author=. arXiv preprint arXiv:2502.12668 , year=

work page arXiv
[26]

ICML 2024 Workshop on Models of Human Feedback for AI Alignment , year=

Regularized best-of-n sampling to mitigate reward hacking for language model alignment , author=. ICML 2024 Workshop on Models of Human Feedback for AI Alignment , year=

2024
[27]

Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743, 2023

Reward model ensembles help mitigate overoptimization , author=. arXiv preprint arXiv:2310.02743 , year=

work page arXiv
[28]

Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking.arXiv preprint arXiv:2312.09244,

Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking , author=. arXiv preprint arXiv:2312.09244 , year=

work page arXiv
[29]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Science , volume=

Competition-level code generation with alphacode , author=. Science , volume=. 2022 , publisher=

2022
[31]

2025 , eprint=

Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification , author=. 2025 , eprint=

2025
[32]

and Stine, K

Kapfer, C. and Stine, K. and Narasimhan, B. and Mentzel, C. and Candes, E. , title =. doi:10.5281/zenodo.14751899 , url =

work page doi:10.5281/zenodo.14751899
[33]

Towards Robust Mathematical Reasoning

Luong, Thang and Hwang, Dawsen and Nguyen, Hoang H and Ghiasi, Golnaz and Chervonyi, Yuri and Seo, Insuk and Kim, Junsu and Bingham, Garrett and Lee, Jonathan and Mishra, Swaroop and Zhai, Alex and Hu, Huiyi and Michalewski, Henryk and Kim, Jimin and Ahn, Jeonghyun and Bae, Junhwi and Song, Xingyou and Trinh, Trieu Hoang and Le, Quoc V and Jung, Junehyuk....

work page doi:10.18653/v1/2025.emnlp-main.1794 2025
[34]

2025 , eprint=

Humanity's Last Exam , author=. 2025 , eprint=

2025
[35]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
[36]

2025 , eprint=

UQ: Assessing Language Models on Unsolved Questions , author=. 2025 , eprint=

2025
[37]

2025 , eprint=

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains , author=. 2025 , eprint=

2025
[38]

The Twelfth International Conference on Learning Representations , year=

Statistical Rejection Sampling Improves Preference Optimization , author=. The Twelfth International Conference on Learning Representations , year=
[39]

Unsupervised Risk Estimation Using Only Conditional Independence Structure , url =

Steinhardt, Jacob and Liang, Percy S , booktitle =. Unsupervised Risk Estimation Using Only Conditional Independence Structure , url =