Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Laura Wynter; Nirvik Sahoo; Paul Griffin

arXiv: 2606.06941 · v1 · pith:HFKROB6Mnew · submitted 2026-06-05 · 💻 cs.AI

Pith reviewed 2026-06-27 21:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords: evidence selection, chain-of-thought aggregation, higher-order binary optimization, legal reasoning, hypothesis evidence pools, minority hypothesis preservation, quantum-inspired optimization

The pith

Treating chain-of-thought selection as higher-order binary optimization preserves minority hypotheses in legal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to frame the aggregation of reasoning fragments from multiple chain-of-thought traces as a combinatorial optimization problem rather than using majority vote. This approach, called EP-HUBO, uses higher-order unconstrained binary optimization with weights derived from relevance, specificity, and distinctiveness to select evidence sets for each hypothesis. It delegates final adjudication to a frontier model after solving the optimization, either on classical hardware or a photonic quantum machine. The method aims to handle subtle distinctions in evidence-intensive domains like law where popular answers may not have the strongest support. A sympathetic reader would care because it offers a way to let well-supported but less common answers prevail over noisy majorities.

Core claim

EP-HUBO generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights, and delegates a single adjudication call per question to a frontier model. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
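
The parsing step, in which fragments from several traces are grouped into per-hypothesis evidence pools, can be pictured with a small sketch. This is not the authors' parser: the input shape (one answer label plus a list of fragments per trace), the function name `build_evidence_pools`, and the choice to drop verbatim repeats are all assumptions made for illustration.

```python
from collections import defaultdict


def build_evidence_pools(traces):
    """Group reasoning fragments by the hypothesis (answer option) they support.

    `traces` is a list of (hypothesis_label, [fragment, ...]) pairs, one pair
    per sampled CoT trace. This input shape is an assumption for illustration.
    """
    pools = defaultdict(list)
    for label, fragments in traces:
        for fragment in fragments:
            # Assumed design choice: skip verbatim repeats so that a popular
            # answer does not gain weight through repetition alone.
            if fragment not in pools[label]:
                pools[label].append(fragment)
    return dict(pools)


# Hypothetical traces: two support option A, one supports option B.
traces = [
    ("A", ["the contract lacked consideration", "no offer was made"]),
    ("A", ["the contract lacked consideration"]),
    ("B", ["the statute of frauds requires a writing"]),
]
pools = build_evidence_pools(traces)
```

Each pool would then be optimised separately, so a hypothesis backed by a single trace still gets its own evidence set rather than being outvoted before selection.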

What carries the argument

Evidence Pool Higher-Order Binary Optimisation (EP-HUBO), which formulates evidence selection as a HUBO problem with quality-derived weights for relevance, specificity, and distinctiveness.
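
For orientation, a HUBO objective is a polynomial over binary selection variables that includes terms above second order. This page does not reproduce the paper's objective, so the form below is only a generic sketch: the coefficients $w_i$, $J_{ij}$, $K_{ijk}$ and the cubic cut-off are assumptions for illustration, with $x_i = 1$ meaning that fragment $i$ is kept in the evidence set for one hypothesis.

```latex
\min_{x \in \{0,1\}^n} \; E(x)
  = -\sum_{i} w_i x_i
    + \sum_{i<j} J_{ij} x_i x_j
    + \sum_{i<j<k} K_{ijk} x_i x_j x_k
```

Under this reading, $w_i$ would carry the per-fragment quality score, while the pairwise and higher-order terms could penalise redundant fragments or reward groups that support each other.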

If this is right

Well-supported but minority hypotheses can override noisy majorities in evidence-intensive legal tasks.
The approach applies to two evidence-intensive legal benchmarks and can run via simulated annealing or a photonic entropy-quantum machine.
The method is most valuable in low-contamination domains where frontier models have not absorbed the benchmark material.
A single frontier-model adjudication call per question suffices after the optimization step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same optimization framing could extend to other structured reasoning tasks that require distinguishing subtle evidence differences.
Comparing performance on contaminated versus uncontaminated benchmarks would isolate when the method adds value beyond frontier-model knowledge.
Generating fragments with larger local models might strengthen the evidence pools without changing the optimization core.

Load-bearing premise

That quality-derived weights for relevance, specificity, and distinctiveness computed from CoT fragments can be used inside the higher-order binary optimization to correctly identify the strongest evidence set for each hypothesis.

What would settle it

Running the method on the two legal benchmarks and finding that it selects evidence sets leading to lower accuracy than simple majority vote on the same CoT traces would falsify the central claim.
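
The baseline in that comparison is simple to state in code. The sketch below shows majority vote over sampled final answers and a plain accuracy measure; the sample answers and gold labels are invented for illustration and are not taken from either benchmark.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled traces."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of questions where the prediction matches the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical data: three questions, three sampled answers each.
sampled = [["A", "A", "B"], ["C", "B", "B"], ["D", "A", "D"]]
gold = ["B", "B", "D"]

mv_preds = [majority_vote(s) for s in sampled]
mv_acc = accuracy(mv_preds, gold)
```

In the first hypothetical question the correct answer is held by a minority of traces, which is the case majority vote cannot recover and the case the paper's claim turns on.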

Figures

Figures reproduced from arXiv: 2606.06941 by Laura Wynter, Nirvik Sahoo, Paul Griffin.

Original abstract

Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
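
Simulated annealing, one of the two solvers named in the abstract, can be sketched on a toy HUBO instance. The snippet below is a minimal illustration, not the authors' implementation: the term dictionary, the geometric cooling schedule, and every coefficient are made-up assumptions, and the Dirac-3 hardware path is not modelled.

```python
import math
import random


def hubo_energy(x, terms):
    """Energy of assignment x; `terms` maps index tuples to coefficients.

    A term contributes its coefficient only when all of its variables are 1.
    """
    return sum(c for idx, c in terms.items() if all(x[i] for i in idx))


def simulated_annealing(n, terms, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Single-bit-flip annealing with a geometric cooling schedule (assumed)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = hubo_energy(x, terms)
    best_x, best_e = x[:], energy
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one selection bit
        new_e = hubo_energy(x, terms)  # recomputed in full for simplicity
        delta = new_e - energy
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            energy = new_e
            if energy < best_e:
                best_x, best_e = x[:], energy
        else:
            x[i] ^= 1  # reject the move and restore the bit
    return best_x, best_e


# Hypothetical pool of four fragments for one hypothesis.
terms = {
    (0,): -1.0, (1,): -0.9, (2,): -0.8, (3,): -0.2,  # per-fragment rewards
    (0, 1): 1.5,       # fragments 0 and 1 are near-duplicates: penalise both
    (0, 2, 3): -0.5,   # third-order bonus when 0, 2 and 3 appear together
}
best_x, best_e = simulated_annealing(4, terms)
print(best_x, best_e)
```

The third-order term is what makes this a HUBO rather than a QUBO instance; with only linear and pairwise terms the same solver would apply unchanged.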

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EP-HUBO recasts CoT fragment selection as per-hypothesis HUBO with three quality weights, but the abstract supplies no numbers or ablations to show it beats simpler methods.

The paper's main contribution is treating evidence selection from multiple CoT traces as an explicit higher-order binary optimization problem, one per hypothesis, with weights for relevance, specificity, and distinctiveness. This lets a minority hypothesis win if its evidence pool scores higher under the objective. They run the solver both with simulated annealing and on the Dirac-3 photonic machine, and they target legal benchmarks where evidence quality matters more than raw popularity.

That framing is new enough to be worth seeing in full. Most CoT work stays with voting or linear scoring; moving to HUBO per evidence pool is a distinct step, and the legal domain choice fits the motivation.

The description still leaves the central claim untested. No accuracy figures appear, no direct comparison to majority vote or length-based baselines, and no account of how the three weights are computed from the fragments. If those weights are just token overlap or model self-scores, the combinatorial step may add little. The stress-test point stands: without evidence that the optimization actually surfaces stronger minority sets, the advantage over existing aggregators is not shown.

The work is aimed at people building evidence-sensitive reasoning systems or exploring optimization for LLM outputs. A reader already working on CoT aggregation or quantum-inspired solvers would find the formulation useful even before the numbers.

It should go to peer review so the experiments, weight derivation, and any ablations can be checked properly.

Referee Report

3 major / 0 minor

Summary. The paper proposes EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which samples multiple CoT traces from a local model, parses them into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation (HUBO) per pool using quality-derived weights for relevance/specificity/distinctiveness, and delegates final adjudication to a frontier model. The approach is positioned as superior to majority vote for preserving minority-but-correct hypotheses on evidence-intensive legal benchmarks and is demonstrated using both simulated annealing and the Dirac-3 photonic processor.

Significance. If the central claim holds, the work supplies a combinatorial formulation for evidence aggregation that can surface well-supported minority hypotheses without requiring the frontier model to re-process all fragments; the hardware demonstration and focus on low-contamination domains constitute concrete strengths.

major comments (3)

[Abstract] Abstract: the manuscript asserts that quality-derived weights inside the HUBO objective correctly identify the strongest evidence set per hypothesis, yet supplies neither the explicit procedure for computing those weights nor any ablation showing that the selected sets outperform frequency or length baselines.
[Abstract] Abstract and method description: no quantitative results, error bars, or cross-benchmark comparisons are reported, so the claim that EP-HUBO outperforms majority vote on legal-reasoning tasks cannot be evaluated against the stated weakest assumption.
[Method] The optimisation is described as independent of the final adjudication call, but without equations showing how the relevance/specificity/distinctiveness weights are derived solely from fragment statistics (rather than fitted to target labels), the independence cannot be verified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and completeness of the technical details.

Referee: [Abstract] Abstract: the manuscript asserts that quality-derived weights inside the HUBO objective correctly identify the strongest evidence set per hypothesis, yet supplies neither the explicit procedure for computing those weights nor any ablation showing that the selected sets outperform frequency or length baselines.

Authors: We agree the abstract is overly concise. The method section will be expanded with the explicit formulas for the three weights (relevance computed via token overlap with the hypothesis statement, specificity via inverse frequency within the evidence pool, distinctiveness via average pairwise Jaccard distance), all derived from fragment statistics only. An ablation comparing HUBO-selected sets against frequency and length baselines will be added to the experiments section.
revision: yes

Referee: [Abstract] Abstract and method description: no quantitative results, error bars, or cross-benchmark comparisons are reported, so the claim that EP-HUBO outperforms majority vote on legal-reasoning tasks cannot be evaluated against the stated weakest assumption.

Authors: The manuscript reports results on two legal benchmarks, but we acknowledge the absence of error bars and detailed cross-benchmark tables. The revised version will include error bars from repeated sampling runs, quantitative performance tables versus majority vote, and additional benchmark comparisons to allow direct evaluation of the claims.
revision: yes

Referee: [Method] The optimisation is described as independent of the final adjudication call, but without equations showing how the relevance/specificity/distinctiveness weights are derived solely from fragment statistics (rather than fitted to target labels), the independence cannot be verified.

Authors: The weights are computed exclusively from per-fragment statistics without access to ground-truth labels. We will insert the explicit derivation equations in the method section to demonstrate that the HUBO objective depends only on the evidence pool and is therefore independent of the subsequent frontier-model adjudication call.
revision: yes
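
The three weight definitions named in the simulated rebuttal (token overlap, inverse frequency within the pool, average pairwise Jaccard distance) can be made concrete with a short sketch. The rebuttal gives only these verbal descriptions, so everything else here is an assumption: whitespace tokenisation, normalising overlap by hypothesis length, using mean log inverse document frequency for specificity, and the function name `quality_weights`.

```python
import math
from collections import Counter


def tokens(text):
    """Lower-cased whitespace tokens as a set (assumed tokenisation)."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def quality_weights(hypothesis, pool):
    """Return (relevance, specificity, distinctiveness) for each fragment.

    Only fragment statistics are used; no gold labels enter the computation.
    """
    hyp = tokens(hypothesis)
    frag_tokens = [tokens(f) for f in pool]
    # Number of fragments in the pool that contain each token.
    df = Counter(tok for toks in frag_tokens for tok in toks)
    n = len(pool)
    weights = []
    for i, toks in enumerate(frag_tokens):
        # Share of hypothesis tokens that also occur in the fragment.
        relevance = len(toks & hyp) / len(hyp) if hyp else 0.0
        # Mean log inverse document frequency of the fragment's tokens.
        specificity = (
            sum(math.log(n / df[t]) for t in toks) / len(toks) if toks else 0.0
        )
        # Average Jaccard distance to every other fragment in the pool.
        others = [jaccard_distance(toks, frag_tokens[j]) for j in range(n) if j != i]
        distinctiveness = sum(others) / len(others) if others else 0.0
        weights.append((relevance, specificity, distinctiveness))
    return weights


# Hypothetical hypothesis statement and evidence pool.
pool = [
    "the contract lacked consideration",
    "no valid offer was ever communicated",
]
hypothesis = "the contract is void for lack of consideration"
for frag, w in zip(pool, quality_weights(hypothesis, pool)):
    print(frag, [round(v, 3) for v in w])
```

Because every quantity is computed from the hypothesis text and the pool alone, a sketch of this kind is consistent with the stated independence from the adjudication call, although the paper's actual formulas may differ.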

Circularity Check

0 steps flagged

No circularity: method described as independent combinatorial aggregation

full rationale

The abstract and description present EP-HUBO as generating CoT traces, parsing evidence pools, applying quality-derived weights (relevance, specificity, distinctiveness) inside a higher-order binary optimization, and delegating final adjudication. No equations, self-citations, or definitions are supplied that would make the optimization output equivalent to its inputs by construction, nor is any weight computation shown to be fitted to target labels or to the final hypothesis selection. The derivation chain therefore remains self-contained against external benchmarks and does not reduce to renaming, self-definition, or load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or detailed axioms beyond the domain assumption that the chosen weights and HUBO solver will surface stronger evidence; full paper would be needed to audit any fitted constants or unstated modeling choices.

axioms (1)

Domain assumption: Quality-derived weights for relevance, specificity and distinctiveness can be computed from CoT fragments such that the resulting higher-order binary optimization selects the strongest evidence set.
This premise is required for the optimization step to improve over majority vote.

pith-pipeline@v0.9.1-grok · 5799 in / 1291 out tokens · 30335 ms · 2026-06-27T21:55:43.979812+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 1 canonical work pages

[1] C. Flores-Garrigos, G. Dev, M. Falkenthal, A. Gomez Cadavid, A. Simen, S. Kumar, E. Solano, and N. N. Hegade, “Quantum Combinatorial Reasoning for Large Language Models,” arXiv preprint arXiv:2510.24509, 2025.
[2] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” arXiv preprint arXiv:2203.11171, 2022.
[3] L. Nguyen, M.-A. Miri, R. J. Rupert, W. Dyk, S. Wu, N. Vrahoretis, I. Huang, M. Begliarbekov, N. Chancellor, U. Chukwu, P. Mahamuni, C. Martinez-Delgado, D. Haycraft, C. Spear, J. R. Huffman, Y. M. Sua, and Y.-P. Huang, “Entropy computing, a paradigm for optimization in open photonic systems,” Communications Physics, vol. 8, article 411, 2025. doi: 10.1038/s42005-025-02324-6
[4] A. Feng, M. Alonso, and A. Odonnat, “Optimal Self-Consistency for Efficient Reasoning with Large Language Models,” arXiv preprint arXiv:2511.12309, 2025.
[5] Z. Kang, X. Zhao, and D. Song, “Scalable Best-of-N Selection for Large Language Models via Self-Certainty,” arXiv preprint arXiv:2502.18581, 2025.
[6] DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025.
[7] W. Yang, S. Ma, Y. Lin, and F. Wei, “Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning,” arXiv preprint arXiv:2502.18080, 2025.
[8] J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou, “Mixture-of-Agents Enhances Large Language Model Capabilities,” arXiv preprint arXiv:2406.04692, 2024.
[9] W. Li, Y. Lin, M. Xia, and C. Jin, “Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?” arXiv preprint arXiv:2502.00674, 2025.
[10] M. Ashiga, W. Jie, F. Wu, V. Voskanyan, F. Dinmohammadi, P. Brookes, J. Gong, and Z. Wang, “Ensemble Learning for Large Language Models in Text and Code Generation: A Survey,” arXiv preprint arXiv:2503.13505, 2025.
[11] M. Esencan, T. A. Kumar, A. A. Asanjan, P. A. Lott, M. Mohseni, C. Unlu, D. Venturelli, and A. Ho, “Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization,” arXiv preprint arXiv:2407.00071, 2024.
[12] H. Zhang, M. Emu, and S. Choudhury, “LLM-QUBO: An End-to-End Framework for Automated QUBO Transformation from Natural Language Problem Descriptions,” arXiv preprint arXiv:2509.00099, 2025.
[13] C. Pomeroy, A. Pramov, K. Thakrar, and L. Yendapalli, “Quantum Annealing for Machine Learning: Applications in Feature Selection, Instance Selection, and Clustering,” arXiv preprint arXiv:2507.15063, 2025.
[14] F. Nausheen, K. Ahmed, M. I. Khan, and F. Riaz, “Quantum Natural Language Processing: A Comprehensive Review of Models, Methods, and Applications,” arXiv preprint arXiv:2504.09909, 2025.
[15] G. Srivastava, S. Cao, and X. Wang, “Towards Reasoning Ability of Small Language Models,” arXiv preprint arXiv:2502.11569, 2025.
[16] L. Stuhlmann, M. F. Argerich, and J. Fürst, “Bench360: Benchmarking Local LLM Inference from 360 Degrees,” arXiv preprint arXiv:2511.16682, 2025.
[17] Y. Fan, J. Ni, J. Merane, Y. Tian, Y. Hermstrüwer, et al., “LEXam: Benchmarking Legal Reasoning on 340 Law Exams,” arXiv preprint arXiv:2505.12864, 2025. Dataset: https://huggingface.co/datasets/LEXam-Benchmark/LEXam
[18] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, A. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen, “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark,” arXiv preprint arXiv:2406.01574, 2024.
[19] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201.11903, 2022.
[20] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” Advances in Neural Information Processing Systems (NeurIPS), 2023.
[21] W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike, “Self-critiquing models for assisting human evaluators,” arXiv preprint arXiv:2206.05802, 2022.
[22] S. Welleck, X. Lu, P. West, F. Brahman, T. Shen, D. Khashabi, and Y. Choi, “Generating Sequences by Learning to Self-Correct,” International Conference on Learning Representations (ICLR), 2023.
[23] D. Roth and W. Yih, “A Linear Programming Formulation for Global Inference in Natural Language Tasks,” Proc. CoNLL, 2004.
[24] S. Riedel and J. Clarke, “Incremental Integer Linear Programming for Non-projective Dependency Parsing,” Proc. EMNLP, 2006.
[25] G. Poesia, A. Polozov, V. Le, A. Tiwari, G. Soares, C. Meek, and S. Gulwani, “Synchromesh: Reliable Code Generation from Pre-trained Language Models,” International Conference on Learning Representations (ICLR), 2022.
[26] V. Niculae, A. F. T. Martins, M. Blondel, and C. Cardie, “SparseMAP: Differentiable Sparse Structured Inference,” International Conference on Machine Learning (ICML), 2018.
[27] E. Boros and A. Gruber, “On Quadratization of Pseudo-Boolean Functions,” International Symposium on Artificial Intelligence and Mathematics, 2014.
[28] O. Sainz, J. A. Campos, I. García-Ferrero, J. Etxaniz, O. Lopez de Lacalle, and E. Agirre, “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark,” Findings of EMNLP, 2023.
[29] I. Magar and R. Schwartz, “Data Contamination: From Memorization to Exploitation,” Proc. ACL, 2022.
[30] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memorization Across Neural Language Models,” International Conference on Learning Representations (ICLR), 2023.
[31] P. Liang, R. Bommasani, T. Lee, et al., “Holistic Evaluation of Language Models,” Transactions on Machine Learning Research, 2023. arXiv preprint arXiv:2211.09110
[32] Y. Perlitz, E. Bandel, A. Gera, O. Arviv, L. Ein-Dor, E. Shnarch, N. Slonim, M. Shmueli-Scheuer, and L. Choshen, “Efficient Benchmarking of Language Models,” arXiv preprint arXiv:2308.11696, 2023.
[33] R. Dror, G. Baumer, S. Shlomov, and R. Reichart, “The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing,” Proc. ACL, 2018.
[34] B. Hajek, “Cooling Schedules for Optimal Annealing,” Mathematics of Operations Research, vol. 13, no. 2, pp. 311–329, 1988.
[35] S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741, 1984.

A Prompt Templates. We reproduce the prompt templates used in each phase. Variable placeholders are written in <angle brackets>. All prompts are release...
[36]

=== Evidence supporting (B) ===

<fragment_A_1> ... === Evidence supporting (B) === ... Which answer is most strongly supported? Phase 4 (zero-shot baseline).The ZS baseline uses the same Phase 4 system message with the evidence blocks omitted (the user message contains only the question and options). 21 B Additional Details Hyperparameters used throughout the paper are listed in Table 1...

2000
