Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces
Pith reviewed 2026-06-27 21:55 UTC · model grok-4.3
The pith
Treating chain-of-thought selection as higher-order binary optimization preserves minority hypotheses in legal reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EP-HUBO generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights, and delegates a single adjudication call per question to a frontier model. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
What carries the argument
Evidence Pool Higher-Order Binary Optimisation (EP-HUBO), which formulates evidence selection as a HUBO problem with quality-derived weights for relevance, specificity, and distinctiveness.
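For orientation, a generic higher-order unconstrained binary optimisation objective over a pool of $n$ fragments can be written as below. This is the standard textbook form, not the paper's own objective: the summary here does not reproduce the equations, so the coefficient symbols $a_i$, $b_{ij}$, $c_{ijk}$ and the truncation at third order are assumptions made purely for illustration.

```latex
\min_{x \in \{0,1\}^n} \; E(x)
  = \sum_{i} a_i \, x_i
  + \sum_{i<j} b_{ij} \, x_i x_j
  + \sum_{i<j<k} c_{ijk} \, x_i x_j x_k
```

Here $x_i = 1$ means fragment $i$ is kept in the evidence set. One plausible reading is that the linear terms carry per-fragment quality and the higher-order terms capture interactions among fragments, but how the relevance, specificity, and distinctiveness weights map onto the coefficients is not stated in this summary.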
If this is right
- Well-supported but minority hypotheses can override noisy majorities in evidence-intensive legal tasks.
- The approach applies to two evidence-intensive legal benchmarks and can run via simulated annealing or a photonic entropy-quantum machine.
- The method is most valuable in low-contamination domains where frontier models have not absorbed the benchmark material.
- A single frontier-model adjudication call per question suffices after the optimization step.
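The first point in the list above can be made concrete with a toy contrast between majority vote and a score-based choice. This is a minimal sketch on made-up data: the function `best_supported`, the score values, and the "highest total score wins" rule are hypothetical stand-ins, since EP-HUBO delegates the final decision to a frontier model rather than taking a maximum.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer label among the sampled traces."""
    return Counter(answers).most_common(1)[0][0]

def best_supported(pool_scores):
    """Return the hypothesis whose evidence set has the highest total score.

    Hypothetical stand-in for adjudication; not the paper's decision rule.
    """
    return max(pool_scores, key=lambda h: sum(pool_scores[h]))

# Made-up data: five traces, three answer (A) and two answer (B).
answers = ["A", "A", "A", "B", "B"]

# Made-up quality scores of the fragments kept for each hypothesis.
pool_scores = {"A": [0.2, 0.1, 0.2], "B": [0.9, 0.8]}

print(majority_vote(answers))       # A
print(best_supported(pool_scores))  # B
```

The popular answer (A) rests on three weak fragments, while the minority answer (B) rests on two strong ones, which is the situation the method is meant to detect.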
Where Pith is reading between the lines
- The same optimization framing could extend to other structured reasoning tasks that require distinguishing subtle evidence differences.
- Comparing performance on contaminated versus uncontaminated benchmarks would isolate when the method adds value beyond frontier-model knowledge.
- Generating fragments with larger local models might strengthen the evidence pools without changing the optimization core.
Load-bearing premise
That quality-derived weights for relevance, specificity, and distinctiveness computed from CoT fragments can be used inside the higher-order binary optimization to correctly identify the strongest evidence set for each hypothesis.
What would settle it
Running the method on the two legal benchmarks and finding that it selects evidence sets leading to lower accuracy than simple majority vote on the same CoT traces would falsify the central claim.
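One way such a comparison could be scored is a paired bootstrap over questions, which yields an interval for the accuracy difference between the two aggregators on the same traces. The sketch below is an assumption about how the test might be run, not the paper's protocol; the function name `paired_bootstrap`, the resample count, and the correctness flags are all hypothetical placeholders, not reported results.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Percentile bootstrap interval for accuracy(a) - accuracy(b).

    correct_a and correct_b hold 0/1 correctness flags for the same questions.
    Returns (observed difference, lower 2.5% bound, upper 97.5% bound).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper

# Placeholder flags for eight questions (illustrative only).
ep_hubo_flags = [1, 1, 0, 1, 1, 0, 1, 1]
majority_flags = [1, 0, 0, 1, 0, 0, 1, 1]
observed, lower, upper = paired_bootstrap(ep_hubo_flags, majority_flags)
print(observed, lower, upper)
```

An interval lying entirely below zero on both benchmarks would be the falsifying outcome described above; an interval straddling zero would leave the claim unsettled.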
Original abstract
Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
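The classical solver mentioned in the abstract, simulated annealing, can be sketched for a small HUBO instance as follows. This is a minimal illustration under stated assumptions: the single-bit-flip moves, the geometric cooling schedule, the step count, and the four-fragment coefficient table are all hypothetical, and nothing here models the Dirac-3 hardware or the paper's actual solver settings.

```python
import math
import random

def hubo_energy(x, terms):
    """Energy of bit vector x; terms maps index tuples to coefficients."""
    return sum(c for idx, c in terms.items() if all(x[i] for i in idx))

def simulated_annealing(n, terms, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Minimise a HUBO with single-bit flips and geometric cooling."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = hubo_energy(x, terms)
    best_x, best_e = x[:], energy
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one bit
        new_e = hubo_energy(x, terms)
        accept = new_e <= energy or rng.random() < math.exp((energy - new_e) / temp)
        if accept:
            energy = new_e
            if energy < best_e:
                best_x, best_e = x[:], energy
        else:
            x[i] ^= 1  # undo the rejected flip
    return best_x, best_e

# Hypothetical pool of four fragments: negative linear terms reward quality,
# the pair term and the cubic term penalise selecting overlapping fragments.
terms = {
    (0,): -1.0, (1,): -0.8, (2,): -0.6, (3,): -0.1,
    (0, 1): 0.5,
    (0, 1, 2): 1.5,
}
best_x, best_e = simulated_annealing(4, terms)
print(best_x, best_e)
```

Storing coefficients in a dictionary keyed by index tuples keeps linear, quadratic, and cubic terms in one structure, so the energy function needs no special case for the higher-order interactions.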
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which samples multiple CoT traces from a local model, parses them into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation (HUBO) per pool using quality-derived weights for relevance/specificity/distinctiveness, and delegates final adjudication to a frontier model. The approach is positioned as superior to majority vote for preserving minority-but-correct hypotheses on evidence-intensive legal benchmarks and is demonstrated using both simulated annealing and the Dirac-3 photonic processor.
Significance. If the central claim holds, the work supplies a combinatorial formulation for evidence aggregation that can surface well-supported minority hypotheses without requiring the frontier model to re-process all fragments; the hardware demonstration and focus on low-contamination domains constitute concrete strengths.
major comments (3)
- [Abstract] The manuscript asserts that quality-derived weights inside the HUBO objective correctly identify the strongest evidence set per hypothesis, yet supplies neither the explicit procedure for computing those weights nor any ablation showing that the selected sets outperform frequency or length baselines.
- [Abstract] Abstract and method description: no quantitative results, error bars, or cross-benchmark comparisons are reported, so the claim that EP-HUBO outperforms majority vote on legal-reasoning tasks cannot be evaluated against the stated weakest assumption.
- [Method] The optimisation is described as independent of the final adjudication call, but without equations showing how the relevance/specificity/distinctiveness weights are derived solely from fragment statistics (rather than fitted to target labels), the independence cannot be verified.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and completeness of the technical details.
- Referee: [Abstract] The manuscript asserts that quality-derived weights inside the HUBO objective correctly identify the strongest evidence set per hypothesis, yet supplies neither the explicit procedure for computing those weights nor any ablation showing that the selected sets outperform frequency or length baselines.
  Authors: We agree the abstract is overly concise. The method section will be expanded with the explicit formulas for the three weights (relevance computed via token overlap with the hypothesis statement, specificity via inverse frequency within the evidence pool, distinctiveness via average pairwise Jaccard distance), all derived from fragment statistics only. An ablation comparing HUBO-selected sets against frequency and length baselines will be added to the experiments section. Revision: yes.
- Referee: [Abstract] Abstract and method description: no quantitative results, error bars, or cross-benchmark comparisons are reported, so the claim that EP-HUBO outperforms majority vote on legal-reasoning tasks cannot be evaluated against the stated weakest assumption.
  Authors: The manuscript reports results on two legal benchmarks, but we acknowledge the absence of error bars and detailed cross-benchmark tables. The revised version will include error bars from repeated sampling runs, quantitative performance tables versus majority vote, and additional benchmark comparisons to allow direct evaluation of the claims. Revision: yes.
- Referee: [Method] The optimisation is described as independent of the final adjudication call, but without equations showing how the relevance/specificity/distinctiveness weights are derived solely from fragment statistics (rather than fitted to target labels), the independence cannot be verified.
  Authors: The weights are computed exclusively from per-fragment statistics without access to ground-truth labels. We will insert the explicit derivation equations in the method section to demonstrate that the HUBO objective depends only on the evidence pool and is therefore independent of the subsequent frontier-model adjudication call. Revision: yes.
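The three weights named in the first response (token overlap, inverse frequency within the pool, average pairwise Jaccard distance) could be computed from fragment text alone along the following lines. This is a hedged sketch, not the authors' formulas: whitespace tokenisation, the normalisations, and the names `tokens`, `jaccard_distance`, and `fragment_weights` are assumptions, because the rebuttal names the measures without giving equations.

```python
from collections import Counter

def tokens(text):
    """Lower-cased whitespace tokens as a set (assumed tokenisation)."""
    return set(text.lower().split())

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def fragment_weights(hypothesis, fragments):
    """Per-fragment relevance, specificity, distinctiveness; no labels used."""
    hyp = tokens(hypothesis)
    frag_tokens = [tokens(f) for f in fragments]
    # Number of fragments in the pool that contain each token.
    doc_freq = Counter(t for ft in frag_tokens for t in ft)
    weights = []
    for i, ft in enumerate(frag_tokens):
        # Relevance: share of hypothesis tokens that appear in the fragment.
        relevance = len(ft & hyp) / len(hyp) if hyp else 0.0
        # Specificity: mean inverse pool frequency of the fragment's tokens.
        specificity = sum(1.0 / doc_freq[t] for t in ft) / len(ft) if ft else 0.0
        # Distinctiveness: mean Jaccard distance to every other fragment.
        others = [jaccard_distance(ft, other)
                  for j, other in enumerate(frag_tokens) if j != i]
        distinctiveness = sum(others) / len(others) if others else 0.0
        weights.append({"relevance": relevance,
                        "specificity": specificity,
                        "distinctiveness": distinctiveness})
    return weights

# Made-up pool for illustration.
pool = ["the contract is void", "the contract is void", "damages are due"]
for w in fragment_weights("contract void", pool):
    print(w)
```

Because the function takes only the hypothesis statement and the fragment strings, it has no path by which ground-truth labels could enter, which is the independence property the third response asserts.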
Circularity Check
No circularity: method described as independent combinatorial aggregation
The abstract and description present EP-HUBO as generating CoT traces, parsing evidence pools, applying quality-derived weights (relevance, specificity, distinctiveness) inside a higher-order binary optimization, and delegating final adjudication. No equations, self-citations, or definitions are supplied that would make the optimization output equivalent to its inputs by construction, nor is any weight computation shown to be fitted to target labels or to the final hypothesis selection. The derivation chain therefore remains self-contained against external benchmarks and does not reduce to renaming, self-definition, or load-bearing self-citation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: quality-derived weights for relevance, specificity, and distinctiveness can be computed from CoT fragments such that the resulting higher-order binary optimization selects the strongest evidence set.