pith. machine review for the scientific record.

arxiv: 2605.06219 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

Yunzhen Yao, Hongye Wang, Yahong Wang, Michael C. Gastpar, Bo Jiang, Lie He

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords test-time aggregation · LLM reasoning · energy minimization · Ising model · joint consistency · pairwise comparisons · majority voting · reasoning benchmarks

The pith

Joint Consistency aggregates LLM reasoning traces by minimizing an energy function that accounts for interactions between candidates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Joint Consistency to combine multiple reasoning traces generated by large language models at test time. It models aggregation as minimizing an energy function where individual trace evaluations act as external fields and pairwise comparisons act as interactions between traces. This unifies simpler methods like majority voting as special cases where interactions are ignored. Experiments show consistent gains over baselines on math and code reasoning benchmarks across different trace counts, judge models, and generation settings. Readers would care because it incorporates comparative information that isolated evaluation methods miss, offering a more principled route to reliable LLM answers.

Core claim

Joint Consistency formulates test-time aggregation as a constrained Ising-type energy minimization problem, where independent evaluation signals serve as external fields and pairwise comparisons from an LLM judge serve as interaction terms. This framework subsumes existing voting and weighted aggregation methods as special cases under particular choices of the interaction matrix. An efficient approximation strategy is developed to make the modeling practical for large numbers of traces.
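The formulation can be sketched in a few lines. This is a toy reconstruction from the abstract, not the authors' code: the spin convention (s_i = +1 keeps trace i), the field scores h, the interaction matrix J, and the weight mu are all hypothetical placeholders for the paper's construction.

```python
import itertools

def jc_energy(s, h, J, mu=0.5):
    """Ising-type energy over spins s (one +/-1 value per trace):
    -sum_i h_i*s_i      (external fields from isolated trace scores)
    -mu*sum_{i<j} J_ij*s_i*s_j  (pairwise judge comparisons)."""
    n = len(s)
    fields = -sum(h[i] * s[i] for i in range(n))
    pairs = -mu * sum(J[i][j] * s[i] * s[j]
                      for i in range(n) for j in range(i + 1, n))
    return fields + pairs

def minimize(h, J, mu=0.5):
    """Exhaustive search over the 2^n spin assignments; the paper's
    approximation strategy would replace this for large n."""
    n = len(h)
    return min(itertools.product([-1, 1], repeat=n),
               key=lambda s: jc_energy(s, h, J, mu))

# Hypothetical numbers: traces 0 and 1 agree with each other (J > 0)
# and both disagree with trace 2, which has the best isolated score.
h = [0.6, 0.5, 0.7]
J = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
print(minimize(h, J))  # -> (1, 1, -1): the mutually consistent pair wins
```

Note how the interaction terms overturn the field-only ranking: trace 2 has the highest isolated score, but the two mutually consistent traces are selected, which is exactly the comparative information that isolated evaluation misses.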

What carries the argument

A constrained Ising-type energy minimization problem in which external fields come from independent trace evaluations and interactions come from pairwise LLM judge comparisons.

Load-bearing premise

The LLM judge's pairwise comparisons provide meaningful information about the relative consistency or correctness of the candidate answers.

What would settle it

Replacing the interaction matrix with random values or zeros on the same benchmarks and finding that Joint Consistency no longer outperforms majority voting would falsify the contribution of the interaction terms.
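A minimal sketch of that falsification test, using a toy brute-force version of the energy minimization (the spin convention, the fields h, and mu are hypothetical, not the paper's implementation):

```python
import itertools
import random

def minimize_energy(h, J, mu=0.5):
    """Brute-force minimizer of the toy Ising-type energy
    E(s) = -sum_i h_i*s_i - mu*sum_{i<j} J_ij*s_i*s_j."""
    n = len(h)
    def energy(s):
        return (-sum(h[i] * s[i] for i in range(n))
                - mu * sum(J[i][j] * s[i] * s[j]
                           for i in range(n) for j in range(i + 1, n)))
    return min(itertools.product([-1, 1], repeat=n), key=energy)

n = 4
h = [0.5] * n  # identical fields: no isolated signal to distinguish traces
zero_J = [[0.0] * n for _ in range(n)]
rng = random.Random(0)
rand_J = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

# With J = 0 the spins just follow the (positive) fields, i.e. plain voting:
assert minimize_energy(h, zero_J) == (1,) * n
# With a random J the selection is driven by noise, so any benchmark gain
# surviving this substitution could not be credited to the judge:
print(minimize_energy(h, rand_J))
```

Running the full benchmarks with `zero_J` and `rand_J` in place of the judge-derived matrix, and comparing against majority voting, is the experiment the falsification test describes.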

Figures

Figures reproduced from arXiv: 2605.06219 by Bo Jiang, Hongye Wang, Lie He, Michael C. Gastpar, Yahong Wang, Yunzhen Yao.

Figure 1
Figure 1. TTA methods on crowdsourced traces for Q28 from the HMMT 2025 (Nov) dataset.
Figure 3
Figure 3. Sensitivity of Joint Consistency to µ on the MathArena-C dataset. Across diverse judge models, JC performs best with a moderate µ ∈ [0.5, 1], suggesting that µ does not require judge-model-specific tuning.
Figure 2
Figure 2. Accuracy–cost trade-off under κ-approximation on MathArena-C. Only evaluation costs are included.
Figure 5
Figure 5. Pass@1 accuracies of reasoning traces from 57 models on HMMT-2025 (Feb).
original abstract

This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely on evaluation signals collected from candidate traces in isolation or answer frequencies, while ignoring comparative interactions among candidates. We propose Joint Consistency (JC), formulated as a constrained Ising-type energy minimization problem, where independent evaluation signals act as external fields and pairwise comparisons act as interactions. JC provides a unified framework for test-time aggregation that subsumes existing voting and weighted aggregation methods as special cases. Our construction of the interaction matrix leverages LLM-as-a-judge comparisons, and admits a theoretical interpretation under answer-level homogeneity assumptions. Moreover, we develop an efficient approximation strategy that makes interaction modeling practical for large-scale test-time aggregation. Experiments on math and code reasoning benchmarks show that JC consistently outperforms existing baselines across tasks, judge models, trace budgets, and trace-generation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Joint Consistency (JC), a test-time aggregation method for LLM reasoning traces formulated as a constrained Ising-type energy minimization problem. Independent evaluation signals serve as external fields, while pairwise comparisons from LLM-as-a-judge provide the interaction terms in the energy function. The framework is claimed to subsume voting and weighted aggregation as special cases, supported by an efficient approximation algorithm. Experiments on math and code reasoning benchmarks indicate consistent outperformance over baselines under various conditions including different judge models and trace budgets.

Significance. Should the central claims hold, this work provides a novel energy-based unification of test-time aggregation techniques, potentially enabling more effective use of multiple reasoning traces by explicitly modeling their interactions. This could have implications for improving the reliability of LLM outputs in complex reasoning tasks, moving beyond frequency-based or isolated scoring methods.

major comments (1)
  1. [Abstract and method section] The unification claim that JC subsumes voting and weighted aggregation methods as special cases is load-bearing for the paper's positioning as a 'unified framework'. While the abstract notes a theoretical interpretation under answer-level homogeneity assumptions, the manuscript should provide an explicit derivation showing the parameter settings (e.g., interaction matrix J = 0) that recover these baselines, and analyze sensitivity to violations of the homogeneity assumption, given that reasoning traces for the same answer often vary in structure and quality.
minor comments (2)
  1. [Approximation strategy] The description of the efficient approximation strategy would benefit from pseudocode or a complexity analysis to make the practical implementation clearer.
  2. [Experiments] Experimental tables should report variance or confidence intervals alongside mean performance to support the 'consistent outperformance' claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The unification claim is indeed central to positioning JC as a framework, and we address the request for greater explicitness below. We will revise the manuscript to strengthen this aspect while preserving the original technical contributions.

point-by-point responses
  1. Referee: [Abstract and method section] The unification claim that JC subsumes voting and weighted aggregation methods as special cases is load-bearing for the paper's positioning as a 'unified framework'. While the abstract notes a theoretical interpretation under answer-level homogeneity assumptions, the manuscript should provide an explicit derivation showing the parameter settings (e.g., interaction matrix J = 0) that recover these baselines, and analyze sensitivity to violations of the homogeneity assumption, given that reasoning traces for the same answer often vary in structure and quality.

    Authors: We agree that an explicit derivation will improve clarity. In the revised manuscript we will add a dedicated paragraph in Section 3 deriving the special cases: setting the interaction matrix J identically to zero recovers unweighted majority voting (the external fields then reduce to per-answer scores), while appropriate scaling of the fields recovers weighted aggregation. This derivation holds exactly under the stated answer-level homogeneity assumption. For sensitivity to violations of homogeneity, our existing experiments already provide supporting evidence: performance gains remain consistent across trace-generation settings that produce structurally diverse reasoning traces for the same answer (see Tables 2–4 and the ablation on trace budgets). We will add a short discussion paragraph acknowledging that strong violations could in principle degrade the interaction terms and noting that the efficient approximation algorithm remains well-defined regardless. A full formal sensitivity analysis lies beyond the current scope but can be pursued in follow-up work; the empirical robustness already demonstrated mitigates the practical concern.

    revision: partial
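The claimed reductions are easy to check numerically in a toy answer-level setting (hypothetical scores, not the paper's exact construction): with J set identically to zero the energy decouples across traces, so minimization collapses to summing each answer's field scores, which is weighted aggregation in general and unweighted majority voting when every h_i = 1.

```python
from collections import Counter

def zero_J_winner(answers, h):
    """With the interaction matrix zeroed, the Ising-type energy
    decouples, so the minimizing answer is simply the one with the
    largest total field score."""
    totals = {}
    for ans, hi in zip(answers, h):
        totals[ans] = totals.get(ans, 0.0) + hi
    return max(totals, key=totals.get)

answers = ["42", "42", "17", "42", "17"]

# Uniform fields h_i = 1 recover plain majority voting:
assert zero_J_winner(answers, [1.0] * 5) == Counter(answers).most_common(1)[0][0]

# Judge-derived fields recover weighted aggregation instead:
print(zero_J_winner(answers, [0.2, 0.1, 0.9, 0.1, 0.8]))  # -> 17
```

In the weighted case the minority answer wins because its two traces carry most of the field mass, which is exactly the distinction between the two special cases the derivation should make explicit.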

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper formulates Joint Consistency as a constrained Ising-type energy minimization with external fields from independent signals and interactions from LLM-as-a-judge pairwise comparisons. It states that this subsumes voting and weighted aggregation as special cases via appropriate parameter settings in the interaction matrix, and provides a theoretical interpretation only under explicit answer-level homogeneity assumptions. No equations or steps in the abstract reduce a claimed result or performance metric back to a fitted input or self-definition by construction. The unification is presented as a modeling choice rather than a forced equivalence, and empirical results on benchmarks are independent of any self-citation chain. The derivation remains self-contained against external benchmarks with no load-bearing self-citations or ansatz smuggling identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central construction rests on the answer-level homogeneity assumption that allows the interaction matrix to be interpreted theoretically; the efficient approximation strategy is presented without derivation details in the abstract.

axioms (1)
  • domain assumption: answer-level homogeneity
    Invoked to give a theoretical interpretation to the interaction matrix construction.

pith-pipeline@v0.9.0 · 5457 in / 1266 out tokens · 35988 ms · 2026-05-08T10:02:55.959099+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    2025 AIME I

    Art of Problem Solving. 2025 AIME I. Art of Problem Solving Wiki, 2025. Accessed: 2025

  2. [2]

    2025 AIME II

    Art of Problem Solving. 2025 AIME II. Art of Problem Solving Wiki, 2025. Accessed: 2025

  3. [3]

    Inference-time scaling for complex tasks: Where we stand and what lies ahead

    Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, et al. Inference-time scaling for complex tasks: Where we stand and what lies ahead. arXiv preprint arXiv:2504.00294, 2025

  4. [4]

    Matharena: Evaluating llms on uncontaminated math competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2025

  5. [5]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  6. [6]

    Sets: Leveraging self-verification and self-correction for improved test-time scaling

    Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, and Sercan Ö Arık. Sets: Leveraging self-verification and self-correction for improved test-time scaling. arXiv preprint arXiv:2501.19306, 2025

  7. [7]

    Universal self-consistency for large language models

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language models. In ICML Workshop on In-Context Learning, 2024

  8. [8]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 2017

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  11. [11]

    Llm self-correction with decrim: Decompose, critique, and refine for enhanced following of instructions with multiple constraints

    Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, and Nanyun Peng. Llm self-correction with decrim: Decompose, critique, and refine for enhanced following of instructions with multiple constraints. In Findings of the Association for Computational Linguistics: E...

  12. [12]

    Deep think with confidence

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025

  13. [13]

    Critic: Large language models can self-correct with tool-interactive critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Nan Duan, Weizhu Chen, et al. Critic: Large language models can self-correct with tool-interactive critiquing. In International Conference on Learning Representations, 2024

  14. [14]

    CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

    Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024

  15. [15]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  17. [17]

    Reward reasoning models

    Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Reward reasoning models. In Neural Information Processing Systems, 2025

  18. [18]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Conference on Empirical Methods in Natural Language Processing, 2023

  19. [19]

    Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment

    Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Akshay Krishnamurthy, and Dylan J Foster. Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment. In Forty-second International Conference on Machine Learning, 2025

  20. [20]

    Regularized best-of-n sampling with minimum bayes risk objective for language model alignment

    Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, and Kenshi Abe. Regularized best-of-n sampling with minimum bayes risk objective for language model alignment. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025

  21. [21]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  22. [22]

    Scalable best-of-n selection for large language models via self-certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581, 2025

  23. [23]

    Semantic self-consistency: Enhancing language model reasoning via semantic weighting

    Tim Knappe, Ryan Luo Li, Ayush Chauhan, Kaylee Chhua, Kevin Zhu, and Sean O’Brien. Semantic self-consistency: Enhancing language model reasoning via semantic weighting. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS, 2024

  24. [24]

    Evolving deeper LLM thinking

    Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, and Xinyun Chen. Evolving deeper llm thinking. arXiv preprint arXiv:2501.09891, 2025

  25. [25]

    PairJudge RM: Perform best-of-N sampling with knockout tournament

    Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Pairjudge rm: Perform best-of-n sampling with knockout tournament. arXiv preprint arXiv:2501.13007, 2025

  26. [26]

    Large language model guided tree-of-thought

    Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023

  27. [27]

    Ising formulations of many NP problems

    Andrew Lucas. Ising formulations of many NP problems. Frontiers in Physics, 2:5, 2014

  28. [28]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 2023

  29. [29]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025

  30. [30]

    Introducing gpt-5.2

    OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-2/, December 2025. Accessed: 2026-05-05

  31. [31]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022

  32. [32]

    Recursive introspection: Teaching language model agents how to self-improve

    Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 2024

  33. [33]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  34. [34]

    How do large language monkeys get their power (laws)?

    Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. How do large language monkeys get their power (laws)? In Forty-second International Conference on Machine Learning, 2025

  35. [35]

    Reasoning under uncertainty: Efficient llm inference via unsupervised confidence dilution and convergent adaptive sampling

    Zhenning Shi, Yijia Zhu, Yi Xie, Junhan Shi, Guorui Xie, Haotian Zhang, Yong Jiang, Congcong Miao, and Qing Li. Reasoning under uncertainty: Efficient llm inference via unsupervised confidence dilution and convergent adaptive sampling. In Conference on Empirical Methods in Natural Language Processing, 2025

  36. [36]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  37. [37]

    Confidence improves self-consistency in LLMs

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. arXiv preprint arXiv:2502.06233, 2025

  38. [38]

    Logical reasoning with outcome reward models for test-time scaling

    Ramya Keerthy Thatikonda, Wray Buntine, and Ehsan Shareghi. Logical reasoning with outcome reward models for test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26113–26123, 2025

  39. [39]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Conference on Empirical Methods in Natural Language Processing, 2023

  40. [40]

    Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling

    Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling. In Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3613–3635, 2025

  41. [41]

    Alphazero-like tree-search can guide large language model decoding and training

    Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. In International Conference on Machine Learning, 2024

  42. [42]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

  43. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  44. [44]

    From decoding to meta-generation: Inference-time algorithms for large language models

    Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. From decoding to meta-generation: Inference-time algorithms for large language models. Transactions on Machine Learning Research, 2024

  45. [45]

    Large language models are better reasoners with self-verification

    Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP, 2023

  46. [46]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In International Conference on Learning Representations, 2024

  47. [47]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  48. [48]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 2023

  49. [49]

    Confidence-aware reasoning: Optimizing self-guided thinking trajectories in large reasoning models

    Jiaxin Zhang. Confidence-aware reasoning: Optimizing self-guided thinking trajectories in large reasoning models. In Conference on Empirical Methods in Natural Language Processing: Industry Track, 2025

  50. [50]

    Sample, scrutinize and scale: Effective inference-time search by scaling verification

    Eric Zhao, Pranjal Awasthi, and Sreenivas Gollapudi. Sample, scrutinize and scale: Effective inference-time search by scaling verification. In Forty-second International Conference on Machine Learning, 2025

  51. [51]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 2023

  52. [52]

    Evaluating judges as evaluators: The jetts benchmark of llm-as-judges as test-time scaling evaluators

    Yilun Zhou, Austin Xu, PeiFeng Wang, Caiming Xiong, and Shafiq Joty. Evaluating judges as evaluators: The jetts benchmark of llm-as-judges as test-time scaling evaluators. In International Conference on Machine Learning, 2025

  53. [53]

    A theoretical study on bridging internal probability and self-consistency for llm reasoning

    Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, and Xiaoxing Ma. A theoretical study on bridging internal probability and self-consistency for llm reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  54. [55]

    Reasoning: \n\n

    Is the reasoning process correct? Please choose an evaluation score among 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. Please only output only the evaluation score. E.1.2 Ising-J Prompt user Suppose there are two responses to the same question. Please output the probability that Response 1 is a better answer than Response 2. #### Question #### {qu...

  55. [56]

    Is the answer correct?

  56. [57]

    Reasoning: \n\n

    Is the reasoning process correct? Please choose an evaluation score among 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. Please only output only the evaluation score. E.2.2 Ising-J Prompt user Suppose there are two responses to the same Python function and input. Please output the probability that Response 1 is a better answer than Response 2. #### ...