OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
Pith reviewed 2026-05-09 22:01 UTC · model grok-4.3
The pith
The OptiVerse benchmark shows LLM accuracy dropping below 27 percent on hard optimization problems, with modeling and logic errors as the dominant failure mode.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OptiVerse supplies 1,000 curated problems across Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, each at Easy, Medium, or Hard difficulty. Evaluation of 22 LLMs shows rapid performance decline on Hard instances, where even GPT-5.2 and Gemini-3 stay below 27 percent accuracy. Error analysis isolates modeling and logic mistakes as the leading failure mode. The Dual-View Auditor Agent is introduced to audit the modeling stage from two complementary views, raising accuracy on these tasks without large added cost.
What carries the argument
Dual-View Auditor Agent, which examines an LLM's formulation of an optimization problem from two perspectives to catch and correct modeling and logic mistakes before solving proceeds.
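The review text does not spell out what the two views are. A minimal sketch, assuming one view checks the formulation against the problem narrative (forward) and the other checks the formulation's internal consistency (backward); the schema, function names, and keyword list are all hypothetical, not from the paper:

```python
import re
from dataclasses import dataclass

@dataclass
class Formulation:
    """Candidate model produced by an LLM (hypothetical schema)."""
    variables: set    # declared decision variables, e.g. {"x", "y"}
    constraints: list # textual constraints, e.g. ["x + y <= 10"]
    objective: str    # e.g. "3*x + 2*y"

def names(expr):
    """Identifiers appearing in an expression string."""
    return set(re.findall(r"[A-Za-z_]\w*", expr))

def forward_audit(problem_text, f):
    """View 1: key quantities in the narrative should surface in the model."""
    keywords = {"budget", "capacity", "demand"}  # illustrative, domain-specific
    mentioned = {k for k in keywords if k in problem_text.lower()}
    covered = set().union(*(names(c) for c in f.constraints), names(f.objective))
    return [f"'{k}' in text but not modeled" for k in mentioned - covered]

def backward_audit(f):
    """View 2: the model must be self-consistent -- no undeclared identifiers."""
    used = set().union(*(names(c) for c in f.constraints), names(f.objective))
    return [f"undeclared identifier '{n}'" for n in sorted(used - f.variables)]

def audit(problem_text, f):
    """Combine both views; a non-empty report would trigger re-modeling."""
    return forward_audit(problem_text, f) + backward_audit(f)
```

The key design point is that both views run before any solver is invoked, so a caught modeling error costs one extra LLM round-trip rather than a wasted solve.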
If this is right
- LLMs can be evaluated on optimization tasks that involve uncertainty and time dependence rather than only static math programs.
- Modeling and logic errors become the explicit target for future LLM improvements in constraint-heavy domains.
- The auditor agent supplies a lightweight add-on that raises success rates on existing models without retraining.
- Performance gaps on hard problems point to the need for better handling of dynamic constraints and stochastic elements.
Where Pith is reading between the lines
- Training data for future LLMs could be expanded with explicit examples of correct optimization modeling to shrink the identified error type.
- The same dual-view checking idea could be tested on other reasoning tasks that require precise formulation, such as scientific hypothesis building.
- Continued use of OptiVerse may reveal whether the accuracy ceiling rises with model scale or requires architectural changes.
Load-bearing premise
The 1,000 selected problems stand in for the full range of neglected optimization domains, and the scoring rules isolate modeling and logic skill without bias from wording or prompt choice.
What would settle it
Apply the Dual-View Auditor Agent to a fresh collection of hard problems drawn from the same domains and check whether accuracy rises above 27 percent on a majority of cases while runtime stays nearly unchanged.
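The settling criterion above can be phrased as an executable check. A sketch with assumed thresholds (27 percent accuracy floor, at most 10 percent runtime overhead) and a hypothetical solver interface:

```python
import time

def settles(solver, problems, baseline_runtime,
            accuracy_floor=0.27, overhead_cap=1.10):
    """Does the audited pipeline clear the accuracy floor at near-baseline cost?

    `solver` maps a problem text to an answer; `problems` is a list of
    {"text": ..., "answer": ...} dicts. Both are assumed interfaces,
    not the paper's actual harness.
    """
    t0 = time.perf_counter()
    correct = sum(solver(p["text"]) == p["answer"] for p in problems)
    runtime = time.perf_counter() - t0
    accuracy = correct / len(problems)
    # Both conditions must hold: accuracy above the floor, runtime near baseline.
    return accuracy > accuracy_floor and runtime <= overhead_cap * baseline_runtime
```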
Original abstract
While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OptiVerse, a benchmark of 1,000 curated optimization problems spanning Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control across Easy/Medium/Hard difficulty levels. Experiments on 22 LLMs of varying sizes show sharp accuracy degradation on hard problems (top models such as GPT-5.2 and Gemini-3 below 27%), with error analysis attributing the primary bottleneck to modeling and logic errors; the authors propose a Dual-View Auditor Agent to improve LLM modeling accuracy with negligible time overhead.
Significance. If the curation, difficulty calibration, and evaluation protocol prove sound, OptiVerse would fill a documented gap in LLM benchmarks beyond mathematical programming and combinatorial optimization, supplying a reusable testbed and concrete error taxonomy that could guide targeted improvements in LLM reasoning for optimization. The Dual-View Auditor offers an immediately applicable, low-overhead intervention whose reported gains, if reproducible, would be a practical contribution.
major comments (4)
- [Benchmark construction] Benchmark construction section: the manuscript supplies no explicit criteria for problem sourcing, selection, or validation from the target domains, nor any description of how difficulty levels were assigned or calibrated (e.g., via expert review, pilot testing, or quantitative metrics). This is load-bearing for the central claim of performance degradation and error-type dominance, because the reported 27% ceiling and modeling-error attribution cannot be interpreted without evidence that the 1,000 problems are unbiased samples rather than artifacts of curation choices.
- [Experiments] Experiments and evaluation protocol: details are absent on prompt templates, answer extraction procedures, canonicalization rules (especially for stochastic, dynamic, or game problems with multiple valid outputs), and scoring criteria. Without these, the accuracy numbers and the attribution of errors to modeling/logic versus other categories cannot be verified and may reflect prompt sensitivity or grading conventions rather than intrinsic LLM limitations.
- [Error analysis] Error analysis: the process for labeling error types, including any inter-annotator agreement statistics or adjudication protocol, is not reported. This directly affects the claim that modeling & logic errors are the primary bottleneck and the motivation for the Dual-View Auditor.
- [Dual-View Auditor Agent] Dual-View Auditor Agent evaluation: the manuscript does not provide the exact experimental setup (which models were tested, how many problems, statistical significance of gains) or ablation results showing that the accuracy improvement is attributable to the auditor rather than additional prompting or compute. This is required to substantiate the claim of improvement without significant time overhead.
minor comments (2)
- [Abstract and results tables] Model names such as GPT-5.2 and Gemini-3 appear in the abstract and results; clarify whether these are real released models, internal versions, or placeholders, and provide exact version identifiers or API dates for reproducibility.
- [Results] The paper mentions 'full performance tables' in the abstract but the provided text does not include them; ensure all per-model, per-difficulty, and per-domain accuracy numbers are reported in the main text or appendix with confidence intervals or statistical tests.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments identify key areas requiring greater transparency to support the paper's claims. We address each major comment point by point below and will revise the manuscript to incorporate all requested details.
Point-by-point responses
Referee: Benchmark construction section: the manuscript supplies no explicit criteria for problem sourcing, selection, or validation from the target domains, nor any description of how difficulty levels were assigned or calibrated (e.g., via expert review, pilot testing, or quantitative metrics). This is load-bearing for the central claim of performance degradation and error-type dominance, because the reported 27% ceiling and modeling-error attribution cannot be interpreted without evidence that the 1,000 problems are unbiased samples rather than artifacts of curation choices.
Authors: We agree that explicit criteria are necessary to substantiate the benchmark's validity and the performance claims. In the revised manuscript, we will expand the Benchmark Construction section with: sourcing from established optimization textbooks, peer-reviewed papers, and public repositories; selection criteria focused on domain coverage, solvability, and diversity; validation via expert review by optimization researchers; and difficulty calibration using pilot testing with graduate students plus quantitative metrics (e.g., variable count, constraint complexity, stochasticity level). These additions will confirm the problems are representative rather than curated artifacts. revision: yes
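The quantitative metrics named in this response (variable count, constraint complexity, stochasticity) could feed a composite difficulty score that is then bucketed into the benchmark's three tiers. A toy sketch; the weights, caps, and cutoffs are all assumptions for illustration, not values from the paper:

```python
def difficulty_score(n_vars, n_constraints, nonlinear, stochastic, dynamic):
    """Composite difficulty metric (illustrative weights, not from the paper)."""
    score = 0.0
    score += min(n_vars / 10, 3.0)        # more variables -> harder, capped
    score += min(n_constraints / 5, 3.0)  # likewise for constraints
    score += 2.0 if nonlinear else 0.0
    score += 2.0 if stochastic else 0.0   # uncertainty adds modeling burden
    score += 2.0 if dynamic else 0.0      # time coupling adds state dimensions
    return score

def difficulty_label(score):
    """Map the score to Easy/Medium/Hard tiers (cutoffs assumed)."""
    return "Easy" if score < 3 else "Medium" if score < 6 else "Hard"
```

Scoring of this kind would let the authors report that tier assignments are reproducible from problem structure alone, rather than resting on subjective judgment.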
Referee: Experiments and evaluation protocol: details are absent on prompt templates, answer extraction procedures, canonicalization rules (especially for stochastic, dynamic, or game problems with multiple valid outputs), and scoring criteria. Without these, the accuracy numbers and the attribution of errors to modeling/logic versus other categories cannot be verified and may reflect prompt sensitivity or grading conventions rather than intrinsic LLM limitations.
Authors: We acknowledge the absence of these protocol details and their importance for reproducibility. The revised version will add a dedicated Evaluation Protocol subsection (with examples in the appendix) covering: full prompt templates; answer extraction via structured parsing and verification; domain-specific canonicalization (e.g., expected-value equivalence for stochastic outputs, payoff normalization for games); and scoring rules (exact match for deterministic cases, tolerance-based for others). This will allow verification that results reflect model limitations rather than methodological choices. revision: yes
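The scoring rules promised here (exact match for deterministic cases, tolerance-based for others, expected-value equivalence for stochastic outputs) can be made concrete in a few lines. A sketch under assumed tolerances; the actual thresholds would need to come from the revised protocol:

```python
import math

def score_answer(predicted, reference, rel_tol=1e-4, abs_tol=1e-6):
    """Exact match for symbolic answers, tolerance-based match for numeric optima."""
    if isinstance(reference, str):
        # Deterministic symbolic case: normalize whitespace and case, then compare.
        return str(predicted).strip().lower() == reference.strip().lower()
    return math.isclose(float(predicted), float(reference),
                        rel_tol=rel_tol, abs_tol=abs_tol)

def score_stochastic(sample_outcomes, reference_expectation, rel_tol=0.01):
    """Expected-value equivalence: accept if the sample mean is within tolerance."""
    mean = sum(sample_outcomes) / len(sample_outcomes)
    return math.isclose(mean, reference_expectation, rel_tol=rel_tol)
```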
Referee: Error analysis: the process for labeling error types, including any inter-annotator agreement statistics or adjudication protocol, is not reported. This directly affects the claim that modeling & logic errors are the primary bottleneck and the motivation for the Dual-View Auditor.
Authors: We will strengthen the Error Analysis section by detailing the labeling process: two authors independently categorized errors using a predefined taxonomy, with a third author adjudicating disagreements. We will also report inter-annotator agreement (Cohen's kappa) and the distribution of error types. These additions will provide quantitative support for modeling and logic errors as the dominant category and the rationale for the Dual-View Auditor. revision: yes
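Cohen's kappa, the agreement statistic promised in this response, is straightforward to compute from the two annotators' label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 supports the claimed reliability of the error taxonomy; values below roughly 0.6 would undercut the "modeling and logic errors dominate" conclusion.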
Referee: Dual-View Auditor Agent evaluation: the manuscript does not provide the exact experimental setup (which models were tested, how many problems, statistical significance of gains) or ablation results showing that the accuracy improvement is attributable to the auditor rather than additional prompting or compute. This is required to substantiate the claim of improvement without significant time overhead.
Authors: We agree that the current description lacks sufficient experimental rigor. In the revision, we will specify: the models tested (GPT-5.2, Gemini-3, and three open-source LLMs); the evaluation subset (300 hard problems); statistical significance via paired t-tests; and ablation results comparing the Dual-View Auditor to chain-of-thought baselines and compute-matched variants. This will demonstrate that gains stem from the auditor mechanism with negligible overhead. revision: yes
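The rebuttal names paired t-tests over per-problem correctness. A self-contained sketch of the statistic (for binary outcomes, McNemar's test is a common alternative); it assumes the two runs are not identical, so the difference variance is nonzero:

```python
import math

def paired_t(before, after):
    """Paired t statistic on per-problem scores (e.g. 0/1 correctness).

    Returns (t, degrees of freedom). Assumes nonzero variance in the
    per-problem differences.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1
```

With the 300-problem subset the response describes, the resulting t value would be compared against the t distribution with 299 degrees of freedom to report significance.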
Circularity Check
Empirical benchmark paper with no derivations or self-referential predictions
full rationale
The paper presents a new benchmark of 1,000 optimization problems across domains and difficulty levels, then reports direct empirical results from evaluating 22 LLMs on those problems. No equations, parameter fits, uniqueness theorems, or ansatzes are claimed; the performance degradation finding, error categorization, and Dual-View Auditor proposal all rest on the external experimental outcomes rather than reducing to any input by construction or self-citation chain. The work is self-contained against the benchmark it defines and the LLMs it tests.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: curated textual problem statements faithfully represent real optimization challenges in the listed domains.