OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
Pith reviewed 2026-05-09 22:01 UTC · model grok-4.3
The pith
The OptiVerse benchmark shows LLM accuracy dropping below 27 percent on hard optimization problems, with modeling and logic errors as the dominant failure mode.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OptiVerse supplies 1,000 curated problems across Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, each at Easy, Medium, or Hard difficulty. Evaluation of 22 LLMs shows rapid performance decline on Hard instances, where even GPT-5.2 and Gemini-3 stay below 27 percent accuracy. Error analysis isolates modeling and logic mistakes as the leading failure mode. The Dual-View Auditor Agent is introduced to audit the modeling stage from two complementary views, raising accuracy on these tasks without large added cost.
What carries the argument
Dual-View Auditor Agent, which examines an LLM's formulation of an optimization problem from two perspectives to catch and correct modeling and logic mistakes before solving proceeds.
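The review text does not spell out what the two views are. A minimal sketch, assuming one view checks the formulation against the problem narrative (forward) and the other checks the formulation's internal consistency (backward); the schema, function names, and keyword list are all hypothetical, not from the paper:

```python
import re
from dataclasses import dataclass

@dataclass
class Formulation:
    """Candidate model produced by an LLM (hypothetical schema)."""
    variables: set    # declared decision variables, e.g. {"x", "y"}
    constraints: list # textual constraints, e.g. ["x + y <= 10"]
    objective: str    # e.g. "3*x + 2*y"

def names(expr):
    """Identifiers appearing in an expression string."""
    return set(re.findall(r"[A-Za-z_]\w*", expr))

def forward_audit(problem_text, f):
    """View 1: key quantities in the narrative should surface in the model."""
    keywords = {"budget", "capacity", "demand"}  # illustrative, domain-specific
    mentioned = {k for k in keywords if k in problem_text.lower()}
    covered = set().union(*(names(c) for c in f.constraints), names(f.objective))
    return [f"'{k}' in text but not modeled" for k in mentioned - covered]

def backward_audit(f):
    """View 2: the model must be self-consistent -- no undeclared identifiers."""
    used = set().union(*(names(c) for c in f.constraints), names(f.objective))
    return [f"undeclared identifier '{n}'" for n in sorted(used - f.variables)]

def audit(problem_text, f):
    """Combine both views; a non-empty report would trigger re-modeling."""
    return forward_audit(problem_text, f) + backward_audit(f)
```

The key design point is that both views run before any solver is invoked, so a caught modeling error costs one extra LLM round-trip rather than a wasted solve.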
If this is right
- LLMs can be evaluated on optimization tasks that involve uncertainty and time dependence rather than only static math programs.
- Modeling and logic errors become the explicit target for future LLM improvements in constraint-heavy domains.
- The auditor agent supplies a lightweight add-on that raises success rates on existing models without retraining.
- Performance gaps on hard problems point to the need for better handling of dynamic constraints and stochastic elements.
Where Pith is reading between the lines
- Training data for future LLMs could be expanded with explicit examples of correct optimization modeling to shrink the identified error type.
- The same dual-view checking idea could be tested on other reasoning tasks that require precise formulation, such as scientific hypothesis building.
- Continued use of OptiVerse may reveal whether the accuracy ceiling rises with model scale or requires architectural changes.
Load-bearing premise
The 1,000 selected problems stand in for the full range of neglected optimization domains, and the scoring rules isolate modeling and logic skill without bias from wording or prompt choice.
What would settle it
Apply the Dual-View Auditor Agent to a fresh collection of hard problems drawn from the same domains and check whether accuracy rises above 27 percent on a majority of cases while runtime stays nearly unchanged.
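The settling criterion above can be phrased as an executable check. A sketch with assumed thresholds (27 percent accuracy floor, at most 10 percent runtime overhead) and a hypothetical solver interface:

```python
import time

def settles(solver, problems, baseline_runtime,
            accuracy_floor=0.27, overhead_cap=1.10):
    """Does the audited pipeline clear the accuracy floor at near-baseline cost?

    `solver` maps a problem text to an answer; `problems` is a list of
    {"text": ..., "answer": ...} dicts. Both are assumed interfaces,
    not the paper's actual harness.
    """
    t0 = time.perf_counter()
    correct = sum(solver(p["text"]) == p["answer"] for p in problems)
    runtime = time.perf_counter() - t0
    accuracy = correct / len(problems)
    # Both conditions must hold: accuracy above the floor, runtime near baseline.
    return accuracy > accuracy_floor and runtime <= overhead_cap * baseline_runtime
```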
Original abstract
While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OptiVerse, a benchmark of 1,000 curated optimization problems spanning Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control across Easy/Medium/Hard difficulty levels. Experiments on 22 LLMs of varying sizes show sharp accuracy degradation on hard problems (top models such as GPT-5.2 and Gemini-3 below 27%), with error analysis attributing the primary bottleneck to modeling and logic errors; the authors propose a Dual-View Auditor Agent to improve LLM modeling accuracy with negligible time overhead.
Significance. If the curation, difficulty calibration, and evaluation protocol prove sound, OptiVerse would fill a documented gap in LLM benchmarks beyond mathematical programming and combinatorial optimization, supplying a reusable testbed and concrete error taxonomy that could guide targeted improvements in LLM reasoning for optimization. The Dual-View Auditor offers an immediately applicable, low-overhead intervention whose reported gains, if reproducible, would be a practical contribution.
major comments (4)
- [Benchmark construction] Benchmark construction section: the manuscript supplies no explicit criteria for problem sourcing, selection, or validation from the target domains, nor any description of how difficulty levels were assigned or calibrated (e.g., via expert review, pilot testing, or quantitative metrics). This is load-bearing for the central claim of performance degradation and error-type dominance, because the reported 27% ceiling and modeling-error attribution cannot be interpreted without evidence that the 1,000 problems are unbiased samples rather than artifacts of curation choices.
- [Experiments] Experiments and evaluation protocol: details are absent on prompt templates, answer extraction procedures, canonicalization rules (especially for stochastic, dynamic, or game problems with multiple valid outputs), and scoring criteria. Without these, the accuracy numbers and the attribution of errors to modeling/logic versus other categories cannot be verified and may reflect prompt sensitivity or grading conventions rather than intrinsic LLM limitations.
- [Error analysis] Error analysis: the process for labeling error types, including any inter-annotator agreement statistics or adjudication protocol, is not reported. This directly affects the claim that modeling & logic errors are the primary bottleneck and the motivation for the Dual-View Auditor.
- [Dual-View Auditor Agent] Dual-View Auditor Agent evaluation: the manuscript does not provide the exact experimental setup (which models were tested, how many problems, statistical significance of gains) or ablation results showing that the accuracy improvement is attributable to the auditor rather than additional prompting or compute. This is required to substantiate the claim of improvement without significant time overhead.
minor comments (2)
- [Abstract and results tables] Model names such as GPT-5.2 and Gemini-3 appear in the abstract and results; clarify whether these are real released models, internal versions, or placeholders, and provide exact version identifiers or API dates for reproducibility.
- [Results] The paper mentions 'full performance tables' in the abstract but the provided text does not include them; ensure all per-model, per-difficulty, and per-domain accuracy numbers are reported in the main text or appendix with confidence intervals or statistical tests.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments identify key areas requiring greater transparency to support the paper's claims. We address each major comment point by point below and will revise the manuscript to incorporate all requested details.
Point-by-point responses
Referee: Benchmark construction section: the manuscript supplies no explicit criteria for problem sourcing, selection, or validation from the target domains, nor any description of how difficulty levels were assigned or calibrated (e.g., via expert review, pilot testing, or quantitative metrics). This is load-bearing for the central claim of performance degradation and error-type dominance, because the reported 27% ceiling and modeling-error attribution cannot be interpreted without evidence that the 1,000 problems are unbiased samples rather than artifacts of curation choices.
Authors: We agree that explicit criteria are necessary to substantiate the benchmark's validity and the performance claims. In the revised manuscript, we will expand the Benchmark Construction section with: sourcing from established optimization textbooks, peer-reviewed papers, and public repositories; selection criteria focused on domain coverage, solvability, and diversity; validation via expert review by optimization researchers; and difficulty calibration using pilot testing with graduate students plus quantitative metrics (e.g., variable count, constraint complexity, stochasticity level). These additions will confirm the problems are representative rather than curated artifacts. revision: yes
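The quantitative metrics named in this response (variable count, constraint complexity, stochasticity) could feed a composite difficulty score that is then bucketed into the benchmark's three tiers. A toy sketch; the weights, caps, and cutoffs are all assumptions for illustration, not values from the paper:

```python
def difficulty_score(n_vars, n_constraints, nonlinear, stochastic, dynamic):
    """Composite difficulty metric (illustrative weights, not from the paper)."""
    score = 0.0
    score += min(n_vars / 10, 3.0)        # more variables -> harder, capped
    score += min(n_constraints / 5, 3.0)  # likewise for constraints
    score += 2.0 if nonlinear else 0.0
    score += 2.0 if stochastic else 0.0   # uncertainty adds modeling burden
    score += 2.0 if dynamic else 0.0      # time coupling adds state dimensions
    return score

def difficulty_label(score):
    """Map the score to Easy/Medium/Hard tiers (cutoffs assumed)."""
    return "Easy" if score < 3 else "Medium" if score < 6 else "Hard"
```

Scoring of this kind would let the authors report that tier assignments are reproducible from problem structure alone, rather than resting on subjective judgment.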
Referee: Experiments and evaluation protocol: details are absent on prompt templates, answer extraction procedures, canonicalization rules (especially for stochastic, dynamic, or game problems with multiple valid outputs), and scoring criteria. Without these, the accuracy numbers and the attribution of errors to modeling/logic versus other categories cannot be verified and may reflect prompt sensitivity or grading conventions rather than intrinsic LLM limitations.
Authors: We acknowledge the absence of these protocol details and their importance for reproducibility. The revised version will add a dedicated Evaluation Protocol subsection (with examples in the appendix) covering: full prompt templates; answer extraction via structured parsing and verification; domain-specific canonicalization (e.g., expected-value equivalence for stochastic outputs, payoff normalization for games); and scoring rules (exact match for deterministic cases, tolerance-based for others). This will allow verification that results reflect model limitations rather than methodological choices. revision: yes
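The scoring rules promised here (exact match for deterministic cases, tolerance-based for others, expected-value equivalence for stochastic outputs) can be made concrete in a few lines. A sketch under assumed tolerances; the actual thresholds would need to come from the revised protocol:

```python
import math

def score_answer(predicted, reference, rel_tol=1e-4, abs_tol=1e-6):
    """Exact match for symbolic answers, tolerance-based match for numeric optima."""
    if isinstance(reference, str):
        # Deterministic symbolic case: normalize whitespace and case, then compare.
        return str(predicted).strip().lower() == reference.strip().lower()
    return math.isclose(float(predicted), float(reference),
                        rel_tol=rel_tol, abs_tol=abs_tol)

def score_stochastic(sample_outcomes, reference_expectation, rel_tol=0.01):
    """Expected-value equivalence: accept if the sample mean is within tolerance."""
    mean = sum(sample_outcomes) / len(sample_outcomes)
    return math.isclose(mean, reference_expectation, rel_tol=rel_tol)
```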
Referee: Error analysis: the process for labeling error types, including any inter-annotator agreement statistics or adjudication protocol, is not reported. This directly affects the claim that modeling & logic errors are the primary bottleneck and the motivation for the Dual-View Auditor.
Authors: We will strengthen the Error Analysis section by detailing the labeling process: two authors independently categorized errors using a predefined taxonomy, with a third author adjudicating disagreements. We will also report inter-annotator agreement (Cohen's kappa) and the distribution of error types. These additions will provide quantitative support for modeling and logic errors as the dominant category and the rationale for the Dual-View Auditor. revision: yes
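Cohen's kappa, the agreement statistic promised in this response, is straightforward to compute from the two annotators' label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 supports the claimed reliability of the error taxonomy; values below roughly 0.6 would undercut the "modeling and logic errors dominate" conclusion.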
Referee: Dual-View Auditor Agent evaluation: the manuscript does not provide the exact experimental setup (which models were tested, how many problems, statistical significance of gains) or ablation results showing that the accuracy improvement is attributable to the auditor rather than additional prompting or compute. This is required to substantiate the claim of improvement without significant time overhead.
Authors: We agree that the current description lacks sufficient experimental rigor. In the revision, we will specify: the models tested (GPT-5.2, Gemini-3, and three open-source LLMs); the evaluation subset (300 hard problems); statistical significance via paired t-tests; and ablation results comparing the Dual-View Auditor to chain-of-thought baselines and compute-matched variants. This will demonstrate that gains stem from the auditor mechanism with negligible overhead. revision: yes
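The rebuttal names paired t-tests over per-problem correctness. A self-contained sketch of the statistic (for binary outcomes, McNemar's test is a common alternative); it assumes the two runs are not identical, so the difference variance is nonzero:

```python
import math

def paired_t(before, after):
    """Paired t statistic on per-problem scores (e.g. 0/1 correctness).

    Returns (t, degrees of freedom). Assumes nonzero variance in the
    per-problem differences.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1
```

With the 300-problem subset the response describes, the resulting t value would be compared against the t distribution with 299 degrees of freedom to report significance.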
Circularity Check
Empirical benchmark paper with no derivations or self-referential predictions
full rationale
The paper presents a new benchmark of 1,000 optimization problems across domains and difficulty levels, then reports direct empirical results from evaluating 22 LLMs on those problems. No equations, parameter fits, uniqueness theorems, or ansatzes are claimed; the performance degradation finding, error categorization, and Dual-View Auditor proposal all rest on the external experimental outcomes rather than reducing to any input by construction or self-citation chain. The work is self-contained against the benchmark it defines and the LLMs it tests.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: curated textual problem statements faithfully represent real optimization challenges in the listed domains.