AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3
The pith
An 8B model post-trained via synthetic data and solver feedback matches larger models at turning natural language optimization problems into solver-ready forms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoOR shows that verified synthetic data generated from standard linear, mixed-integer, and non-linear optimization forms, paired with reinforcement learning that uses solver execution success as the sole reward signal, enables an 8B model to autoformalize natural-language optimization problems at state-of-the-art or competitive levels across six benchmarks; a curriculum RL variant further renders previously intractable non-linear physical-dynamics problems solvable from limited seed data.
What carries the argument
The AutoOR pipeline, which generates training examples from standard optimization templates and uses solver execution feedback as the reinforcement-learning reward to train the model to produce correct formalizations.
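The reward at the heart of this loop is simple to sketch as a binary execution check. The snippet below is an illustration assembled for this review, not the paper's implementation: the function name, the convention that a generated program stores its optimum in a variable called `objective`, and the toy brute-force "formalization" are all assumptions made here for concreteness.

```python
import math

def execution_reward(program: str, target_objective: float, tol: float = 1e-6) -> float:
    """Binary solver-execution reward (sketch): run the model-generated
    solver program and grant reward only if it runs cleanly and its
    reported objective matches the verified optimum from the template."""
    scope: dict = {}
    try:
        exec(program, scope)  # execute the generated formalization
    except Exception:
        return 0.0            # code that crashes earns no reward
    obj = scope.get("objective")  # illustrative convention for the reported optimum
    if not isinstance(obj, (int, float)) or not math.isfinite(obj):
        return 0.0            # no usable numeric objective reported
    return 1.0 if abs(obj - target_objective) <= tol else 0.0

# A toy "generated formalization": minimize 3x + 2y subject to x + y >= 4
# over small integers, solved by brute force purely for illustration.
good = "objective = min(3*x + 2*y for x in range(5) for y in range(5) if x + y >= 4)"
bad = "objective = undefined_variable"

print(execution_reward(good, 8.0))  # 1.0 (optimum x=0, y=4 gives 8)
print(execution_reward(bad, 8.0))   # 0.0
```

Because the signal comes from an external solver run rather than a learned critic, it is verifiable but sparse, which is what makes the curriculum question below matter.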
If this is right
- An 8B model becomes competitive with significantly larger models on linear and mixed-integer formalization tasks.
- Non-linear problems involving physical dynamics move from near-zero to usable accuracy through staged curriculum reinforcement learning.
- Industrial decision-making can be accelerated by replacing manual formalization steps with automated model output.
- Training data creation scales without requiring large amounts of human-annotated OR examples.
- The same post-training recipe applies across linear, mixed-integer, and selected non-linear problem classes.
Where Pith is reading between the lines
- The approach could be tested on end-to-end pipelines that take raw sensor or business data and emit both a formalization and a solved schedule.
- If the synthetic-to-real gap proves small, similar template-plus-execution loops might apply to formalizing problems in other domains such as chemical process design or financial planning.
- A practical next measurement would be accuracy on a corpus of actual company problem statements that have never been seen during training.
- Integration with existing solver interfaces might allow non-experts to describe a scheduling task in ordinary language and receive an immediately executable model.
Load-bearing premise
That data produced from clean standard optimization templates together with solver execution feedback will be sufficient to train models that still work when given the varied and often ambiguous wording found in actual industrial problem statements.
What would settle it
A test set of real industrial optimization problems described in natural language where the post-trained 8B model produces formalizations that solvers cannot execute correctly or that yield wrong objective values, while larger frontier models also fail on the same set.
read the original abstract
Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires specialized operations research (OR) expertise, making it hard to scale. We present AutoOR, a scalable synthetic data generation and reinforcement learning pipeline that trains LLMs to autoformalize optimization problems specified in natural language across linear, mixed-integer, and non-linear categories. AutoOR generates verified training data from standard optimization forms and uses solver execution feedback as the reward signal for RL post-training. AutoOR applied to an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks, matching significantly larger frontier models. For a non-linear problem class involving physical dynamics, where frontier models score near 0%, we introduce a curriculum RL strategy that bootstraps from limited initial training data to make this class tractable for post-training. We believe that methods such as AutoOR can significantly accelerate industrial decision-making with AI.
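The abstract's "verified training data from standard optimization forms" admits a compact sketch. Everything below (the single-constraint template, the prompt wording, the closed-form optimum) is invented for this review to illustrate the template-plus-verified-answer idea; the paper's actual generator is not shown in the reviewed text.

```python
import random

def make_lp_instance(seed: int) -> dict:
    """Instantiate a toy LP template with random coefficients, render a
    natural-language prompt, and keep the known optimum as the target
    that a solver-feedback reward would later check against."""
    rng = random.Random(seed)
    profit_a = rng.randint(2, 9)   # profit per unit of product A
    profit_b = rng.randint(2, 9)   # profit per unit of product B
    capacity = rng.randint(5, 20)  # shared production capacity
    prompt = (
        f"A workshop earns {profit_a} per unit of product A and {profit_b} per unit "
        f"of product B, and can make at most {capacity} units in total. "
        "How many units of each should it make to maximize profit?"
    )
    # For this one-constraint template the optimum is known in closed form:
    # put all capacity into the more profitable product. A known answer at
    # generation time is what makes the training pair verified by construction.
    return {"prompt": prompt, "objective": max(profit_a, profit_b) * capacity}

instance = make_lp_instance(seed=0)
print(instance["prompt"])
```

Seeding makes each instance reproducible, so the same template can be re-rendered and re-verified at any point in the pipeline.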
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AutoOR, a pipeline combining synthetic data generation from standard optimization forms with reinforcement learning that uses solver execution feedback as the reward signal. The method post-trains LLMs to translate natural language descriptions of linear, mixed-integer, and non-linear optimization problems into solver-ready formulations. It reports that an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks (matching much larger frontier models) and introduces a curriculum RL strategy that bootstraps limited data to make a non-linear physical-dynamics problem class tractable where frontier models score near zero.
Significance. If the empirical claims hold under rigorous scrutiny, the work offers a practical route to scalable autoformalization of OR problems, potentially reducing dependence on scarce OR expertise in industrial settings. The external solver feedback provides a verifiable, non-circular reward signal, and the curriculum approach for previously intractable non-linear classes is a concrete methodological advance. The demonstration that a modest 8B model can compete with frontier systems on established benchmarks underscores the efficiency of the synthetic-data-plus-RL recipe.
major comments (2)
- [§5] §5 (Experimental results): The central performance claims—that the 8B model reaches SOTA or competitive scores on six benchmarks and that the curriculum renders the non-linear class tractable—are presented without tables or text specifying benchmark definitions, exact evaluation metrics (e.g., formulation accuracy vs. solver success rate), number of test instances per benchmark, the precise frontier-model baselines and their scores, or any statistical significance tests. These omissions are load-bearing for the headline result.
- [§5.3] §5.3 (Curriculum RL subsection): The description of the curriculum strategy that bootstraps from limited initial data lacks ablation studies, intermediate performance curves, or controls that isolate the contribution of the curriculum versus simply scaling the synthetic data or RL steps. Without such evidence the claim that this strategy makes the non-linear class tractable remains under-supported.
minor comments (2)
- [Abstract] The abstract would be more informative if it reported at least one quantitative metric (e.g., average accuracy or pass rate) alongside the qualitative “state-of-the-art or competitive” phrasing.
- [§3] Notation for the reward function and the synthetic-data generation process could be made more explicit (e.g., by adding a short pseudocode block or a dedicated equation for the solver-feedback term).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will incorporate revisions to improve the clarity and rigor of the experimental sections.
read point-by-point responses
-
Referee: [§5] §5 (Experimental results): The central performance claims—that the 8B model reaches SOTA or competitive scores on six benchmarks and that the curriculum renders the non-linear class tractable—are presented without tables or text specifying benchmark definitions, exact evaluation metrics (e.g., formulation accuracy vs. solver success rate), number of test instances per benchmark, the precise frontier-model baselines and their scores, or any statistical significance tests. These omissions are load-bearing for the headline result.
Authors: We agree that the experimental results section requires additional explicit details to fully support the performance claims. In the revised manuscript we will add a summary table (and accompanying text) that defines each of the six benchmarks, states the precise evaluation metrics (formulation accuracy and solver success rate), reports the number of test instances per benchmark, lists the exact frontier-model baselines together with their scores, and includes statistical significance tests. These additions will make the headline results transparent and reproducible. revision: yes
-
Referee: [§5.3] §5.3 (Curriculum RL subsection): The description of the curriculum strategy that bootstraps from limited initial data lacks ablation studies, intermediate performance curves, or controls that isolate the contribution of the curriculum versus simply scaling the synthetic data or RL steps. Without such evidence the claim that this strategy makes the non-linear class tractable remains under-supported.
Authors: We acknowledge that stronger empirical validation of the curriculum RL strategy is needed. The revised version will include ablation studies comparing the curriculum approach against controls that scale synthetic data volume or RL steps without curriculum, as well as intermediate performance curves that illustrate the bootstrapping process on the non-linear physical-dynamics problems. These additions will isolate the curriculum's contribution and better substantiate the claim that it renders the class tractable. revision: yes
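For readers trying to picture what such an ablation would isolate, the staged bootstrapping can be written as a promotion loop: train on the easiest tier until a solver-verified pass rate clears a threshold, then advance. The promotion threshold, the skill counter, and both stub functions below are invented for illustration; the paper's actual schedule is not shown in the reviewed text.

```python
def run_curriculum(stages, train_step, eval_pass_rate, promote_at=0.5, max_steps=200):
    """Staged curriculum (sketch): apply RL updates on the current tier of
    problems until the solver-verified pass rate clears a threshold, then
    promote to the next, harder tier."""
    history = []
    for stage, problems in enumerate(stages):
        rate = 0.0
        for _ in range(max_steps):
            train_step(problems)             # one RL update on this tier
            rate = eval_pass_rate(problems)  # fraction passing the solver check
            if rate >= promote_at:
                break                        # tier cleared: promote
        history.append((stage, rate))
    return history

# Deterministic stub dynamics: each update adds one unit of "skill", and a
# tier's pass rate is skill relative to twice the tier's size.
skill = 0
def train_step(problems):
    global skill
    skill += 1
def eval_pass_rate(problems):
    return min(1.0, skill / (2 * len(problems)))

history = run_curriculum([[0], [0, 1]], train_step, eval_pass_rate)
print(history)  # [(0, 0.5), (1, 0.5)]
```

The control the referee asks for would run the same loop with a single flat tier (or more steps on it) and compare final pass rates, separating the effect of staging from the effect of sheer compute.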
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical pipeline that generates synthetic training data from standard optimization problem forms and applies RL using external solver execution feedback as the reward signal. No equations, derivations, or self-referential metrics are presented in the provided text that reduce predictions or results to fitted inputs or self-citations by construction. The central claims rest on benchmark performance comparisons rather than internal consistency loops or ansatz smuggling. This is a standard data-generation-plus-external-verifier setup with no load-bearing self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Solver execution feedback provides a reliable and scalable reward signal for improving LLM formalization accuracy across problem categories.
Reference graph
Works this paper leans on
- [1] Nicolás Astorga, Tennison Liu, Yuanzhang Xiao, and Mihaela van der Schaar. Autoformulation of mathematical optimization models using LLMs, 2025. URL https://arxiv.org/abs/2411.01679
- [2] Logan DR Beal, Daniel C Hill, R Abraham Martin, and John D Hedengren. GEKKO optimization suite. Processes, 6(8):106, 2018
- [3] Suresh Bolusani, Mathieu Besançon, Ksenia Bestuzheva, Antonia Chmiela, João Dionísio, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Mohammed Ghannam, Ambros Gleixner, et al. The SCIP optimization suite 9.0. arXiv preprint arXiv:2402.17702, 2024
- [4] Ziyang Xiao, Dongxiang Zhang, Yangjun Wu, Lilin Xu, Yuan Jessica Wang, Xiongwei Han, Xiaojin Fu, Tao Zhong, Jia Zeng, Mingli Song, and Gang Chen. Chain-of-Experts: When LLMs meet complex operations research problems. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=HobyL1B9CZ
- [5] Ali AhmadiTeshnizi, Wenzhi Gao, and Madeleine Udell. OptiMUS: Optimization modeling using MIP solvers and large language models, 2023. URL https://arxiv.org/abs/2310.06116
- [6] Bowen Zhang, Pengcheng Luo, Genke Yang, Boon-Hee Soong, and Chau Yuen. OR-LLM-Agent: Automating modeling and solving of operations research optimization problems with reasoning LLM, 2025. URL https://arxiv.org/abs/2503.10009
- [7] Caigao Jiang, Xiang Shu, Hong Qian, Xingyu Lu, Jun Zhou, Aimin Zhou, and Yang Yu. LLMOPT: Learning to define and solve general optimization problems from scratch.
- [8]
- [9] Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, and Jing Tang. OptiBench meets ReSocratic: Measure and improve LLMs for optimization modeling, 2025. URL https://arxiv.org/abs/2407.09887
- [10] Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, and Nianjun Zhou. Toward a trustworthy optimization modeling agent via verifiable synthetic data generation, 2025. URL https://arxiv.org/abs/2508.03117
- [11] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A customizable framework in training large models for automated optimization modeling. Operations Research, 73(6):2986–3009, November 2025. ISSN 1526-5463. doi: 10.1287/opre.2024.1233
- [12] Mahdi Mostajabdaveh, Timothy Tin Long Yu, Samarendra Chandan Bindu Dash, Rindra Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, and Yong Zhang. Evaluating LLM reasoning in the operations research domain with ORQA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24902–24910, 2025
- [13] Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. MathGenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of LLMs, 2024. URL https://arxiv.org/abs/2402.16352
- [14] Richard S Sutton. Verification, the key to AI. Online essay, 2001. Available: http://www.cs.ualberta.ca/sutton/IncIdeas/KeytoAI.html
- [15] Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, and Udaya Ghai. Mind the gap: Examining the self-improvement capabilities of large language models, 2025. URL https://arxiv.org/abs/2412.02674
- [16] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, and Chang Gao. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
- [18] Rindranirina Ramamonjison, Timothy Yu, Raymond Li, Haley Li, Giuseppe Carenini, Bissan Ghaddar, Shiqi He, Mahdi Mostajabdaveh, Amin Banitalebi-Dehkordi, Zirui Zhou, et al. NL4Opt competition: Formulating optimization problems based on their natural language descriptions. In NeurIPS 2022 competition track, pages 189–203. PMLR, 2023
- [19] Ramamonjison et al. Augmenting operations research with auto-formulation of optimization models from problem descriptions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 29–62, Abu Dhabi, UAE, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-industry.4
- [21] Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, and Benyou Wang. Mamo: A mathematical modeling benchmark with solvers. arXiv preprint arXiv:2405.13144, 2024
- [22] Ziyang Xiao, Jingrong Xie, Lilin Xu, Shisi Guan, Jingyan Zhu, Xiongwei Han, Xiaojin Fu, WingYin Yu, Han Wu, Wei Shi, Qingcan Kang, Jiahui Duan, Tao Zhong, Mingxuan Yuan, Jia Zeng, Yuan Wang, Gang Chen, and Dongxiang Zhang. A survey of optimization modeling meets LLMs: Progress and future directions, 2025. URL https://arxiv.org/abs/2508.10047
- [23] Lorenz T Biegler. New directions for nonlinear process optimization. Current Opinion in Chemical Engineering, 21:32–40, 2018
- [24] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, and Guangming Sheng. DAPO: An open-source LLM reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476
- [25] Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, and Charles London. h1: Bootstrapping LLMs to reason over longer horizons via reinforcement learning, 2025. URL https://arxiv.org/abs/2510.07312
- [26] Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. POPE: Learning to reason on hard problems via privileged on-policy exploration.
- [27]
- [28] Frederick Hillier and G Lieberman. Introduction to Operations Research with Student Access Card. McGraw-Hill Education, 2014
- [29] Dong Li, Xujiang Zhao, Linlin Yu, Yanchi Liu, Wei Cheng, Zhengzhang Chen, Zhong Chen, Feng Chen, Chen Zhao, and Haifeng Chen. SolverLLM: Leveraging test-time scaling for optimization problem via LLM-guided search, 2025. URL https://arxiv.org/abs/2510.16916
- [30] Zezhen Ding, Zhen Tan, Jiheng Zhang, and Tianlong Chen. OR-R1: Automating modeling and solving of operations research optimization problem via test-time reinforcement learning, 2025. URL https://arxiv.org/abs/2511.09092
- [31] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, and Alex Vaughan. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783
- [33] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization, 2024. URL https://arxiv.org/abs/2402.01306
- [34] Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, and Jian Cheng. STEP-Opt: Boosting optimization modeling in LLMs through iterative data synthesis and structured validation, 2025. URL https://arxiv.org/abs/2506.17637
- [35] Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, and Zaiwen Wen. OptMATH: A scalable bidirectional data synthesis framework for optimization modeling.
- [36]
- [37] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
- [38] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347
- [40] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z
- [41] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783
- [42] Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs, 2025. URL https://arxiv.org/abs/2506.14245
- [44]
- [45] Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, and Tao Gui. StepCoder: Improve code generation with reinforcement learning from compiler feedback, 2024. URL https://arxiv.org/abs/2402.01391
- [46] Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pages 1–3, 2025
- [47] ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801, 2025
- [48] Tapio Westerlund, Frank Pettersson, and Ignacio E Grossmann. Optimization of pump configurations as a MINLP problem. Computers & Chemical Engineering, 18(9):845–858, 1994
- [49] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
- [51] John Schulman and Thinking Machines Lab. LoRA without regret. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. URL https://thinkingmachines.ai/blog/lora/
- [53] Bo Tang, Elias B Khalil, and Ján Drgoňa. Learning to optimize for mixed-integer nonlinear programming with feasibility guarantees. arXiv preprint arXiv:2410.11061, 2024
- [54] Christodoulos A Floudas, Panos M Pardalos, Claire Adjiman, William R Esposito, Zeynep H Gümüs, Stephen T Harding, John L Klepeis, Clifford A Meyer, and Carl A Schweiger. Handbook of Test Problems in Local and Global Optimization, volume 33. Springer Science & Business Media, 2013
- [55] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2022. URL https://arxiv.org/abs/2210.03629
- [57] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793
- [58] Belinda Z. Li, Been Kim, and Zi Wang. QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?, 2025. URL https://arxiv.org/abs/2503.22674
- [59] Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi-turn conversation, 2025. URL https://arxiv.org/abs/2505.06120
- [60] Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, and Mingyi Hong. Reinforcing multi-turn reasoning in LLM agents via turn-level reward design, 2025. URL https://arxiv.org/abs/2505.11821
- [61] Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic reasoning and tool integration for LLMs via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.01441
- [62] Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs, 2025. URL https://arxiv.org/abs/2504.11536
- [63] Juan Pablo Vielma. Mixed integer linear programming formulation techniques. SIAM Review, 57(1):3–57, 2015
- [64] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314
- [65] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl
- [66] William Brown. Verifiers: Environments for LLM reinforcement learning. https://github.com/PrimeIntellect-ai/verifiers, 2025
- [67] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training, 2025. URL https://arxiv.org/abs/2501.17161
- [68] Jie Chen, Xintian Han, Yu Ma, Xun Zhou, and Liang Xiang. Unlock the correlation between supervised fine-tuning and reinforcement learning in training code large language models, 2024. URL https://arxiv.org/abs/2406.10305
- [69] Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I. Wang. SWE-RL: Advancing LLM reasoning via reinforcement learning on open software evolution.
- [70]
- [71] Cansu Sancaktar, David Zhang, Gabriel Synnaeve, and Taco Cohen. A deep dive into scaling RL for code generation with synthetic data and curricula, 2026. URL https://arxiv.org/abs/2603.24202
- [72] Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, and Alan Schelten. Compute as teacher: Turning inference compute into reference-free supervision, 2026. URL https://arxiv.org/abs/2509.14234
- [73] Not a citation in the extraction; the captured text is a fragment of the paper's Appendix A.2 (Training Details): "All experiments use Qwen3-8B (which is an instruction tuned model) [15] as the base model with LoRA adapters [41] and bfloat16 precision on NVIDIA A100 GPUs. Single-turn categories (LP, MILP, NLP) are trained with Dr. GRPO [34] via TRL [54]; adapters are merged between curr..."
discussion (0)